Unsupervised Subsequence Anomaly Detection in Large Sequences

Paul Boniol
Supervised by: Themis Palpanas, Mohammed Meftah, Emmanuel Remy
EDF R&D, LIPADE, Université de Paris
paul.boniol@edf.fr

Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020, Tokyo, Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Subsequence anomaly detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches that have been proposed so far in the literature have severe limitations: they either require prior domain knowledge that is used to design the anomaly discovery algorithms, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this Ph.D. work, we address these problems, and present unsupervised methods suitable for domain-agnostic subsequence anomaly detection. We explore two possible ways to represent the normal behavior of a long data series that lead to fast and accurate identification of abnormal subsequences. These normal representations are either based on subsequences (using a data structure called the normal model), or on graphs, taking advantage of graph properties to encode the normal transitions between neighboring subsequences of a long series. The experimental results, on a large set of synthetic and real datasets, demonstrate that the proposed approaches correctly identify single and recurrent anomalies of various types, without any prior knowledge of the characteristics of these anomalies. Our approaches outperform several competing approaches in accuracy by a large margin, while being up to orders of magnitude faster.

1. INTRODUCTION
Time series(1) anomaly detection is a crucial problem with applications in a wide range of domains. Examples of such applications can be found in manufacturing, astronomy, engineering, and other domains [10, 11]. This implies a real need by relevant applications for developing methods that can accurately and efficiently achieve this goal.

(1) A time series, or data series, or sequence, is an ordered sequence of real-valued points. If the dimension that imposes the ordering is time, then we talk about time series, but it could also be mass, angle, position, etc. We will use the terms time series, data series, and sequence interchangeably.

Anomaly detection is a well-studied task [2, 13, 16, 8] that can be tackled by either examining single values, or sequences of points. In the specific context of sequences, which is the focus of this paper, we are interested in identifying anomalous subsequences [16, 12, 8], which are not single abnormal values, but rather an abnormal sequence of values. In real-world applications, this distinction becomes crucial: in certain cases, even though every individual point may be normal, the trend exhibited by the sequence of these same values may be anomalous. Evidently, failing to identify such situations could lead to severe problems that are only detected when it is too late.

Some existing techniques explicitly look for a set of predetermined types of anomalies [1]. These techniques have been specifically designed to operate in a particular setting, require domain expertise, and cannot generalize. Other techniques identify as anomalies the subsequences with the largest distances to their nearest neighbors (termed discords) [16, 12]. The assumption is that the most distant subsequence is completely isolated from the "normal" subsequences.

However, this definition fails in the case where an anomaly repeats itself (approximately the same). In this situation, anomalies will have other anomalies as close neighbors, and will not be identified as discords. In order to remedy this situation, the m-th discord approach has been proposed [15], which takes into account the multiplicity m of the anomalous subsequences that are similar to one another, and marks as anomalies all the subsequences in the same group. However, this approach assumes that we know the cardinality of the anomalies, which is not true in practice (otherwise, we need to try several different m values, drastically increasing the execution time). Furthermore, the majority of the previous approaches require prior knowledge of the anomaly length, and their performance deteriorates significantly when the correct length value is not used.

Our work addresses the aforementioned problems. We propose two approaches that aim to build a normal representation of the data series, which enables the identification of anomalous subsequences. The two approaches use different data structures to represent the normal behavior of the sequences; they are summarized below:

• NormA [3, 4], a normal-model based subsequence anomaly detection method for large data series, in which the normal data structure is a set of subsequences, each paired with a weight. The higher those weights are, the more "normal" their paired subsequences are.

• Series2Graph [5, 6], an unsupervised graph-based approach for subsequence anomaly detection in large data series, in which the data structure is modeled as a directed graph. Nodes of this graph can be seen as subsequences of the data series.
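For illustration only, the following minimal sketch (hypothetical Python containers; the names and fields are ours, not taken from the NormA or Series2Graph implementations) shows the shape of the two data structures: a set of weighted subsequences, and a directed graph with weighted edges.

```python
# Minimal illustrative containers (hypothetical names, not the authors' code).
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class NormalModel:
    """NormA-style normal model: recurring subsequences paired with weights."""
    subsequences: List[np.ndarray]  # each of length l_NM
    weights: List[float]            # higher weight = more "normal" behavior


@dataclass
class TransitionGraph:
    """Series2Graph-style directed graph over abstract subsequence states."""
    nodes: List[int]
    # (u, v) -> number of times state v immediately followed state u in T
    edge_weights: Dict[Tuple[int, int], int] = field(default_factory=dict)
```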
2. NORMAL BEHAVIOR
In this section, we describe our proposed approaches to the aforementioned problems. We first describe the notions and elements used in our solutions.

We begin by introducing some formal notations useful for the rest of the paper. A data series T ∈ R^n is a sequence of real-valued numbers T_i ∈ R, T = [T_1, T_2, ..., T_n], where n = |T| is the length of T, and T_i is the i-th point of T. We are typically interested in local regions of the data series, known as subsequences. A subsequence T_{i,ℓ} ∈ R^ℓ of a data series T is a continuous subset of the values of T of length ℓ starting at position i. Formally, T_{i,ℓ} = [T_i, T_{i+1}, ..., T_{i+ℓ-1}]. Given two sequences, A and B, of the same length ℓ, we can calculate their Z-normalized Euclidean distance, dist, as follows:

dist(A, B) = \sqrt{\sum_{i=1}^{\ell} \left( \frac{A_i - \mu_A}{\sigma_A} - \frac{B_i - \mu_B}{\sigma_B} \right)^2},

where μ and σ represent respectively the mean and standard deviation of the sequences. For the remainder of this paper, we will simply use the term distance.

Given a subsequence T_{i,ℓ}, we say that its m-th Nearest Neighbor (m-th NN) is T_{j,ℓ} if T_{j,ℓ} has the m-th shortest distance to T_{i,ℓ} among all the subsequences of length ℓ in T, excluding trivial matches; a trivial match of T_{i,ℓ} is a subsequence T_{a,ℓ}, where |i - a| < ℓ/2 (i.e., the two subsequences overlap by more than half their length). Since we are interested in subsequence anomalies, we first define the set of all subsequences of length ℓ in a given data series T: T_ℓ = {T_{i,ℓ} | ∀i. 0 ≤ i ≤ |T| - ℓ + 1}. In general, we assume that T_ℓ contains both normal and anomalous subsequences.
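To make the above notation concrete, a minimal Python sketch of the Z-normalized Euclidean distance and of the subsequence set T_ℓ could look as follows (illustrative code; the function names are ours).

```python
import numpy as np


def z_normalize(x: np.ndarray) -> np.ndarray:
    """Subtract the mean and divide by the standard deviation."""
    return (x - np.mean(x)) / np.std(x)


def dist(a: np.ndarray, b: np.ndarray) -> float:
    """Z-normalized Euclidean distance between two equal-length sequences."""
    return float(np.linalg.norm(z_normalize(a) - z_normalize(b)))


def all_subsequences(t: np.ndarray, ell: int) -> list:
    """The set T_ell of all contiguous subsequences of length ell in T."""
    return [t[i:i + ell] for i in range(len(t) - ell + 1)]
```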
We then need a way to characterize normal behavior. For the purpose of this paper, we abstractly define the normal behavior of the data series as follows:

Definition 1 (Normal Behavior, NB). Given a data series T, NB is a model that represents the normal (i.e., not anomalous) trends and patterns in T.

The above definition is not precise on purpose: it allows several interpretations, which can lead to different kinds of models. Nevertheless, subsequence anomalies can then be defined in a uniform way: anomalies are the subsequences that have the largest distances to the expected, normal behavior NB (or whose distance is above a set threshold). Both of the proposed approaches define their own data structure to represent NB, as well as their own distance measure.

2.1 The NormA Approach
We describe a new approach for subsequence anomaly detection. We propose a formalization for NB, called the Normal Model, denoted NM, and defined as follows [3]: NM is a set of sequences, NM = {(NM^0, w_0), (NM^1, w_1), ..., (NM^n, w_n)}, where NM^i is a subsequence of length ℓ_NM (the same for all NM^i) that corresponds to a recurring behavior in the data series T, and w_i is its normality score (as we explain later, the higher this score is, the more usual the behavior represented by NM^i is). In other words, this model averages (with proper weights) the different recurrent behaviors observed in the data, such that the normal behaviors of the data series will be represented in the normal model, while unusual behaviors will not (or will have a very low weight).

Figure 1: Illustration of the two subsequence anomaly definitions of the proposed approaches: (a) NormA, where d is defined as the distance to the Normal Model NM = {(NM^0, w_0), ..., (NM^n, w_n)}, and B is the weighted barycenter of NM; (b) Series2Graph, where the data series is represented as the graph G_ℓ, and the path to reach the red dots has smaller weight than those to reach the blue dots.

Figure 1(a) is an illustration of a Normal Model. As depicted, the Normal Model NM is a weighted combination of a set of subsequences (points within the dotted circles). The combination of these subsequences and their related weights returns distances d_i, d_j, d_k that are high enough to be differentiated from the normal points/subsequences. These distances can be seen as the distance between subsequences and a weighted barycenter B (in green) that represents NM. Note that we do not actually compute this barycenter; we illustrate it in Figure 1(a) for visualization purposes.

We choose ℓ_NM > ℓ in order to make sure that we do not miss useful subsequences, i.e., subsequences with a large overlap with an anomalous subsequence. For instance, for a given subsequence of length ℓ, a normal model of length ℓ_NM = 2ℓ will also contain the subsequences overlapping with the first and last half of the anomalous subsequence. We can now define the abnormal subsequences as follows [3].

Definition 2 (Subsequence Anomaly). Assume a data series T, the set T_ℓ of all its subsequences of length ℓ, and the Normal Model NM of T. Then, the subsequence T_{j,ℓ} ∈ T_ℓ with anomaly score, i.e., distance to NM,

d_j = \sum_{i} w_i \cdot \min_{x \in [0, \ell_{NM} - \ell]} dist(T_{j,\ell}, (NM^i)_{x,\ell}),

is an anomaly if d_j is in the Top-k largest distances among all subsequences in T_ℓ, or d_j > ε, where ε ∈ R_{>0} is a threshold.

Note that the only essential input parameter is the length ℓ of the anomaly (which is also one of the inputs in all relevant algorithms in the literature [12, 16, 7, 9, 8]). The parameter k (or ε) is not essential, as long as the algorithm can rank the anomalies. We stress that in practice, experts start by examining the most anomalous pattern, and then move down in the ranked list, since there is (in general) no rigid threshold separating anomalous from non-anomalous behavior [2].

As we mentioned above and will detail later on, we choose to define NM as a set of sequences that summarizes normality in T, by representing the average behavior of a set of normal sequences. Intuitively, NM is the set of data series that tries to minimize the sum of Z-normalized Euclidean distances between itself and some of the subsequences in T. Last but not least, we need to compute NM in an unsupervised way, i.e., without having normal/abnormal labels for the subsequences in T_ℓ.

Observe that this definition of NM implies the following challenge: even though NM summarizes the normal behavior only, it needs to be computed based on T, which may include (several) anomalies. We address this challenge by taking advantage of the fact that anomalies are a minority class.
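A direct, simplified reading of Definition 2 is sketched below (our own illustration, not the NormA implementation): the score of a query subsequence is the weighted sum, over the normal-model subsequences, of its minimum distance to any of their length-ℓ windows; trivial-match handling is omitted for brevity.

```python
import numpy as np


def z_dist(a: np.ndarray, b: np.ndarray) -> float:
    """Z-normalized Euclidean distance (same as dist in the previous sketch)."""
    zn = lambda x: (x - x.mean()) / x.std()
    return float(np.linalg.norm(zn(a) - zn(b)))


def anomaly_score(query: np.ndarray, normal_model: list) -> float:
    """d_j of Definition 2: weighted sum, over the normal-model subsequences,
    of the minimum distance between the query and their length-ell windows.

    `normal_model` is a list of (nm_subsequence, weight) pairs, each
    nm_subsequence of length l_NM >= len(query).
    """
    ell = len(query)
    score = 0.0
    for nm, weight in normal_model:
        windows = (nm[x:x + ell] for x in range(len(nm) - ell + 1))
        score += weight * min(z_dist(query, w) for w in windows)
    return score


def top_k_anomalies(t: np.ndarray, normal_model: list, ell: int, k: int) -> list:
    """Rank the k subsequences of T with the largest distances to NM
    (overlapping/trivial matches are not filtered in this sketch)."""
    scores = [(i, anomaly_score(t[i:i + ell], normal_model))
              for i in range(len(t) - ell + 1)]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:k]
```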
2.1.1 NormA Framework
We now briefly describe NormA [3], our unsupervised solution to the subsequence anomaly detection problem.

[Computing the Normal Model] Recall that NM should capture (summarize) the normal behavior of the data. This may not be very hard to do for a sequence T that does not contain any anomalous subsequences. In practice though, we want to apply the NormA approach in an unsupervised way on any sequence, which may contain several anomalies. We compute NM in three steps. First, we extract the subsequences that can serve as candidates for building NM. These candidates are either randomly selected from T (NormA-smpl), or correspond to motifs. Then, we group these subsequences according to their similarity into a set of clusters C (we use hierarchical clustering and Minimum Description Length to identify the right number of clusters). The last step consists of scoring each cluster, and selecting the cluster that best represents normal behavior. Formally, for a given cluster c ∈ C, we set a weight w_c = Norm(c, C) with the following formula:

Norm(c, C) = \frac{Frequency(c)^2 \times Coverage(c)}{\sum_{x \in C} dist(Center(c), Center(x))},

where Frequency(c) is the number of subsequences in c, and Coverage(c) is the time interval between the first and the last occurrence of a subsequence in c. We thus set NM = {(NM^0, w_0), (NM^1, w_1), ..., (NM^n, w_n)}, with NM^i the centroid of the i-th cluster in C.

[Normal Model Based Anomaly Detection] Intuitively, the anomalous subsequences of the long series T are the ones that are far away from NM. We thus compute the sequence S = [d_0, d_1, ..., d_{|T|-ℓ_NM}], with d_j the distance of the subsequence T_{j,ℓ} to the Normal Model NM, as defined in Definition 2. These distances correspond to the degree of abnormality: the larger the distance to NM is, the more abnormal the subsequence is. We then extract the k subsequences of length ℓ with the highest distances to NM, and rank them according to their distances. Alternatively, we can extract all subsequences with distance larger than a threshold.
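The cluster-scoring formula above can be read as in the following sketch (illustrative; the Cluster container is hypothetical): populous clusters that cover a long time span and whose centers lie close to the other cluster centers receive high weights, while small, isolated clusters (likely anomalies) receive low weights.

```python
import numpy as np
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    """Hypothetical container for a cluster of candidate subsequences."""
    center: np.ndarray         # centroid subsequence (length l_NM)
    members: List[np.ndarray]  # subsequences assigned to this cluster
    positions: List[int]       # starting offsets of those subsequences in T


def z_dist(a: np.ndarray, b: np.ndarray) -> float:
    zn = lambda x: (x - x.mean()) / x.std()
    return float(np.linalg.norm(zn(a) - zn(b)))


def cluster_weight(c: Cluster, clusters: List[Cluster]) -> float:
    """Norm(c, C): Frequency(c)^2 * Coverage(c), divided by the sum of the
    distances between the center of c and the other cluster centers."""
    frequency = len(c.members)
    coverage = max(c.positions) - min(c.positions)  # first-to-last occurrence span
    spread = sum(z_dist(c.center, x.center) for x in clusters if x is not c)
    return (frequency ** 2) * coverage / spread if spread > 0 else 0.0
```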
2.2 The Series2Graph Approach
We now present an alternative data structure to represent the normal behavior of the subsequences of the data series. The previous normal model was a set of subsequences that aimed to store both normal and abnormal subsequences in the same set, associated with weights that rank them based on their normality. One can argue that ordering information is missing from this data structure representation. We thus formulate an approach for subsequence anomaly detection based on the representation of the data series as a graph, in which edges encode the ordering information [5]. Figure 1(b) illustrates this graph data structure.

We first define several basic elements related to graphs. We define a Node Set N as a set of unique integers. Given a Node Set N, an Edge Set E is then a set composed of tuples (x_i, x_j), where x_i, x_j ∈ N; w(x_i, x_j) is the weight of that edge. Given a Node Set N and an Edge Set E (pairs of nodes in N), a graph G is an ordered pair G = (N, E). A directed graph, or digraph, G is an ordered pair G = (N, E) where N is a Node Set, and E is an ordered Edge Set.

We now provide a new formulation for subsequence anomaly detection. The idea is that a data series is transformed into a sequence of abstract states (corresponding to different subsequence patterns), represented by nodes N in a directed graph G(N, E), where the edges E encode the number of times one state occurred after another. Under this formulation, paths in the graph composed of high-weight edges and high-degree nodes correspond to normal behavior. Then, the normality of a data series is defined as follows [5].

Definition 3 (θ-Normality). Let a node set be defined as N = {N_1, N_2, ..., N_m}. Let also a data series T be represented as a sequence of nodes ⟨N^(1), N^(2), ..., N^(n)⟩ with ∀i ∈ [0, n], N^(i) ∈ N and m ≤ n. The θ-Normality of T is the subgraph G_ν^θ(N_ν, E_ν) of G(N, E) with E = {(N^(i), N^(i+1))}_{i ∈ [0, n-1]}, such that N_ν ⊂ N and ∀(N^(i), N^(i+1)) ∈ E_ν, w((N^(i), N^(i+1))) · (deg(N^(i)) - 1) ≥ θ.

Using θ-Normality subgraphs naturally leads to a ranking of subsequences based on their "normality". For practical reasons, this ranking can be transformed into a score, where each rank can be seen as a threshold in that score. We used such a score in GraphAn to detect abnormal subsequences.

Note that, given the existence of graph G, the above definitions imply a way of identifying the anomalous subsequences. The problem is now how to construct this graph.

2.2.1 Series2Graph Framework
We now briefly describe Series2Graph [5, 6], our unsupervised solution to the subsequence anomaly detection problem. For a given data series T, the overall Series2Graph process is divided into four main steps as follows:

[Subsequence Embedding] We project all the subsequences (of a given length ℓ) of T into a three-dimensional space that corresponds to the three most important components of Principal Component Analysis (PCA). This space is subsequently transformed into a two-dimensional space, composed of two components r_y, r_z corresponding to two orthogonal vectors of v_ref, an axis composed of every flat sequence (a · 1_{ℓ-λ} with a ∈ R). In the latter space, shape similarity is preserved [5].

[Node Creation] We then create a node for each one of the densest parts of the above two-dimensional space. The space is first discretized by a set of radii I_ψ of angle ψ. We then estimate the density of subsequences along each one of these radii using Gaussian kernels. The maximal values of the estimated densities are assigned to nodes and form our Node Set N. These nodes summarize all major patterns of length ℓ that occurred in T [5].

[Edge Creation] We then retrieve all transitions between pairs of subsequences represented by two different nodes: each transition corresponds to a pair of subsequences, where one occurs immediately after the other in the input data series T. We represent transitions with an edge between the corresponding nodes, and we thus form the Edge Set E. The edge weights are set to the number of times the corresponding pair of subsequences was observed in T. Finally, we build our graph G_ℓ(N, E).

[Subsequence Scoring] We compute the normality (or anomaly) score of a subsequence of length ℓ_q ≥ ℓ (within or outside of T), based on the previously computed edges/nodes and their weights/degrees. Formally, for a subsequence T_{i,ℓ} of T, represented by a path Pth = ⟨N^(i), N^(i+1), ..., N^(i+ℓ_q)⟩ in G_ℓ(N, E), the normality score is defined as:

Norm(Pth) = \frac{\sum_{j=i}^{i+\ell_q-1} w(N^{(j)}, N^{(j+1)}) \cdot (deg(N^{(j)}) - 1)}{\ell_q},

where w(e) and deg(n) are the weight of edge e and the degree of node n, respectively.
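Given the graph built in the previous steps, the scoring formula can be computed as in the sketch below (our own illustration over a plain dictionary-based graph; edge weights and node degrees are assumed to have been extracted already).

```python
from typing import Dict, List, Tuple


def path_normality(path: List[int],
                   edge_weight: Dict[Tuple[int, int], int],
                   degree: Dict[int, int]) -> float:
    """Norm(Pth): average of w(N_j, N_j+1) * (deg(N_j) - 1) over the edges of
    the path crossed by a subsequence; low values point to rarely taken,
    likely anomalous transitions."""
    n_edges = len(path) - 1  # a path over l_q + 1 nodes has l_q edges
    total = sum(edge_weight.get((path[j], path[j + 1]), 0) * (degree[path[j]] - 1)
                for j in range(n_edges))
    return total / n_edges
```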
2.3 Experimental Analysis
In our preliminary experimental evaluations [3, 5], we used 21 synthetic and real datasets from diverse domains, and measured Precision@k (i.e., accuracy in the Top-k answers) and execution time. (Due to lack of space, we only report here aggregated results.)

Figure 2: Comparison of six state-of-the-art methods to Series2Graph and NormA: (a) Precision@k with a critical diagram over 21 datasets; (b) execution time vs. varying number of points; (c) execution time vs. varying anomaly length.

After rejecting the null hypothesis using the Friedman test, we used pairwise Post-Hoc Analysis with a Wilcoxon signed-rank test (α = 0.05) [14], and we depict the results in Figure 2(a). The results show that the NormA and Series2Graph approaches are the overall winners, performing statistically significantly better than the state-of-the-art algorithms from both the data series and the multidimensional data communities. (Even though Series2Graph seems to be slightly better than both variants of NormA, this difference is not statistically significant.)

Moreover, Figures 2(b) and (c) demonstrate that NormA-smpl and Series2Graph are considerably faster than the other methods (note the log-scale y-axis). In particular, NormA-smpl is between 1 and 3 orders of magnitude faster than the rest, as we increase the length of the input sequence, or the length of the anomalies.

We note that further experiments are needed in order to compare our proposed methods in detail.
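For reference, Precision@k over labeled data can be computed as in this sketch (a hypothetical evaluation helper, not necessarily the exact protocol used in [3, 5]): a top-k detection counts as a hit if it overlaps one of the labeled anomaly intervals.

```python
from typing import List, Tuple


def precision_at_k(ranked_positions: List[int],
                   anomaly_intervals: List[Tuple[int, int]],
                   ell: int, k: int) -> float:
    """Fraction of the top-k detected subsequences overlapping a labeled anomaly."""
    hits = 0
    for pos in ranked_positions[:k]:
        if any(pos < end and pos + ell > start for start, end in anomaly_intervals):
            hits += 1
    return hits / k
```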
3. CONCLUSIONS AND FUTURE WORK
Our studies on NormA and Series2Graph, described above, constitute a first attempt to formalize the problem of Normal Behavior representation in the context of data series. The two proposed approaches describe data structures based on subsequence sets and graphs, respectively. The experimental results show that both NormA and Series2Graph outperform the current state-of-the-art methods over a large set of diverse datasets. This indicates that the proposed approach of using a Normal Behavior model for the subsequence anomaly detection problem is very promising. In our current work, we focus on the following directions.

[Normal Behavior Model] We will study in depth different models for Normal Behavior, and perform analytical and experimental comparisons among them. The goal will be to understand the theoretical properties, as well as the practical limitations, of each one, so as to make informed decisions about the best model to use.

[Multivariate Data Series] Given that many domains monitor, process, and analyze multivariate data series, we will work on algorithms that inherently consider together all variables of the series. The challenging problems are how to create a multivariate Normal Behavior model and corresponding data structure, and how to define and formalize suitable distance measures for this case.

[Online Operation] In several situations, subsequence anomaly detection applications need to operate in an online fashion. In this context, we will develop extensions of our methods for real-time applications. We will analyze the impact of changes in data characteristics on the Normal Behavior models, and explore solutions for updating the Normal Behavior data structures in real time.

Acknowledgments
Work partially supported by EDF R&D and the ANRT French program.

References
[1] D. Abboud, M. Elbadaoui, W. Smith, and R. Randall. Advanced bearing diagnostics: A comparative study of two powerful approaches. MSSP, 114, 2019.
[2] V. Barnet and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, Inc., 1994.
[3] P. Boniol, M. Linardi, F. Roncallo, and T. Palpanas. Automated anomaly detection in large sequences. In ICDE, 2020.
[4] P. Boniol, M. Linardi, F. Roncallo, and T. Palpanas. SAD: An unsupervised system for subsequence anomaly detection. In ICDE, 2020.
[5] P. Boniol and T. Palpanas. Series2Graph: Graph-based subsequence anomaly detection for time series. PVLDB, 13(11), 2020.
[6] P. Boniol, T. Palpanas, M. Meftah, and E. Remy. GraphAn: Graph-based subsequence anomaly detection. PVLDB, 13(11), 2020.
[7] Y. Bu, O. T. Leung, A. W. Fu, E. J. Keogh, J. Pei, and S. Meshkin. WAT: Finding top-k discords in time series database. In SIAM, 2007.
[8] M. Linardi, Y. Zhu, T. Palpanas, and E. J. Keogh. Matrix Profile Goes MAD: Variable-length motif and discord discovery in data series. DAMI, 2020.
[9] W. Luo and M. Gallagher. Faster and parameter-free discord search in quasi-periodic time series. In Advances in Knowledge Discovery and Data Mining, 2011.
[10] T. Palpanas. Data series management: The road to big sequence analytics. SIGMOD Record, 44(2):47-52, 2015.
[11] T. Palpanas and V. Beckmann. Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA). SIGMOD Record, 48(3), 2019.
[12] P. Senin, J. Lin, X. Wang, T. Oates, S. Gandhi, A. P. Boedihardjo, C. Chen, and S. Frankenstein. Time series anomaly discovery with grammar-based compression. In EDBT, 2015.
[13] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Online outlier detection in sensor data using non-parametric models. In VLDB, pages 187-198, 2006.
[14] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80-83, 1945.
[15] D. Yankov, E. J. Keogh, and U. Rebbapragada. Disk aware discord discovery: Finding unusual time series in terabyte sized datasets. Knowl. Inf. Syst., 17(2):241-262, 2008.
[16] C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, and E. J. Keogh. Matrix Profile I: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. In ICDM, pages 1317-1322, 2016.