<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised Subsequence Anomaly Detection in Large Sequences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Boniol Supervised by: Themis Palpanas</string-name>
          <email>paul.boniol@edf.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Meftah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Remy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EDF R&amp;D, LIPADE, Université de Paris</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Subsequence anomaly detection in long sequences is an important problem with applications in a wide range of domains. However, the approaches that have been proposed so far in the literature have severe limitations: they either require prior domain knowledge that is used to design the anomaly discovery algorithms, or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In this Ph.D. work, we address these problems, and present unsupervised methods suitable for domain-agnostic subsequence anomaly detection. We explore two possible ways to represent the normal behavior of a long data series that lead to fast and accurate identification of abnormal subsequences. These normal representations are either based on subsequences (using a data structure called the normal model), or on graphs, by taking advantage of graph properties to encode the normal transitions between neighboring subsequences of a long series. The experimental results, on a large set of synthetic and real datasets, demonstrate that the proposed approaches correctly identify single and recurrent anomalies of various types, without any prior knowledge of the characteristics of these anomalies. Our approaches outperform by a large margin several competing approaches in accuracy, while being up to orders of magnitude faster.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Time series1 anomaly detection is a crucial problem with
applications in a wide range of domains. Examples of such
applications can be found in manufacturing, astronomy,
engineering, and other domains [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. This implies a real
need by relevant applications for developing methods that
can accurately and efficiently achieve this goal.
1A time series, or data series, or sequence, is an ordered
sequence of real-valued points. If the dimension that imposes
the ordering is time then we talk about time series, but
it could also be mass, angle, position, etc. We will use the
terms time series, data series, and sequence interchangeably.
Proceedings of the VLDB 2020 PhD Workshop, August 31st, 2020. Tokyo,
Japan. Copyright (C) 2020 for this paper by its authors. Copying permitted
for private and academic purposes.
      </p>
      <p>
        Anomaly detection is a well studied task [
        <xref ref-type="bibr" rid="ref13 ref16 ref2 ref8">2, 13, 16, 8</xref>
        ]
that can be tackled by either examining single values, or
sequences of points. In the specific context of sequences,
which is the focus of this paper, we are interested in
identifying anomalous subsequences [
        <xref ref-type="bibr" rid="ref12 ref16 ref8">16, 12, 8</xref>
        ], which are not
single abnormal values, but rather an abnormal sequence of
values. In real-world applications, this distinction becomes
crucial: in certain cases, even though every individual point
may be normal, the trend exhibited by the sequence of these
same values may be anomalous. Evidently, failing to
identify such situations could lead to severe problems that are
only detected when it is too late.
      </p>
      <p>
        Some existing techniques explicitly look for a set of
predetermined types of anomalies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These are techniques that
have been specifically designed to operate in a particular
setting; they require domain expertise, and cannot
generalize. Other techniques identify as anomalies the subsequences
with the largest distances to their nearest neighbors (termed
discords) [
        <xref ref-type="bibr" rid="ref12 ref16">16, 12</xref>
        ]. The assumption is that the most distant
subsequence is completely isolated from the "normal"
subsequences.
      </p>
      <p>
        However, this definition fails in the case where an anomaly
repeats itself (approximately the same). In this situation,
anomalies will have other anomalies as close neighbors, and
will not be identified as discords. In order to remedy this
situation, the m-th discord approach has been proposed [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
which takes into account the multiplicity m of the anomalous
subsequences that are similar to one another, and marks as
anomalies all the subsequences in the same group. However,
this approach assumes that we know the cardinality of the
anomalies, which is not true in practice (otherwise, we need
to try several different m values, drastically increasing the
execution time). Furthermore, the majority of the previous
approaches require prior knowledge of the anomaly length,
and their performance deteriorates significantly when the
correct length value is not used.
      </p>
      <p>Our work addresses the aforementioned problems. We
propose two approaches that build a normal
representation of the data series, which enables the identification
of anomalous subsequences. The two approaches use different data
structures to represent the normal behavior of the sequences; they
are summarized below:</p>
      <p>
        NormA [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], a normal-model based subsequence
anomaly detection method in large data series, for
which the normal data structure is a set of
subsequences paired with a weight. The higher those
weights are, the more "normal" their paired
subsequences are.
      </p>
      <p>
        Series2Graph [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], an unsupervised graph-based
approach for subsequence anomaly detection in large
data series, for which the data structure is modeled
as a directed graph. Nodes of this graph can be seen
as subsequences of the data series.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. NORMAL BEHAVIOR</title>
      <p>In this section we describe our proposed approaches to the
aforementioned problems. In the next part we describe the
notions and the elements used in our solutions.</p>
      <p>We begin by introducing some formal notations useful for
the rest of the paper. A data series T ∈ R^n is a sequence
of real-valued numbers T_i ∈ R, T = [T_1, T_2, ..., T_n], where n = |T|
is the length of T, and T_i is the i-th point of T. We are
typically interested in local regions of the data series, known
as subsequences. A subsequence T_{i,ℓ} ∈ R^ℓ of a data series
T is a continuous subset of the values of T of length ℓ
starting at position i. Formally, T_{i,ℓ} = [T_i, T_{i+1}, ..., T_{i+ℓ−1}].
Given two sequences, A and B, of the same length ℓ, we
can calculate their Z-normalized Euclidean distance, dist, as follows:
dist(A, B) = sqrt( Σ_{i=1}^{ℓ} ((A_i − μ_A)/σ_A − (B_i − μ_B)/σ_B)^2 ),
where μ and σ represent respectively the mean and standard deviation
of the sequences. For the remainder of this paper, we will
simply use the term distance.</p>
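      <p>To make the distance concrete, here is a minimal sketch in Python (assuming NumPy; the function names are ours, not from the paper):
```python
import numpy as np

def znorm(x):
    # Z-normalize a sequence: subtract its mean, divide by its std. dev.
    return (x - np.mean(x)) / np.std(x)

def dist(a, b):
    # Z-normalized Euclidean distance between equal-length sequences A, B:
    # sqrt(sum_i ((A_i - mu_A)/sigma_A - (B_i - mu_B)/sigma_B)^2)
    return np.sqrt(np.sum((znorm(a) - znorm(b)) ** 2))
```
Note that Z-normalization makes the distance invariant to offset and scaling, so two subsequences with the same shape but different amplitudes are considered close.</p>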
      <p>Given a subsequence T_{i,ℓ}, we say that its m-th Nearest
Neighbor (m-th NN) is T_{j,ℓ} if T_{j,ℓ} has the m-th shortest
distance to T_{i,ℓ} among all the subsequences of length ℓ in T,
excluding trivial matches; a trivial match of T_{i,ℓ} is a
subsequence T_{a,ℓ}, where |i − a| &lt; ℓ/2 (i.e., the two subsequences
overlap by more than half their length). Since we are
interested in subsequence anomalies, we first define the set
of all subsequences of length ℓ in a given data series T:
T_ℓ = {T_{i,ℓ} | ∀i, 0 ≤ i ≤ |T| − ℓ + 1}. In general, we assume
that T_ℓ contains both normal and anomalous subsequences.
We then need a way to characterize normal behavior. For
the purpose of the paper, we abstractly define the normal
behavior of the data series as follows:</p>
      <p>Definition 1 (Normal Behavior, NB). Given a
data series T , NB is a model that represents the normal
(i.e., not anomalous) trends and patterns in T .</p>
      <p>The above definition is not precise on purpose: it allows
several interpretations, which can lead to different kinds of
models. Nevertheless, subsequence anomalies can then be
defined in a uniform way: anomalies are the subsequences
that have the largest distances to the expected, normal
behavior, NB (or their distance is above a set threshold). Both
of the proposed approaches will define their own data
structure to represent NB, as well as their own distance measure.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 The NormA Approach</title>
      <p>
        We describe a new approach for subsequence anomaly
detection. We propose a formalization for NB,
called the Normal Model, denoted NM, and defined
as follows [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: NM is a set of sequences, NM =
{(NM_0, w_0), (NM_1, w_1), ..., (NM_n, w_n)}, where NM_i is a
subsequence of length ℓ_NM (the same for all NM_i) that corresponds
to a recurring behavior in the data series T, and w_i is its
normality score (as we explain later, the higher this score is,
the more usual the behavior represented by NM_i is). In other
words, this model averages (with proper weights) the different
recurrent behaviors observed in the data, such that all
the normal behaviors of the data series will be represented
in the normal model, while unusual behaviors will not (or
will have a very low weight).
      </p>
      <p>[Figure 1: (a) d defined as the distance to the Normal
Model NM = {(NM_0, w_0), ..., (NM_n, w_n)}; B is the weighted
barycenter of NM. (b) Normal representation of the data series
as a graph: the path to reach the red dots has smaller
weight than those to reach the blue dots.]</p>
      <p>Figure 1(a) is an illustration of a Normal Model. As
depicted, the Normal Model NM is a weighted combination of
a set of subsequences (points within the dotted circles). The
combination of these subsequences and their related weights
returns distances d_i, d_j, d_k that are high enough to be
differentiated from the normal points/subsequences. These
distances can be seen as the distance between subsequences
and a weighted barycenter B (in green) that represents NM.
Note that we do not actually compute this barycenter; we
illustrate it in Figure 1(a) for visualization purposes.</p>
      <p>
        We choose ℓ_NM &gt; ℓ in order to make sure that we do
not miss useful subsequences, i.e., subsequences with a large
overlap with an anomalous subsequence. For instance, for
a given subsequence of length ℓ, a normal model of length
ℓ_NM = 2ℓ will also contain the subsequences overlapping
with the first and last half of the anomalous subsequence.
We can now define the Abnormal subsequences as follows [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Definition 2 (Subsequence Anomaly). Assume a
data series T, the set T_ℓ of all its subsequences of length
ℓ, and the Normal Model NM of T. Then, the subsequence
T_{j,ℓ} ∈ T_ℓ with anomaly score, i.e., distance to NM,
d_j = Σ_{NM_i ∈ NM} w_i · min_{x ∈ [0, ℓ_NM − ℓ]} dist(T_{j,ℓ}, (NM_i)_{x,ℓ}), is an
anomaly if d_j is in the Top-k largest distances among all
subsequences in T_ℓ, or d_j &gt; τ, where τ ∈ R_{&gt;0} is a threshold.</p>
      <p>
        Note that the only essential input parameter is the length
` of the anomaly (which is also one of the inputs in all
relevant algorithms in the literature [
        <xref ref-type="bibr" rid="ref12 ref16 ref7 ref8 ref9">12, 16, 7, 9, 8</xref>
        ]). The
parameter k (or the threshold τ) is not essential, as long as the algorithm
can rank the anomalies.
      </p>
      <p>
        We stress that in practice, experts start by examining the
most anomalous pattern, and then move down in the ranked
list, since there is (in general) no rigid threshold separating
anomalous from non-anomalous behavior [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>As we mentioned above and will detail later on, we choose
to define NM as a set of sequences that summarizes
normality in T , by representing the average behavior of a set of
normal sequences. Intuitively, NM is the set of data series,
which tries to minimize the sum of Z-normalized Euclidean
distances between itself and some of the subsequences in T .
Last but not least, we need to compute NM in an
unsupervised way, i.e., without having normal/abnormal labels for
the subsequences in T`.</p>
      <p>Observe that this definition of NM implies the following
challenge: even though NM summarizes the normal behavior
only, it needs to be computed based on T, which may include
(several) anomalies. We address this challenge by taking
advantage of the fact that anomalies are a minority class.</p>
      <sec id="sec-3-1">
        <title>2.1.1 NormA Framework</title>
        <p>
          We now briefly describe NormA [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], our unsupervised
solution to the subsequence anomaly detection problem.
[Computing the Normal Model] Recall that NM should
capture (summarize) the normal behavior of the data. This
may not be very hard to do for a sequence T that does not
contain any anomalous subsequences. In practice though,
we want to apply the NormA approach in an unsupervised
way on any sequence, which may contain several anomalies.
        </p>
        <p>We compute the NM in three steps. First, we extract
the subsequences that can serve as candidates for building
the NM. These candidates are either randomly selected
from T (NormA-smpl), or correspond to motifs. Then,
we group these subsequences according to their similarity
in a set of clusters C (we use hierarchical clustering and
Minimum Description Length to identify the right
number of clusters). The last step consists of scoring each
cluster, and selecting the cluster that best represents
normal behavior. Formally, for a given cluster c ∈ C, we
set a weight w_c = Norm(c, C) with the following formula:
Norm(c, C) = (Frequency(c)^2 · Coverage(c)) / Σ_{x ∈ C} dist(Center(c), Center(x)),
where Frequency(c) is the number of subsequences in c, and
Coverage(c) is the time interval between the first and the
last occurrence of a subsequence in c. We thus set NM =
{(NM_0, w_0), (NM_1, w_1), ..., (NM_n, w_n)}, with NM_i the centroid
of the i-th cluster in C.
[Normal Model Based Anomaly Detection]
Intuitively, the anomalous subsequences of the long series T are
the ones that are far away from NM. We thus compute the
sequence S = [d_0, d_1, ..., d_{|T|−ℓ_NM}], with d_j the distance of the
subsequence T_{j,ℓ} to the Normal Model NM as defined in
Definition 2. These distances correspond to the degree of
abnormality: the larger the distance to NM is, the more abnormal
the subsequence is. We then extract the k subsequences of
length ℓ with the highest distances to NM, and rank them
according to their distances. Alternatively, we can extract
all subsequences with distance larger than a threshold.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 The Series2Graph Approach</title>
      <p>
        We now present an alternative data structure to represent
the normal behavior of the subsequences of the data series. The
previous normal model was a set of subsequences that aimed
to store both normal and abnormal subsequences into the
same set, associated with weights that rank them based on
their normality. One can argue that ordering information
is missing from this data structure representation. We thus
formulate an approach for subsequence anomaly detection
based on representing the data series as a graph, in
which edges encode the ordering information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Figure 1(b)
illustrates the Graph data structure.
      </p>
      <p>We rst de ne several basic elements related to graphs.
We de ne a Node Set N as a set of unique integers. Given a
Node Set N , an Edge Set E is then a set composed of tuples
(xi; xj), where xi; xj 2 N . w(xi; xj) is the weight of that
edge. Given a Node Set N , an Edge Set E (pairs of nodes in
N ), a Graph G is an ordered pair G = (N ; E). A directed
graph or digraph G is an ordered pair G = (N ; E) where N
is a Node Set, and E is an ordered Edge Set.</p>
      <p>
        We now provide a new formulation for subsequence
anomaly detection. The idea is that a data series is
transformed into a sequence of abstract states (corresponding to
different subsequence patterns), represented by nodes N in a
directed graph, G(N, E), where the edges E encode the
number of times one state occurred after another. Under this
formulation, paths in the graph composed of high-weight
edges and high-degree nodes correspond to normal behavior.
Then, the Normality of a data series is defined as follows [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Definition (θ-Normality). Given a graph G(N, E), with
E = {(N^{(i)}, N^{(i+1)})}_{i ∈ [0, n−1]}, and θ ∈ R_{&gt;0}, the subgraph
G_θ = (N_θ, E_θ), with N_θ ⊆ N and E_θ ⊆ E, is θ-Normal if
∀(N^{(i)}, N^{(i+1)}) ∈ E_θ, w((N^{(i)}, N^{(i+1)})) · (deg(N^{(i)}) − 1) ≥ θ.</p>
      <p>Using θ-Normality subgraphs naturally leads to a ranking
of subsequences based on their "normality". For practical
reasons, this ranking can be transformed into a score, where
each rank can be seen as a threshold in that score. We used
such a score in GraphAn to detect abnormal subsequences.</p>
      <p>Note that given the existence of graph G, the above
definitions imply a way of identifying the anomalous
subsequences. The problem is now how to construct this graph.</p>
      <sec id="sec-4-1">
        <title>2.2.1 Series2Graph Framework</title>
        <p>
          We now briefly describe Series2Graph [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ], our
unsupervised solution to the subsequence anomaly detection
problem. For a given data series T , the overall Series2Graph
process is divided into four main steps as follows:
        </p>
        <p>[Subsequence Embedding] We project all the
subsequences (of a given length ℓ) of T into a three-dimensional
space that corresponds to the three most important
components of the Principal Component Analysis (PCA). This
space is subsequently transformed into a two-dimensional
space, composed of two components r_y, r_z corresponding to
two vectors orthogonal to v_ref, the axis along which all
flat sequences (a · 1_ℓ with a ∈ R) project. In the latter space, the
shape similarity is preserved [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
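        <p>A rough sketch of the embedding step, using SVD-based PCA; the reduction to the plane orthogonal to the flat-sequence axis v_ref is our simplified reading of the construction, not the exact Series2Graph procedure:
```python
import numpy as np

def embed_subsequences(T, l):
    # Project all length-l subsequences of T on the three leading
    # principal components (PCA via SVD on the centered matrix), then
    # onto the 2-D plane orthogonal to v_ref, the direction along which
    # flat sequences (a * 1_l) project.
    X = np.array([T[i:i + l] for i in range(len(T) - l + 1)])
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P3 = Xc @ Vt[:3].T                    # 3-D PCA projection
    v_ref = Vt[:3] @ np.ones(l)           # flat-sequence axis in PCA space
    n = np.linalg.norm(v_ref)
    v_ref = v_ref / n if n > 0 else np.array([1.0, 0.0, 0.0])
    # Two orthonormal vectors r_y, r_z spanning the plane orthogonal
    # to v_ref (columns 1 and 2 of Q are orthogonal to column 0).
    Q, _ = np.linalg.qr(np.column_stack([v_ref, np.eye(3)[:, :2]]))
    r_y, r_z = Q[:, 1], Q[:, 2]
    return np.column_stack([P3 @ r_y, P3 @ r_z])
```
On a periodic series, subsequences of the same phase land on the same point of the 2-D embedding, so recurring patterns trace dense closed trajectories.</p>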
        <p>
          [Node Creation] We then create a node for each one of
the densest parts of the above two-dimensional space. The
space is first discretized by a set of radii separated by a
fixed angle. We then estimate the density of subsequences along
each one of these radii using Gaussian kernels. The maximal values of
the estimated densities are assigned to nodes and form our
Node Set N. These nodes summarize all major patterns of
length ℓ that occurred in T [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>[Edge Creation] We then retrieve all transitions between
pairs of subsequences represented by two different nodes:
each transition corresponds to a pair of subsequences, where
one occurs immediately after the other in the input data
series T. We represent transitions with an edge between the
corresponding nodes, and we thus form the Edge Set E. The
edge weights are set to the number of times the
corresponding pair of subsequences was observed in T. Finally, we build
our graph G_ℓ(N, E).</p>
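        <p>A simplified sketch of the node and edge creation steps, replacing the kernel-density node extraction with plain angular binning (so the nodes here are only a crude stand-in for the paper's density maxima):
```python
import numpy as np
from collections import Counter

def build_graph(embedding, n_angles=36):
    # Map each embedded subsequence to an angular bin (a crude stand-in
    # for the density-based node extraction), then count transitions
    # between consecutive subsequences to weight the edges.
    angles = np.arctan2(embedding[:, 1], embedding[:, 0])
    nodes = ((angles + np.pi) / (2 * np.pi) * n_angles).astype(int) % n_angles
    edges = Counter()
    for u, v in zip(nodes[:-1], nodes[1:]):
        if u != v:                 # keep transitions between distinct nodes
            edges[(u, v)] += 1
    return set(nodes.tolist()), dict(edges)
```
Because consecutive subsequences overlap heavily, most transitions stay in the same bin; the counted edges capture how the series moves from one pattern to the next.</p>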
        <p>[Subsequence Scoring] We compute the normality
(or anomaly) score of a subsequence of length ℓ_q ≥ ℓ
(within or outside of T), based on the previously
computed edges/nodes and their weights/degrees. Formally, for
a subsequence T_{i,ℓ_q} of T, represented in G_ℓ(N, E) by a path
Pth = ⟨N^{(i)}, N^{(i+1)}, ..., N^{(i+ℓ_q)}⟩, the normality score is
defined as:
Norm(Pth) = (1/ℓ_q) · Σ_{j=i}^{i+ℓ_q−1} w(N^{(j)}, N^{(j+1)}) · (deg(N^{(j)}) − 1),
where w(e) and deg(n) are the weight of edge e and the
degree of node n, respectively.</p>
        <p>[Figure 2: Comparison of six state-of-the-art methods
to Series2Graph and NormA: (a) Precision@k with a critical
diagram over 21 datasets; (b) execution time vs. varying number
of points; (c) execution time vs. varying anomaly length.]</p>
        <p>In our current work, we focus on the following directions.
[Normal Behavior Model] We will study in depth
different models for Normal Behavior, and perform analytical
and experimental comparisons among them. The goal will
be to also support data series with several variables. The
challenging problems are how to create a multivariate
Normal Behavior model and corresponding data structure, and
how to define and formalize suitable distance measures for this case.
[Online Operation] In several situations, subsequence
anomaly detection applications need to operate in an
online fashion. In this context, we will develop extensions of
our methods for real-time applications. We will analyze the
impact of data characteristics changes on the Normal
Behavior models, and explore solutions for updating the Normal
Behavior data structures in real-time.</p>
        <p>Acknowledgments. Work partially supported by EDF
R&amp;D and ANRT French program.</p>
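        <p>The path score can be transcribed directly; `edges` and `degree` are assumed to be the weight and degree maps of the constructed graph:
```python
def path_normality(path, edges, degree, lq):
    # Norm(Pth) = (1 / l_q) * sum over consecutive node pairs of
    # w(N(j), N(j+1)) * (deg(N(j)) - 1).
    total = 0.0
    for u, v in zip(path[:-1], path[1:]):
        # Unseen transitions contribute weight 0.
        total += edges.get((u, v), 0) * (degree[u] - 1)
    return total / lq
```
Paths through frequent transitions and well-connected nodes score high (normal); anomalous subsequences follow rare edges and receive low scores.</p>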
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Experimental Analysis</title>
      <p>
        In our preliminary experimental evaluations [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ], we used
21 synthetic and real datasets from diverse domains, and
measured Precision@k (i.e., accuracy in the Top-k answers)
and execution time. (Due to lack of space, we only report
here aggregated results.)
      </p>
      <p>
        After rejecting the null hypothesis using the
Friedman test, we used the pairwise post-hoc analysis with a
Wilcoxon signed-rank test (α = 0.05) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and we depict the
results in Figure 2(a). The results show that the NormA and
Series2Graph approaches are the overall winners,
performing statistically significantly better than the state-of-the-art
algorithms from both the data series and the
multidimensional data communities. (Even though Series2Graph seems
to be slightly better than both variants of NormA, this
difference is not statistically significant.)
      </p>
      <p>Moreover, Figures 2(b) and (c) demonstrate that
NormA-smpl and Series2Graph are considerably faster than the
other methods (note the log-scale y-axis). In particular,
NormA-smpl is one to three orders of magnitude faster than
the rest, as we increase the length of the input sequence, or
the length of the anomalies.</p>
      <p>We note that further experiments are needed in order to
compare in detail our proposed methods.</p>
    </sec>
    <sec id="sec-6">
      <title>3. CONCLUSIONS AND FUTURE WORK</title>
      <p>Our studies on NormA and Series2Graph, described
above, constitute a first attempt to formalize the problem of
Normal Behavior representation in the context of data
series. The two proposed approaches describe data structures
based on subsequence sets and graphs, respectively. The
experimental results show that both NormA and Series2Graph
outperform the current state-of-the-art techniques
over a large set of diverse datasets. This indicates that the
proposed approach of using a Normal Behavior model for the
subsequence anomaly detection problem is very promising.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Abboud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elbadaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Smith</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Randall</surname>
          </string-name>
          .
          <article-title>Advanced bearing diagnostics: A comparative study of two powerful approaches</article-title>
          .
          <source>MSSP</source>
          ,
          <volume>114</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Barnett</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <article-title>Outliers in Statistical Data</article-title>
          . John Wiley and Sons, Inc.,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Linardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roncallo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <article-title>Automated Anomaly Detection in Large Sequences</article-title>
          .
          <source>ICDE</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Linardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roncallo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <article-title>SAD: An Unsupervised System for Subsequence Anomaly Detection</article-title>
          .
          <source>ICDE</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boniol</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          . Series2graph:
          <article-title>Graph-based subsequence anomaly detection for time series</article-title>
          . PVLDB,
          <volume>13</volume>
          (
          <issue>11</issue>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Meftah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Remy</surname>
          </string-name>
          . Graphan:
          <article-title>Graph-based subsequence anomaly detection</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>13</volume>
          (
          <issue>11</issue>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. T.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Keogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meshkin</surname>
          </string-name>
          .
          <article-title>WAT: finding top-k discords in time series database</article-title>
          .
          <source>In SIAM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Linardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Keogh</surname>
          </string-name>
          .
          <article-title>Matrix Profile Goes MAD: Variable-Length Motif And Discord Discovery in Data Series</article-title>
          .
          <source>In DAMI</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Luo</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          .
          <article-title>Faster and parameter-free discord search in quasi-periodic time series</article-title>
          .
          <source>In Advances in Knowledge Discovery and Data Mining</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          .
          <article-title>Data series management: The road to big sequence analytics</article-title>
          .
          <source>SIGMOD Rec.</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ):
          <fpage>47</fpage>
          –
          <lpage>52</lpage>
          , Aug.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Beckmann</surname>
          </string-name>
          .
          <source>Report on the First and Second Interdisciplinary Time Series Analysis Workshop (ITISA)</source>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>48</volume>
          (
          <issue>3</issue>
          ),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Senin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gandhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Boedihardjo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Frankenstein</surname>
          </string-name>
          .
          <article-title>Time series anomaly discovery with grammar-based compression</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramaniam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kalogeraki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunopulos</surname>
          </string-name>
          .
          <article-title>Online outlier detection in sensor data using non-parametric models</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>187</fpage>
          –
          <lpage>198</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wilcoxon</surname>
          </string-name>
          .
          <article-title>Individual comparisons by ranking methods</article-title>
          .
          <source>Biometrics Bulletin</source>
          ,
          <volume>1</volume>
          (
          <issue>6</issue>
          ):
          <fpage>80</fpage>
          –
          <lpage>83</lpage>
          ,
          <year>1945</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yankov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Keogh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Rebbapragada</surname>
          </string-name>
          .
          <article-title>Disk aware discord discovery: finding unusual time series in terabyte sized datasets</article-title>
          .
          <source>Knowl. Inf. Syst.</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>241</fpage>
          –
          <lpage>262</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ulanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Begum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Dau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Keogh</surname>
          </string-name>
          .
          <article-title>Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets</article-title>
          .
          <source>In ICDM</source>
          , pages
          <fpage>1317</fpage>
          –
          <lpage>1322</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>