1. Introduction

Event2Graph: Event-driven Bipartite Graph for Multivariate Time Series Forecasting and Anomaly Detection

Yuhang Wu

Mengting Gu

Lan Wang

Yu-san Lin

Fei Wang

Hao Yang

0 0 Visa Research , 385 Sherman Ave., Palo Alto, CA , U.S.A

Modeling inter-dependencies between time series is the key to achieve high performance in anomaly detection for multivariate time series data. The de-facto solution to model the dependencies is to feed the data into a recurrent neural network (RNN). However, the fully connected network structure underneath the RNN (either GRU or LSTM) assumes a static and complete dependency graph between time series, which may not hold in many real-world applications. To alleviate this assumption, we propose a dynamic bipartite graph structure to encode the inter-dependencies between time series. More concretely, we model time series as one type of nodes, and the time series segments (regarded as event) as another type of nodes, where the edge between two types of nodes describe a temporal pattern occurred on a specific time series at a certain time. Based on this design, relations between time series can be explicitly modeled via dynamic connections to event nodes, and the multivariate time series anomaly detection problem can be formulated as a self-supervised, edge stream prediction problem in dynamic graphs. We conducted extensive experiments to demonstrate the efectiveness of the design.

eol>Time series Anomaly detection Dynamic graph

1. Introduction

derlying inter-dependencies are diferent. Our work is built upon the recent wisdom of dynamic graph neural Detecting anomalies in time series data has been an im- network. We allow the connectivity of the graph changes portant problem in the research community of data min- dynamically in diferent time stamps based on the pating as well as the finance industry. In many circum- terns on the time series. stances, anomaly patterns in multiple time series need In order to construct a dynamic graph on-the-fly, one to be taken into account together to disclose the full important question is how to determine the connectivpicture of the system. Previous works in multivariate ity of the graph in each timestamp. Our work is inanomaly detection mainly rely on recurrent neural net- spired by the recent progress of evolutionary event graph works (RNNs). Malhotra et al.[ 1 ] and Hundman et al.[ 2 ] [ 7, 8 ] where the nodes in the graph represent the timeemployed LSTM models [ 3 ] to capture the temporal de- sequences segments (events) and directed links represent pendencies of multivariate signals and adopted predic- the transition of the segments (events). Compare to prevition and reconstruction errors, respectively, to identify ous works, this line of research naturally models the timethe anomaly patterns. Su et al.[ 4 ] improved the classical varying relations among time series states via dynamic RNN model by modeling the probability distribution of connections, and each state carries a physical meaning time series via incorporating stochastic variables. To fur- that is understandable by human. However, one major ther model the correlations of time series explicitly, Zhao limitation of [ 8 ] is that the event nodes employed in this et al.[ 5 ] proposed a graph attention network to propa- work capture the information across all the time series. gate information from diferent time series and aggregate Assume there are segment patterns in each time series, the information together before feeding into a GRU [ 6 ]. and the number of time series is . The model would However, because of the underlying RNN structure, pre- need number of event nodes to represent the multivious works assume a static, complete dependency graph variate signals. This exponential number of event nodes among time series. These approaches may not perform strongly limits the information processing capability of well under regime change of time series where the un- the evolutionary state graph and therefore allows only a very small number of time series () to be analyzed in practice (four in the dataset used in the paper [ 8 ]).

To address the problem of exponential number of combinations, we disentangle the time series nodes and the event nodes in our design, and model them as two types of nodes in a dynamic bipartite graph (as depicted in Fig. 1). Each event node only represents a time segment on one individual time series, instead of integrating patterns across all time series. The undirectional connection between two types of nodes indicates event sors, Su et al.[ 4 ] proposed a stochastic LSTM framework happens on the ℎ time series at time . So the maxi- to model the data distribution of time series in a global mum number of edges in the graph is (), which perspective. As RNN was originally proposed to model is much smaller than (2). To further improve the the temporal dependencies between diferent timestamps, eficiency and generalizability of the algorithm, we built and all the dimensions of the input of RNN are used to deupon the framework based on the recent advances in scribe a single concept (e.g.word) all together, it was not edge streams [9, 10, 11], where connections between tailored to model the inter-dependencies among variables nodes are modeled as incoming attributed edges instead by design. Recently, Zhao et al.[ 5 ] proposed a graph atof constructing adjacency matrices. The complete sys- tention network-based (GAT) approach [12] where each tem, with the name Event2Graph (Event-driven bipartite time series is regarded as an individual node, and informaGraph), outperforms previous strong baselines on two tion are aggregated based on the underlying similarity public data-sets, and we summarize our main contribu- of the signals. While this solution partially mitigates tions as follows: the problem by taking into account of dynamic pairwise (i) We propose a bipartite event-graph-based system to similarities between time series, the design assumes a analyze the multivariate time series and model the inter- complete and static inter-dependency graph. Also, the actions between time series and event segments via edge processed information after GAT is simply aggregated streams. (ii) Events employed in the system is highly all together and fed into a single GRU. Hence the module interpretable and easy for human to interact with. (iii) still sufers from similar problems as previous designs. The system achieves competitive performance on chal- Dynamic inter-dependency relation: A few recent lenging anomaly detection datasets through graph edge works started to explore the partial inter-dependency forecasting. relations among time series. Deng et al.[13] constructed a neighborhood graph on-the-fly based on sensor embeddings to describe the dependencies between diferent 2. Related work sensors. The sparsity level of this incomplete graph can be customized by users, but the connectivity of the graph is fixed once constructed. Hu et al.[ 8 ] proposed the first dynamic dependency design in multivariate anomaly detection. In this design, nodes are interpretable time series states, and the transition of these states are explicitly modeled by dynamic graph neural network. This design allows the system to represent the time-varying relations among time series. However, one major limitation of Static inter-dependency relation: In this category of approach, relations (e.g., correlation) between multivariate time series are fixed once learned, and the model assumes all the time series can influence each other (complete inter-dependency). LSTM-based framework has been widely employed in this category of work. For instance, Malhotra et al.[ 1 ] proposed a LSTM-based autoencoder network to detect anomalies from multiple senthe method is it uses a single event node to represent segments across all time series. This design brings the combination explosion problem in pattern representation (mentioned in Section I) and makes it hard to tackle a moderate scale problem. Our model is built upon the advances in dynamic graph neural networks. Most of the literatures in dynamic graph networks assume discretetime graph dynamics where the graphs are represented as a sequence of snapshots [14, 15, 16]. Recent works start exploring more flexible design and assuming edges can appear at any time [10, 17, 9, 18, 11]. Our solution is derived from a recent solution named Temporal Graph Networks (TGN) proposed by Rossi et al.[11]. The model is built upon the temporal graph attention network (TGAT) [10] but extended with an unique node-wise memory which attempt to model the temporal dependency on the node level instead of the system level as [ 5 ]. It allows us to model the dynamic inter-dependency in our bi-partite event graph accurately, and provide us the flexibility to incorporate new nodes that have not seen in training.

3. Problem definition

A multivariate time series is defined as X ∈ R × where is the maximum number of time stamps in the time series and is the dimension. The anomaly detection problem is to detect specific time stamps ̇* ∈ O in a multivariate time series where the time series behavior deviating from the normal patterns of the time series. Note that O contains timestamps that marked as anomaly by domain expert. In case is large, a common way to handle long time series is to use a sliding window approach. Let denote the length of the sliding window. We hereby formulate the problem as a binary classification problem with the objective to identify time windows with X[˙− :˙]× that contain anomaly time-stamps. One solution of solving the multivariate time series anomaly detection problem is to convert time series into a homogeneous graph [ 7, 8 ], where G is defined on the representative patterns of a multivariate time series. The node of the graph v corresponds to a human interpretable time series patterns p ∈ R × . The pattern should be representative enough so that the whole time series can be approximated by transitions of symbolic events (states) in the graph. The transitional relation between two sequential nodes (end with stamp and + 1) is defined as an edge in the graph. Many existing works (e.g., Bag of Patterns, Shaples, sequence clustering) can be used to distill representative patterns (event) from time series, but most of them only efective on univariate time series where = 1.

4. Bipartite Event Stream The core intuition behind the dynamic bipartite design

is to decouple three key concepts in the traditional event graph, namely: where (which time series), when (at what time), and which (the event category). This practice helps us avoid the problem of exponential pattern combinations.

Specifically, the proposed algorithm is built upon a dynamic bipartite graph. We define it as a sequence of undirected bipartite event graph = {(v, v, (v, v))}. Here represents when the time window that the event graph is formulated. If the window size is set to be 1, then the time window is essentially the same as the time step ̇. A time sequence index node v indicates “where" an event v is happening. An attributed edge (v, v) that connect event node v and sequence index node v indicates an event v happened on time series at time window . For simplicity, we also denote the edge as , in the following sections. We employ an edge stream representation so an edge (v, v) is only constructed to represent the relation that actually existed. A major benefit of using edge stream representation over adjacency matrix is that it allows the graph structure to be more scalable and flexible, which provides us the generality of incorporating new events that have not appeared in training. Based on this bipartite graph structure, we convert the multivariate anomaly detection problem into a self-supervised edge prediction problem in dynamic graph. Given the historical sequence of G(1:) and event nodes v+1 in time window + 1, we are predicting edges ˆ(v+1, v+1)) in G+1. The anomaly is derived as a mismatching score from the predicted edge set ˆ(v+1, v+1)) and the observed edge set (v+1, v+1)) with a read-out function (· ).

So the procedure of solving the anomaly detection problem becomes: 1. Given a normal sequence X, identify representative patterns on each of the time sequence respectively in multivariate time series. 2. With identified representative patterns events, merge similar events across time series to indicate afinity relation between time series. 3. Build a sequence of bipartite event graph for multivariate time series (with a stride ) based on event matching. 4. Analyze the sequence of bi-partite event graphs and derive a model Ψ to describe the intra- and inter-dependency relations. 5. Given a testing sequence X, repeat (3), apply the model learned in (4) to predict ˆ at time window * . 6. Derive anomaly scores based on the predicted ˆ and original time series with a proposed readout function (· ).

5. Event Node Detection And Matching

place. We employed a highly eficient C language-based Dynamic Time Wrapping algorithm [23] to conduct the event matching, which provides a ( − )∖ × × We defined the event nodes on each time series separately similarity tensor ℒ to indicate the similarity of each and employed an unsupervised algorithm to identify the event to all the sliding window with size and stride events. As we have no prior knowledge about what an in the input. anomaly pattern would look like in most real world appli- Event nodes created by pattern matching: for a single cations, the system is designed to learn the representative time stamp , given ℒ the event graph is created by patterns and their correlations from observed data. selecting the best matched events for each time series.

Representative segment detection: We adopted Matrix The event node is then connected with the correspondProfile [ 19], a state-of-the-art unsupervised algorithm to ing time series node v in the event graph G with an identify representative patterns from time sequences.f attribute edge (v, v). By finding and linking the best Matrix profile is able to identify the top- repeated pat- matched event for each time series, we added number terns on time series with high accuracy and low com- of event nodes into the graph. putation time. We employed a recent implementation Event nodes created by residual error: we noticed that of matrix profile called SCRIMP++ [ 20] on each of the even for the best matched time series, the pattern matchtime series in X, and the algorithm yields a list of single ing still left with small residual errors. We hereby define dimensional representative patterns p, ( ∈ {1 : }, two general residual nodes which indicate whether the ∈ {1 : }), each with size . residual error in a time series is larger than a threshold

Event generation: many time series share similar rep- . One residual event node denoted as v+ indicates the resentative patterns with each other. It is reasonable to residual error is larger than , another residual event merge similar segment patterns together across diferent node denoted as v− indicates the residual error is equal time series to create the event nodes. To measure the or smaller than . The parameter is automatically gensimilarity between time series, we employed dynamic erated using SPOT [24] algorithm, where we employ the time wrapping [21], which provides us with a robust dis- whole training data-set for initialization and testing for tance matrix that is insensitive to small misalignments on-going adaptation. All time series shared these two between two time series segments. After that, we em- residual nodes as shown in Fig. 3. ployed the H-DBSCAN [22] to cluster the patterns p, After generating a sequence of bipartite event graph into clusters. We select the centroid of each cluster to G(1:) for time series X, a model Ψ is trained on G(1:) be an event, which is a representative segment across all so that given a testing sequence X, it is able to predict time series within the cluster. Finally, we obtain an event the connectivity of an event graph in any specific time set E which contains all events each with length . stamp ̇ by observing historical sequence observed on

Event matching: after extracting events from time se- X. In Fig. 2, we visualize the events forecasted/detected ries segments, we match each event with the original time on ten time series. The diference between the top and series to identify where and when each event is taking the middle group result in the bottom group where we

5.1. Network Design With Node-wise Memory

time as following: () =

∑︁ ∈([0,]) ((), (), ,, (), ()) (3) For each pair of v and v in graph G(), the node fea- TGA represents the temporal graph attention module tures of them are defined as and , respectively. We [11], where graph attention layers compute ’s embedalso define , as the edge features between v and v. ding by aggregating information from its L-hop temporal In order to support the dynamic inter-dependency graph, neighbors. we avoid explicitly defining a static adjacency matrix to describe the relation between nodes. Instead, we adopt a 5.2. Optimization and inference state-message framework to model the node interactions.

For each node v at time stamp (here we use time se- The aforementioned model is trained in a self-supervised ries node as an example; the same rule applies to the fashion. As our goal is to predict the events that might event node ), we define a state vector () to repre- happen in the time sequences in the next time step, which sent its interaction history with other v nodes before corresponds to the edges linking the time sequence nodes in a compressed format. By initiating (0) as an all and event nodes. Therefore, we train a simple MLP model zero vector, the interaction at time is encoded with a with the edge prediction task based on (). message vector (): We use the cross entropy to supervise the forecasting of which original and residual events would happen in a () = [ ,()||∆ ||(− )||(− )] (1) given time stamp . It is a two hot classification problem where the two activated nodes corresponding to an event where ∆ is the time elapse between the previous time node and a residual event node. stamp − and , symbol || means concatenating operation. Event Forecasting Score: To convert the predicted event After aggregating all the messages from neighbors, the edges ˆ into an anomaly score, for each time series node state vector of v is updated as: v, with window size , the event v, that has the () = ({(1), ..., ()}, (− )) highest probability connect to v, is retrieved and its (2) pattern in the original signal space is denoted as s, .

We project the event s, ’s pattern back to its original signal space, we then compute anomaly score based on the dynamic time wrapping distance as follows: Here agg(· ) is a general aggregation operation (support learnable parameters). For the sake of simplicity, we only compute the mean of the most recent messages 1, = (X, , s, ) (4) in aggregation. We use mem(· ) to represent a trainable update function (e.g., GRU). Residual Score: For a positive residual event v+ at

Build upon the updated state vector () and (), time stamp where the forecasted results is not a posia time-aware node embedding can be generated at any tive residual vˆ− , we calculate a changing point score to Number of Time series nodes

Number of event nodes Time window length ( ) Time window stride ( )

X−, ||)

(5) where is a standard function that maps a scalar into the negative log likelihood, which indicates the sparsity of this changing point in the training data. A frequent changing signal may results in small 2, after the mapping. The function is learned in a data-driven manner based on the training data of time series . The ifnal anomaly score at time stamp is calculated as the following read out function (· ): Dataset SMD SMAP

Method DAGMM LSTM-VAE

LSTM-NDT OmniAnomaly MTAD-GAT Event2Graph

DAGMM LSTM-VAE

LSTM-NDT OmniAnomaly MTAD-GAT Event2Graph

The proposed bipartite event graph contains +

number of nodes, where is the number of time series and is the number of events ( equals to 25 and 38 in SMAP and SMD datasets, equal to 130 and 64 on SMAP and SMD datasets). The detailed number of nodes is shown in Table 2. The length ( ) and stride ( ) of the time window of each dataset are also displayed in Table 2. For each time series, we set the maximum number of motifs detected to be three, and the minimum cluster size of H-DBSCAN to be three. For the temporal attention model, we set the number of multi-head to be 2. The GRU is employed to model the time encoding. The dimension of node embedding and message vector are both set to 64 respectively. Each model is trained after 10 epochs with the learning rate 0.0001. The node features of both time series and event nodes are randomly initialized, and edge features are the one-hot embedding of the numerical ID of the nodes on both sides of the edge.

The objective of this experiment is to provide detailed analysis over the efectiveness of each proposed module. 6.3. Ablation study

By removing each of the critical module of our model, Table 3 the performances are reported in Table 2. Ablation study on SMD dataset. 6.3.1. Efectiveness of temporal graph attention

Through replacing the temporal graph attention with

a simple MLP module, the model’s F1 score is reduced by 2.42%. It demonstrated that the temporal attention plays an important role in aggregating information from neighborhood nodes. 6.3.2. Efectiveness of event forecasting score 1:

We remove the event matching score (Eq. 4) and just

use a residual score (Eq. 5) for anomaly detection. We observe that the model performance drop by 5.48%. The experiment indicates the event forecasting module in Event2Graph is essential for accurate anomaly detection. From the experiment, we also observe that TGA plays a critical role in guarantee the quality of the forecasting score 1. By removing TGA, the forecasting-based model performance (w/o score 2) degrade from 73.81% F1 score to 64.51%. 6.3.3. Efectiveness of residual score 2 We remove the residual score (Eq. 5) and only use forecasting score (Eq. 4) to detect anomaly. The model sufers more than 10% performance degradation. This results shows that modeling the residual score is essential to allow the system to identify the anomaly regions that may not well characterized by the event matching scores. We also observe that the TGA may not contribute much to residual event modeling, a 2 only model perform competitively well with or without TGA. This phenomenon tells us the residual event is something unexpected and an accurate changing point detection is better than learning simple, repeatable patterns. 6.3.4. Efectiveness of log likelihood function To demonstrate our assumption, we compare the results with a baseline that using a naive changing point detection score , without adopting any forecasting or event modeling model. Surprisingly, the changing-point only model is able to achieve 78.48% F1-score. Outperforms DAGMM, LSTM-VAE, as well as LSTM-NDT. This results further confirm that it is hard to learn any static pattern by using LSTM or autoencoder for anomaly detection on a challenge dataset, a dynamic model that is able to adapt to the changing of time series patterns is preferred.

Model Event2Graph

w/o TGA w/o score 1 w/o score 2 w/o TGA and w/o 1 w/o TGA and w/o 2 only

7. Conclusion We proposed an event-driven bipartite graph solution

for multivariate time series anomaly detection. The solution does not assume any inter-dependency on time series, and all the relations are learned in a dynamic, datadriven manner. Our design is based on edge-stream so no adjacency matrix of the graph is required as input.

As the system’s memory is defined on the node level, our design left plenty space for future extensions such as inductive learning and parallel computation. Our solution achieved very competitive results on two anomaly detection datasets, and we encourage future works to explore further using bipartite event graph for multivariate anomaly detection. series event prediction with evolutionary state theory, KDD’ 2017.

graph, WSDM ’21,. [25] B. Zong, Q. Song, M. R. Min, W. Cheng, [9] S. Kumar, X. Zhang, J. Leskovec, Predicting dy- C. Lumezanu, D. Cho, H. Chen, Deep autoencoding namic embedding trajectory in temporal interac- gaussian mixture model for unsupervised anomaly tion networks, KDD’, 2019. detection, ICLR’ 2018. [10] D. Xu, C. Ruan, E. Körpeoglu, S. Kumar, K. Achan, [26] D. Park, Y. Hoshi, C. C. Kemp, A multimodal Inductive representation learning on temporal anomaly detector for robot-assisted feeding usgraphs, ICLR’ 2020. ing an lstm-based variational autoencoder, IEEE [11] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, Robotics and Automation Letters (2018).

F. Monti, M. M. Bronstein, Temporal graph net- [27] L. Shen, Z. Li, J. T. Kwok, Timeseries anomaly detecworks for deep learning on dynamic graphs, CoRR, tion using temporal hierarchical one-class network, abs/2006.10637 (2020). NIPS’ 2020. [12] P. Velickovic, G. Cucurull, A. Casanova, A. Romero,

P. Liò, Y. Bengio, Graph attention networks, ICLR’ 2018. [13] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time series,

AAAI’ 2021. [14] W. Yu, W. Cheng, C. C. Aggarwal, H. Chen,

W. Wang, Link prediction with spatial and temporal consistency in dynamic networks, IJCAI’ 2017. [15] W. Yu, W. Cheng, C. C. Aggarwal, K. Zhang,

H. Chen, W. Wang, Netwalk: A flexible deep embedding approach for anomaly detection in dynamic networks, KDD’ 2018. [16] A. Sankar, Y. Wu, L. Gou, W. Zhang, H. Yang, Dysat:

Deep neural representation learning on dynamic graphs via self-attention networks, WSDM’, 2020. [17] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed,

E. Koh, S. Kim, Dynamic network embeddings: From random walks to temporal random walks,

IEEE International Conference on Big Data, 2018. [18] R. Trivedi, M. Farajtabar, P. Biswal, H. Zha, Dyrep:

Learning representations over dynamic graphs,

ICLR’ 2019. [19] C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding,

H. A. Dau, D. F. Silva, A. Mueen, E. J. Keogh, Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets, ICDM’ 2016. [20] Y. Zhu, C. M. Yeh, Z. Zimmerman, K. Kamgar, E. J.

Keogh, Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds, ICDM’ 2018. [21] D. J. Berndt, J. Cliford, Using dynamic time warping to find patterns in time series, AAAI’ (Workshop) 1994. [22] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering, J. Open Source Softw, 2017 (????). [23] A. Mueen, Y. Zhu, M. Yeh, K. Kamgar,

K. Viswanathan, C. Gupta, E. Keogh, The fastest similarity search algorithm for time series subsequences under euclidean distance, 2017. [24] A. Sifer, P.-A. Fouque, A. Termier, C. Largouet,

Anomaly detection in streams with extreme value

[1]

Malhotra ,

Ramakrishnan ,

Anand ,

Vig ,

Agarwal , G. Shrof, Lstm-based encoderdecoder for multi-sensor anomaly detection , CoRR abs/1607 .00148 (????).

[2]

Hundman ,

Constantinou ,

Laporte , I. Colwell, T. Söderström, Detecting spacecraft anomalies using lstms and nonparametric dynamic thresholding , KDD' 2018 .

[3]

Hochreiter ,

Schmidhuber , Long short-term memory, Neural Comput . ( 1997 ).

[4]

Su ,

Zhao ,

Niu ,

Liu ,

Sun ,

Pei , Robust anomaly detection for multivariate time series through stochastic recurrent neural network , KDD' 2019 .

[5]

Zhao ,

Wang ,

Duan ,

Huang ,

Cao ,

Tong ,

Xu ,

Bai ,

Tong ,

Zhang , Multivariate time-series anomaly detection via graph attention network , ICDM' 2020 .

[6]

Chung , Ç. Gülçehre,

Cho ,

Bengio , Empirical evaluation of gated recurrent neural networks on sequence modeling , CoRR (????).

[7]

Cheng ,

Yang ,

Wang ,

Hu ,

Zhuang , G. Song, Time2graph: Revisiting time series modeling with dynamic shapelets , AAAI' 2020 .

[8]

Hu ,

Yang , Z. Cheng,

Yang ,

Ren , Time-