Event2Graph: Event-driven Bipartite Graph for Multivariate Time Series Forecasting and Anomaly Detection

Yuhang Wu1,*, Mengting Gu1, Lan Wang1, Yu-san Lin1, Fei Wang1 and Hao Yang1
1 Visa Research, 385 Sherman Ave., Palo Alto, CA, U.S.A.

Abstract
Modeling inter-dependencies between time series is the key to achieving high performance in anomaly detection for multivariate time series data. The de-facto solution to model the dependencies is to feed the data into a recurrent neural network (RNN). However, the fully connected network structure underneath the RNN (either GRU or LSTM) assumes a static and complete dependency graph between time series, which may not hold in many real-world applications. To alleviate this assumption, we propose a dynamic bipartite graph structure to encode the inter-dependencies between time series. More concretely, we model time series as one type of node, and the time series segments (regarded as events) as another type of node, where an edge between the two types of nodes describes a temporal pattern that occurred on a specific time series at a certain time. Based on this design, relations between time series can be explicitly modeled via dynamic connections to event nodes, and the multivariate time series anomaly detection problem can be formulated as a self-supervised, edge stream prediction problem in dynamic graphs. We conducted extensive experiments to demonstrate the effectiveness of the design.

Keywords
Time series, Anomaly detection, Dynamic graph

AMLTS'22: Workshop on Applied Machine Learning Methods for Time Series Forecasting, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
$ yuhawu@visa.com (Y. Wu); mengu@visa.com (M. Gu); lwang4@visa.com (L. Wang); yusalin@visa.com (Y. Lin); feiwang@visa.com (F. Wang); haoyang@visa.com (H. Yang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Detecting anomalies in time series data has been an important problem in the research community of data mining as well as in the finance industry. In many circumstances, anomaly patterns in multiple time series need to be taken into account together to disclose the full picture of the system. Previous works in multivariate anomaly detection mainly rely on recurrent neural networks (RNNs). Malhotra et al. [1] and Hundman et al. [2] employed LSTM models [3] to capture the temporal dependencies of multivariate signals and adopted prediction and reconstruction errors, respectively, to identify the anomaly patterns. Su et al. [4] improved the classical RNN model by modeling the probability distribution of time series via incorporating stochastic variables. To further model the correlations of time series explicitly, Zhao et al. [5] proposed a graph attention network to propagate information from different time series and aggregate the information together before feeding it into a GRU [6]. However, because of the underlying RNN structure, previous works assume a static, complete dependency graph among time series. These approaches may not perform well under regime change of time series, where the underlying inter-dependencies are different. Our work is built upon the recent wisdom of dynamic graph neural networks: we allow the connectivity of the graph to change dynamically at different time stamps based on the patterns observed on the time series.

In order to construct a dynamic graph on-the-fly, one important question is how to determine the connectivity of the graph at each timestamp. Our work is inspired by the recent progress of evolutionary event graphs [7, 8], where the nodes in the graph represent the time-series segments (events) and directed links represent the transitions of the segments (events). Compared to previous works, this line of research naturally models the time-varying relations among time series states via dynamic connections, and each state carries a physical meaning that is understandable by humans. However, one major limitation of [8] is that the event nodes employed in this work capture the information across all the time series. Assume there are K segment patterns in each time series, and the number of time series is D. The model would need K^D event nodes to represent the multivariate signals. This exponential number of event nodes strongly limits the information processing capability of the evolutionary state graph and therefore allows only a very small number of time series (D) to be analyzed in practice (four in the dataset used in the paper [8]).

To address the problem of the exponential number of combinations, we disentangle the time series nodes and the event nodes in our design, and model them as two types of nodes in a dynamic bipartite graph (as depicted in Fig. 1). Each event node only represents a time segment on one individual time series, instead of integrating patterns across all time series. The undirected connection between the two types of nodes indicates that event e happens on the d-th time series at time t. So the maximum number of edges in the graph is O(KD), which is much smaller than O(K^{2D}). To further improve the efficiency and generalizability of the algorithm, we built upon the recent advances in edge streams [9, 10, 11], where connections between nodes are modeled as incoming attributed edges instead of adjacency matrices. The complete system, named Event2Graph (Event-driven bipartite Graph), outperforms previous strong baselines on two public datasets, and we summarize our main contributions as follows: (i) We propose a bipartite event-graph-based system to analyze multivariate time series and model the interactions between time series and event segments via edge streams. (ii) The events employed in the system are highly interpretable and easy for humans to interact with. (iii) The system achieves competitive performance on challenging anomaly detection datasets through graph edge forecasting.

Figure 1: This toy example has three time series nodes and four event nodes. Time step t90 corresponds to an anomaly pattern while the other time stamps are normal. The relations between time series nodes and event nodes are highlighted: a green dashed line means a predicted relation, while a red line means an actual relation.

2. Related work

Static inter-dependency relation: In this category of approaches, relations (e.g., correlation) between multivariate time series are fixed once learned, and the model assumes all the time series can influence each other (complete inter-dependency). LSTM-based frameworks have been widely employed in this category of work. For instance, Malhotra et al. [1] proposed an LSTM-based autoencoder network to detect anomalies from multiple sensors; Su et al. [4] proposed a stochastic LSTM framework to model the data distribution of time series from a global perspective. As the RNN was originally proposed to model the temporal dependencies between different timestamps, and all the dimensions of the RNN input are used to describe a single concept (e.g., a word) all together, it was not tailored to model the inter-dependencies among variables by design. Recently, Zhao et al. [5] proposed a graph attention network-based (GAT) approach [12] where each time series is regarded as an individual node, and information is aggregated based on the underlying similarity of the signals. While this solution partially mitigates the problem by taking into account dynamic pairwise similarities between time series, the design still assumes a complete and static inter-dependency graph. Also, the processed information after GAT is simply aggregated all together and fed into a single GRU, so the module still suffers from similar problems as previous designs.

Dynamic inter-dependency relation: A few recent works have started to explore partial inter-dependency relations among time series. Deng et al. [13] constructed a neighborhood graph on-the-fly based on sensor embeddings to describe the dependencies between different sensors. The sparsity level of this incomplete graph can be customized by users, but the connectivity of the graph is fixed once constructed. Hu et al. [8] proposed the first dynamic dependency design in multivariate anomaly detection. In this design, nodes are interpretable time series states, and the transitions of these states are explicitly modeled by a dynamic graph neural network. This design allows the system to represent the time-varying relations among time series. However, one major limitation of the method is that it uses a single event node to represent segments across all time series. This design brings the combination explosion problem in pattern representation (mentioned in Section 1) and makes it hard to tackle even a moderate-scale problem. Our model is built upon the advances in dynamic graph neural networks. Most of the literature on dynamic graph networks assumes discrete-time graph dynamics where the graphs are represented as a sequence of snapshots [14, 15, 16]. Recent works have started exploring more flexible designs and assume edges can appear at any time [10, 17, 9, 18, 11]. Our solution is derived from a recent solution named Temporal Graph Networks (TGN) proposed by Rossi et al. [11]. The model is built upon the temporal graph attention network (TGAT) [10] but extended with a unique node-wise memory, which attempts to model the temporal dependency on the node level instead of the system level as in [5]. It allows us to model the dynamic inter-dependency in our bipartite event graph accurately, and provides us the flexibility to incorporate new nodes that have not been seen in training.

3. Problem definition

A multivariate time series is defined as X ∈ R^{T×d}, where T is the maximum number of time stamps in the time series and d is the dimension. The anomaly detection problem is to detect specific time stamps ṫ ∈ O in a multivariate time series where the time series behavior deviates from the normal patterns of the time series. Note that O contains timestamps that are marked as anomalies by domain experts. In case T is large, a common way to handle long time series is to use a sliding window approach. Let τ denote the length of the sliding window. We hereby formulate the problem as a binary classification problem with the objective of identifying time windows X^{[ṫ−τ:ṫ]×d} that contain anomaly time stamps. One solution to the multivariate time series anomaly detection problem is to convert time series into a homogeneous graph [7, 8], where G_e is defined on the representative patterns of a multivariate time series. A node of the graph v_e corresponds to a human-interpretable time series pattern p ∈ R^{τ×d}. The pattern should be representative enough so that the whole time series can be approximated by transitions of symbolic events (states) in the graph. The transitional relation between two sequential nodes (ending with stamps t and t+1) is defined as an edge in the graph. Many existing works (e.g., Bag of Patterns, Shapelets, sequence clustering) can be used to distill representative patterns (events) from time series, but most of them are only effective on univariate time series where d = 1.

4. Bipartite Event Stream

The core intuition behind the dynamic bipartite design is to decouple three key concepts in the traditional event graph, namely: where (which time series), when (at what time), and which (the event category). This practice helps us avoid the problem of exponential pattern combinations.

Specifically, the proposed algorithm is built upon a dynamic bipartite graph. We define it as a sequence of undirected bipartite event graphs G^t_B = {(v^t_m, v^t_e, A(v^t_m, v^t_e))}. Here t represents the time window in which the event graph is formulated. If the window size is set to 1, then the time window t is essentially the same as the time step ṫ. A time sequence index node v^t_m indicates "where" an event v^t_e is happening. An attributed edge A(v^t_m, v^t_e) that connects event node v^t_e and sequence index node v^t_m indicates an event v^t_e happened on time series m at time window t. For simplicity, we also denote the edge as A_{m,e} in the following sections. We employ an edge stream representation, so an edge A(v^t_m, v^t_e) is only constructed to represent a relation that actually existed. A major benefit of using the edge stream representation over an adjacency matrix is that it allows the graph structure to be more scalable and flexible, which provides us the generality to incorporate new events that have not appeared in training. Based on this bipartite graph structure, we convert the multivariate anomaly detection problem into a self-supervised edge prediction problem in a dynamic graph. Given the historical sequence of G^{(1:t)}_B and event nodes v^{t+1}_e in time window t+1, we predict the edges Â(v^{t+1}_m, v^{t+1}_e) in G^{t+1}_B. The anomaly is derived as a mismatching score between the predicted edge set Â(v^{t+1}_m, v^{t+1}_e) and the observed edge set A(v^{t+1}_m, v^{t+1}_e) with a read-out function r(·).

So the procedure of solving the anomaly detection problem becomes:

1. Given a normal sequence X_tr, identify representative patterns on each of the time sequences respectively in the multivariate time series.
2. With the identified representative pattern events, merge similar events across time series to indicate affinity relations between time series.
3. Build a sequence of bipartite event graphs for the multivariate time series (with a stride β) based on event matching.
4. Analyze the sequence of bipartite event graphs and derive a model Ψ to describe the intra- and inter-dependency relations.
5. Given a testing sequence X_te, repeat (3), and apply the model learned in (4) to predict Â at time window t*.
6. Derive anomaly scores based on the predicted Â and the original time series with a proposed readout function r(·).
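Steps (3) and (6) of the procedure above can be sketched in a few lines. This is an illustrative toy, not the paper's released code: the function names are hypothetical, nearest-centroid matching in Euclidean distance stands in for DTW-based event matching, and the read-out is a simple mismatch fraction rather than the learned r(·).

```python
# Illustrative sketch of steps (3) and (6): build per-window bipartite edges by
# nearest-event matching, then score a window by comparing predicted vs. observed
# edges. All names here are hypothetical placeholders.
import numpy as np

def build_edges(window, events):
    """window: (tau, D) multivariate segment; events: (K, tau) event centroids.
    Returns, for each series m, the index of its best-matching event node."""
    dists = np.linalg.norm(events[None, :, :] - window.T[:, None, :], axis=2)
    return dists.argmin(axis=1)  # shape (D,): one edge (m -> event) per series

def anomaly_score(pred_edges, obs_edges):
    """Toy read-out r(.): fraction of series whose predicted event mismatches."""
    return float(np.mean(pred_edges != obs_edges))

tau, D, K = 8, 3, 4
events = np.stack([np.full(tau, k, dtype=float) for k in range(K)])
window = np.tile(np.array([0.0, 1.0, 3.0]), (tau, 1))  # series resemble events 0, 1, 3
obs = build_edges(window, events)
assert obs.tolist() == [0, 1, 3]
assert anomaly_score(np.array([0, 1, 2]), obs) == 1/3  # one of three series mismatches
```

In the full system the predicted edges come from the dynamic graph model Ψ rather than from a heuristic, but the edge-stream bookkeeping has this shape.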
Figure 2: Events are represented by colorful grids in the event bars. In each event bar, the X axis corresponds to time and each row corresponds to a time series. One event bar summarizes all the events that happened on an individual time series. For each group of event bars, the two types of event bars correspond to the events generated by pattern matching and by residual computing. The predicted event bars are generated by the proposed system via forecasting the next connected event node for each time series. The anomaly score is generated based on the mismatch between the predicted events and the actual events.

5. Event Node Detection And Matching

We defined the event nodes on each time series separately and employed an unsupervised algorithm to identify the events. As we have no prior knowledge about what an anomaly pattern would look like in most real-world applications, the system is designed to learn the representative patterns and their correlations from the observed data.

Representative segment detection: We adopted Matrix Profile [19], a state-of-the-art unsupervised algorithm, to identify representative patterns from time sequences. Matrix Profile is able to identify the top-K repeated patterns in a time series with high accuracy and low computation time. We employed a recent implementation of matrix profile called SCRIMP++ [20] on each of the time series in X_tr, and the algorithm yields a list of single-dimensional representative patterns p_{m,k} (m ∈ {1:D}, k ∈ {1:K}), each with size τ.

Event generation: Many time series share similar representative patterns with each other. It is reasonable to merge similar segment patterns together across different time series to create the event nodes. To measure the similarity between time series, we employed dynamic time warping [21], which provides us with a robust distance matrix that is insensitive to small misalignments between two time series segments. After that, we employed H-DBSCAN [22] to cluster the patterns p_{m,k}. We select the centroid of each cluster to be an event, which is a representative segment across all time series within the cluster. Finally, we obtain an event set E_τ which contains all events, each with length τ.

Event matching: After extracting K events from the time series segments, we match each event against the original time series to identify where and when each event is taking place. We employed a highly efficient C language-based dynamic time warping algorithm [23] to conduct the event matching, which provides a (T − τ)∖β × D × K similarity tensor L_τ to indicate the similarity of each event to all the sliding windows with size τ and stride β in the input.

Event nodes created by pattern matching: For a single time stamp t, given L_τ, the event graph is created by selecting the best-matched event for each time series. The event node is then connected with the corresponding time series node v^t_m in the event graph G^t_B with an attributed edge A(v^t_e, v^t_m). By finding and linking the best-matched event for each time series, we add d event nodes into the graph.

Event nodes created by residual error: We noticed that even for the best-matched time series, the pattern matching still leaves small residual errors. We hereby define two general residual nodes which indicate whether the residual error of a time series is larger than a threshold θ. One residual event node, denoted as v^t_{e+}, indicates the residual error is larger than θ; the other, denoted as v^t_{e−}, indicates the residual error is equal to or smaller than θ. The parameter θ is automatically generated using the SPOT [24] algorithm, where we employ the whole training dataset for initialization and the testing data for on-going adaptation. All time series share these two residual nodes, as shown in Fig. 3.

After generating a sequence of bipartite event graphs G^{(1:t)}_B for the time series X_tr, a model Ψ is trained on G^{(1:t)}_B so that, given a testing sequence X_te, it is able to predict the connectivity of the event graph at any specific time stamp ṫ by observing the historical sequence of X_te. In Fig. 2, we visualize the events forecasted/detected on ten time series. The difference between the top and the middle groups results in the bottom group, where we can easily identify the anomaly locations.

Figure 3: Overview of the proposed approach. The time series nodes are represented by blue circles; the two types of event nodes (results from time series matching and residual computation) are represented by yellow and orange circles, respectively. The whole system is trained in an end-to-end manner.
5.1. Network Design With Node-wise Memory

For each pair of v^t_m and v^t_e in graph G^{(t)}_B, their node features are defined as v_m and v_e, respectively. We also define ε_{m,e} as the edge features between v_m and v_e. In order to support the dynamic inter-dependency graph, we avoid explicitly defining a static adjacency matrix to describe the relations between nodes. Instead, we adopt a state-message framework to model the node interactions. For each node v_m at time stamp t (here we use time series node m as an example; the same rule applies to the event node e), we define a state vector s_m(t) to represent its interaction history with other v_e nodes before t in a compressed format. By initiating s_m(0) as an all-zero vector, the interaction at time t is encoded with a message vector ϱ_m(t):

    ϱ_m(t) = [ε_{m,e}(t) || Δt || s_m(t⁻) || s_e(t⁻)]    (1)

where Δt is the time elapsed between the previous time stamp and t, and the symbol || denotes the concatenation operation. After aggregating all the messages from neighbors, the state vector of v_m is updated as:

    s_m(t) = mem(agg{ϱ_m(t_1), ..., ϱ_m(t_b)}, s_m(t⁻))    (2)

Here agg(·) is a general aggregation operation (which may carry learnable parameters); for the sake of simplicity, we only compute the mean of the most recent messages in aggregation. We use mem(·) to represent a trainable update function (e.g., a GRU).

Built upon the updated state vectors s_m(t) and s_e(t), a time-aware node embedding can be generated at any time t as follows:

    z_m(t) = Σ_{e ∈ n^k_m([0,t])} TGA(s_m(t), s_e(t), ε_{m,e}, v_m(t), v_e(t))    (3)

TGA represents the temporal graph attention module [11], where L graph attention layers compute m's embedding by aggregating information from its L-hop temporal neighbors, and n^k_m([0,t]) denotes the k temporal neighbors of node m before time t.

5.2. Optimization and inference

The aforementioned model is trained in a self-supervised fashion. Our goal is to predict the events that might happen in the time sequences in the next time step, which corresponds to the edges linking the time sequence nodes and event nodes. Therefore, we train a simple MLP model on the edge prediction task based on z_m(t). We use cross entropy to supervise the forecasting of which original and residual events would happen at a given time stamp t. It is a two-hot classification problem where the two activated nodes correspond to an event node and a residual event node.

Event Forecasting Score: To convert the predicted event edges Â into an anomaly score, for each time series node v^t_{m,τ} with window size τ, the event v^t_{e,τ} that has the highest probability of connecting to v^t_{m,τ} is retrieved, and its pattern in the original signal space is denoted as s^t_{e,τ}. After projecting the event's pattern back to the original signal space, we compute an anomaly score based on the dynamic time warping distance as follows:

    ω^t_{1,m} = DTW(X^t_{m,τ}, s^t_{e,τ})    (4)

Residual Score: For a positive residual event v^t_{e+} at time stamp t where the forecasted result is not a positive residual v̂^t_{e−}, we calculate a changing point score to quantify the surprisal level as follows:

    ω^t_{2,m} = ψ_NLG(||X^t_{m,τ} − X^{t−τ}_{m,τ}||)    (5)

where ψ_NLG is a standard function that maps a scalar into a negative log likelihood, which indicates the sparsity of this changing point in the training data. A frequently changing signal may result in a small ω^t_{2,m} after the mapping. The function is learned in a data-driven manner based on the training data of time series m. The final anomaly score at time stamp t is calculated with the following read-out function r(·):

    ω^t = Σ_m (ω^t_{1,m} · ω^t_{2,m})    (6)
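The read-out r(·) of Eqs. (4)-(6) can be sketched as follows. This is a hedged illustration: `psi_nlg` below is a stand-in Gaussian negative log likelihood rather than the exact data-driven mapping learned per series, and the forecasting scores `w1` are assumed to be precomputed DTW distances from Eq. (4).

```python
# Sketch of the read-out r(.) in Eqs. (4)-(6): combine, per series, a DTW-based
# forecasting score w1 with a changing-point score w2, then sum over series.
# psi_nlg is a hypothetical Gaussian NLL stand-in for the learned psi_NLG.
import numpy as np

def psi_nlg(x, mu, sigma):
    # negative log likelihood of a change magnitude under a fitted Gaussian:
    # rare (large) changes map to large scores, frequent ones to small scores
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))

def readout(w1, changes, mu, sigma):
    """w1: per-series forecasting scores (Eq. 4); changes: per-series change
    magnitudes ||X_t - X_{t-tau}|| fed to psi_NLG (Eq. 5). Returns Eq. (6)."""
    w2 = psi_nlg(np.asarray(changes), mu, sigma)
    return float(np.sum(np.asarray(w1) * w2))

w1 = [0.2, 0.1, 0.9]  # the third series matches its forecasted event poorly
score = readout(w1, changes=[0.1, 0.1, 2.0], mu=0.1, sigma=0.5)
# a large, rare change on the badly-forecasted series dominates the score:
assert score > readout(w1, changes=[0.1, 0.1, 0.1], mu=0.1, sigma=0.5)
```

The multiplicative combination means a series only contributes strongly when both the event forecast mismatches (large ω1) and the local change is statistically surprising (large ω2).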
6. Experiment

We performed experiments on two datasets to demonstrate the effectiveness of our model on multivariate anomaly detection. We adopted two public datasets: SMAP (Soil Moisture Active Passive satellite) [2] and SMD (Server Machine Dataset) [4]. We follow the evaluation protocol of [4]: an anomaly is classified as a good catch if it is triggered within any subset of a ground truth segment.

6.1. Settings

The proposed bipartite event graph contains D + K nodes, where D is the number of time series and K is the number of events (D equals 25 and 38 on the SMAP and SMD datasets; K equals 130 and 64 on the SMAP and SMD datasets). The detailed numbers of nodes are shown in Table 1. The length (τ) and stride (β) of the time window for each dataset are also displayed in Table 1. For each time series, we set the maximum number of motifs detected to three, and the minimum cluster size of H-DBSCAN to three. For the temporal attention model, we set the number of attention heads to 2. A GRU is employed to model the time encoding. The dimensions of the node embedding and the message vector are both set to 64. Each model is trained for 10 epochs with a learning rate of 0.0001. The node features of both time series and event nodes are randomly initialized, and the edge features are the one-hot embeddings of the numerical IDs of the nodes on both sides of the edge.

Table 1
Dataset statistics and hyper-parameters.
                              SMAP    SMD
Number of time series nodes     25     38
Number of event nodes          130     64
Time window length (τ)          20     50
Time window stride (β)           5     10

6.2. Compare with state-of-the-art

We compared our solution with multiple baselines on the SMAP and SMD datasets: DAGMM [25], an autoencoder-based anomaly detection model that does not take temporal information into account; LSTM-VAE [26] and LSTM-NDT [2], two LSTM-based anomaly detection solutions; and the most recent stochastic VAE-based approach (OmniAnomaly [4]) and graph attention-based method (MTAD-GAT [5]). We selected these baselines mainly because: (i) they are self-supervised algorithms that do not need any training labels (different from [8]); (ii) they rely on a single scale of time window (instead of multi-scale [27]) so that the performances are directly comparable.

Table 2
Comparison with current state-of-the-art approaches.
Dataset  Method        Precision  Recall  F1
SMD      DAGMM         59.51      88.82   70.94
         LSTM-VAE      79.22      70.75   78.42
         LSTM-NDT      56.84      64.38   60.37
         OmniAnomaly   83.34      94.49   88.57
         MTAD-GAT      -          -       -
         Event2Graph   88.61      83.38   84.93
SMAP     DAGMM         58.45      90.58   71.05
         LSTM-VAE      85.51      63.66   72.98
         LSTM-NDT      89.65      88.46   89.05
         OmniAnomaly   74.16      97.76   84.34
         MTAD-GAT      89.06      91.23   90.13
         Event2Graph   86.54      94.33   90.27

The results are reported in Table 2. From the results we observe that the proposed Event2Graph achieves competitive performance on the SMAP dataset and ranks 2nd on the SMD dataset. On the SMAP dataset, we observe that some of the data suffers from significant regime change during both training and testing, and our dynamic graph-based solution helps the algorithm adapt to the regime change faster than MTAD-GAT [5] and OmniAnomaly [4]. We also observe that our algorithm significantly outperforms simple LSTM-based solutions (LSTM-VAE and LSTM-NDT), which assume a complete inter-dependency graph. They did not model the dynamic inter-dependency among time series, while our node-level model explicitly encodes the temporal information along with the attention, which helps to reduce false alarms in anomaly detection. Furthermore, since DAGMM assumes a completely static relationship between time series, the algorithm lacks the capability to adapt to any temporally evolving pattern.

6.3. Ablation study

The objective of this experiment is to provide a detailed analysis of the effectiveness of each proposed module. By removing each critical module of our model in turn, we obtain the performances reported in Table 3.

Table 3
Ablation study on the SMD dataset.
Model                   Precision  Recall  F1
Event2Graph             88.61      83.38   84.93
w/o TGA                 84.81      82.94   82.51
w/o score ω1            82.83      80.29   79.45
w/o score ω2            81.51      71.76   73.81
w/o TGA and w/o ω1      80.56      81.09   79.47
w/o TGA and w/o ω2      75.84      64.33   64.51
ψ_NLG only              85.89      75.22   78.48

6.3.1. Effectiveness of temporal graph attention

By replacing the temporal graph attention with a simple MLP module, the model's F1 score is reduced by 2.42%. This demonstrates that the temporal attention plays an important role in aggregating information from neighboring nodes.

6.3.2. Effectiveness of event forecasting score ω1

We remove the event matching score (Eq. 4) and just use the residual score (Eq. 5) for anomaly detection. We observe that the model performance drops by 5.48%. This experiment indicates that the event forecasting module in Event2Graph is essential for accurate anomaly detection. From the experiment, we also observe that TGA plays a critical role in guaranteeing the quality of the forecasting score ω1: by removing TGA, the forecasting-based model performance (w/o score ω2) degrades from a 73.81% F1 score to 64.51%.
6.3.3. Effectiveness of residual score ω2

We remove the residual score (Eq. 5) and only use the forecasting score (Eq. 4) to detect anomalies. The model suffers more than 10% performance degradation. This result shows that modeling the residual score is essential to allow the system to identify anomaly regions that may not be well characterized by the event matching scores. We also observe that the TGA may not contribute much to residual event modeling: an ω2-only model performs competitively well with or without TGA. This phenomenon tells us that a residual event is something unexpected, and an accurate changing point detection is better than learning simple, repeatable patterns.

6.3.4. Effectiveness of log likelihood function ψ_NLG

To test this assumption, we compare the results with a baseline that uses a naive changing point detection score ψ_NLG, without adopting any forecasting or event modeling. Surprisingly, the changing-point-only model is able to achieve a 78.48% F1-score, outperforming DAGMM, LSTM-VAE, as well as LSTM-NDT. These results further confirm that it is hard to learn any static pattern using an LSTM or autoencoder for anomaly detection on a challenging dataset; a dynamic model that is able to adapt to the changing of time series patterns is preferred.

7. Conclusion

We proposed an event-driven bipartite graph solution for multivariate time series anomaly detection. The solution does not assume any fixed inter-dependency between time series, and all the relations are learned in a dynamic, data-driven manner. Our design is based on edge streams, so no adjacency matrix of the graph is required as input. As the system's memory is defined on the node level, our design leaves plenty of space for future extensions such as inductive learning and parallel computation. Our solution achieved very competitive results on two anomaly detection datasets, and we encourage future works to further explore bipartite event graphs for multivariate anomaly detection.

References

[1] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, G. Shroff, LSTM-based encoder-decoder for multi-sensor anomaly detection, CoRR abs/1607.00148.
[2] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, T. Söderström, Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding, KDD 2018.
[3] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. (1997).
[4] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, D. Pei, Robust anomaly detection for multivariate time series through stochastic recurrent neural network, KDD 2019.
[5] H. Zhao, Y. Wang, J. Duan, C. Huang, D. Cao, Y. Tong, B. Xu, J. Bai, J. Tong, Q. Zhang, Multivariate time-series anomaly detection via graph attention network, ICDM 2020.
[6] J. Chung, Ç. Gülçehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, CoRR.
[7] Z. Cheng, Y. Yang, W. Wang, W. Hu, Y. Zhuang, G. Song, Time2Graph: Revisiting time series modeling with dynamic shapelets, AAAI 2020.
[8] W. Hu, Y. Yang, Z. Cheng, C. Yang, X. Ren, Time-series event prediction with evolutionary state graph, WSDM 2021.
Li, J. T. Kwok, Timeseries anomaly detec- works for deep learning on dynamic graphs, CoRR, tion using temporal hierarchical one-class network, abs/2006.10637 (2020). NIPS’ 2020. [12] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. LiΓ², Y. Bengio, Graph attention networks, ICLR’ 2018. [13] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time series, AAAI’ 2021. [14] W. Yu, W. Cheng, C. C. Aggarwal, H. Chen, W. Wang, Link prediction with spatial and temporal consistency in dynamic networks, IJCAI’ 2017. [15] W. Yu, W. Cheng, C. C. Aggarwal, K. Zhang, H. Chen, W. Wang, Netwalk: A flexible deep embed- ding approach for anomaly detection in dynamic networks, KDD’ 2018. [16] A. Sankar, Y. Wu, L. Gou, W. Zhang, H. Yang, Dysat: Deep neural representation learning on dynamic graphs via self-attention networks, WSDM’, 2020. [17] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, S. Kim, Dynamic network embeddings: From random walks to temporal random walks, IEEE International Conference on Big Data, 2018. [18] R. Trivedi, M. Farajtabar, P. Biswal, H. Zha, Dyrep: Learning representations over dynamic graphs, ICLR’ 2019. [19] C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, E. J. Keogh, Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets, ICDM’ 2016. [20] Y. Zhu, C. M. Yeh, Z. Zimmerman, K. Kamgar, E. J. Keogh, Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds, ICDM’ 2018. [21] D. J. Berndt, J. Clifford, Using dynamic time warp- ing to find patterns in time series, AAAI’ (Work- shop) 1994. [22] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchi- cal density based clustering, J. Open Source Softw, 2017 (????). [23] A. Mueen, Y. Zhu, M. Yeh, K. Kamgar, K. Viswanathan, C. Gupta, E. Keogh, The fastest similarity search algorithm for time series subsequences under euclidean distance, 2017. [24] A. 
Siffer, P.-A. Fouque, A. Termier, C. Largouet, Anomaly detection in streams with extreme value