Event2Graph: Event-driven Bipartite Graph for Multivariate Time Series Forecasting and Anomaly Detection

Yuhang Wu1,*, Mengting Gu1, Lan Wang1, Yu-san Lin1, Fei Wang1 and Hao Yang1
1 Visa Research, 385 Sherman Ave., Palo Alto, CA, U.S.A.

Abstract
Modeling inter-dependencies between time series is the key to achieving high performance in anomaly detection for multivariate time series data. The de-facto solution to model the dependencies is to feed the data into a recurrent neural network (RNN). However, the fully connected network structure underneath the RNN (either GRU or LSTM) assumes a static and complete dependency graph between time series, which may not hold in many real-world applications. To alleviate this assumption, we propose a dynamic bipartite graph structure to encode the inter-dependencies between time series. More concretely, we model time series as one type of node, and the time series segments (regarded as events) as another type of node, where an edge between the two types of nodes describes a temporal pattern that occurred on a specific time series at a certain time. Based on this design, relations between time series can be explicitly modeled via dynamic connections to event nodes, and the multivariate time series anomaly detection problem can be formulated as a self-supervised, edge stream prediction problem in dynamic graphs. We conducted extensive experiments to demonstrate the effectiveness of the design.

Keywords
Time series, Anomaly detection, Dynamic graph

AMLTS'22: Workshop on Applied Machine Learning Methods for Time Series Forecasting, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
$ yuhawu@visa.com (Y. Wu); mengu@visa.com (M. Gu); lwang4@visa.com (L. Wang); yusalin@visa.com (Y. Lin); feiwang@visa.com (F. Wang); haoyang@visa.com (H. Yang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Detecting anomalies in time series data has been an important problem in the research community of data mining as well as in the finance industry. In many circumstances, anomaly patterns in multiple time series need to be taken into account together to disclose the full picture of the system. Previous works in multivariate anomaly detection mainly rely on recurrent neural networks (RNNs). Malhotra et al. [1] and Hundman et al. [2] employed LSTM models [3] to capture the temporal dependencies of multivariate signals and adopted prediction and reconstruction errors, respectively, to identify the anomaly patterns. Su et al. [4] improved the classical RNN model by modeling the probability distribution of time series via incorporating stochastic variables. To further model the correlations of time series explicitly, Zhao et al. [5] proposed a graph attention network to propagate information from different time series and aggregate the information together before feeding it into a GRU [6]. However, because of the underlying RNN structure, previous works assume a static, complete dependency graph among time series. These approaches may not perform well under regime change of time series, where the underlying inter-dependencies are different. Our work is built upon the recent wisdom of dynamic graph neural networks: we allow the connectivity of the graph to change dynamically at different time stamps based on the patterns observed on the time series.

In order to construct a dynamic graph on-the-fly, one important question is how to determine the connectivity of the graph at each timestamp. Our work is inspired by the recent progress of evolutionary event graphs [7, 8], where the nodes in the graph represent the time-series segments (events) and directed links represent the transitions of the segments (events). Compared to previous works, this line of research naturally models the time-varying relations among time series states via dynamic connections, and each state carries a physical meaning that is understandable by humans. However, one major limitation of [8] is that the event nodes employed in this work capture the information across all the time series. Assume there are K segment patterns in each time series, and the number of time series is D. The model would need K^D event nodes to represent the multivariate signals. This exponential number of event nodes strongly limits the information processing capability of the evolutionary state graph and therefore allows only a very small number of time series (D) to be analyzed in practice (four in the dataset used in the paper [8]).

To address the problem of the exponential number of combinations, we disentangle the time series nodes and the event nodes in our design, and model them as two types of nodes in a dynamic bipartite graph (as depicted in Fig. 1). Each event node only represents a time segment on one individual time series, instead of integrating patterns across all time series. The undirected connection between the two types of nodes indicates that event e happens on the d-th time series at time t. So the maximum number of edges in the graph is O(KD), which is much smaller than O(K^{2D}). To further improve the efficiency and generalizability of the algorithm, we built upon the recent advances in edge streams [9, 10, 11], where connections between nodes are modeled as incoming attributed edges instead of adjacency matrices. The complete system, named Event2Graph (Event-driven bipartite Graph), outperforms previous strong baselines on two public datasets, and we summarize our main contributions as follows: (i) We propose a bipartite event-graph-based system to analyze multivariate time series and model the interactions between time series and event segments via edge streams. (ii) The events employed in the system are highly interpretable and easy for humans to interact with. (iii) The system achieves competitive performance on challenging anomaly detection datasets through graph edge forecasting.

Figure 1: This toy example has three time series nodes and four event nodes. Time step t90 corresponds to an anomaly pattern while the other time stamps are normal. The relations between time series nodes and event nodes are highlighted: a green dashed line means a predicted relation, while a red line means an actual relation.

2. Related work

Static inter-dependency relation: In this category of approaches, relations (e.g., correlation) between multivariate time series are fixed once learned, and the model assumes all the time series can influence each other (complete inter-dependency). LSTM-based frameworks have been widely employed in this category of work. For instance, Malhotra et al. [1] proposed an LSTM-based autoencoder network to detect anomalies from multiple sensors; Su et al. [4] proposed a stochastic LSTM framework to model the data distribution of time series from a global perspective. As the RNN was originally proposed to model the temporal dependencies between different timestamps, and all the dimensions of the RNN input are used to describe a single concept (e.g., a word) all together, it was not tailored to model the inter-dependencies among variables by design. Recently, Zhao et al. [5] proposed a graph attention network-based (GAT) approach [12] where each time series is regarded as an individual node, and information is aggregated based on the underlying similarity of the signals. While this solution partially mitigates the problem by taking into account dynamic pairwise similarities between time series, the design still assumes a complete and static inter-dependency graph. Also, the processed information after GAT is simply aggregated all together and fed into a single GRU, so the module still suffers from similar problems as previous designs.

Dynamic inter-dependency relation: A few recent works have started to explore partial inter-dependency relations among time series. Deng et al. [13] constructed a neighborhood graph on-the-fly based on sensor embeddings to describe the dependencies between different sensors. The sparsity level of this incomplete graph can be customized by users, but the connectivity of the graph is fixed once constructed. Hu et al. [8] proposed the first dynamic dependency design in multivariate anomaly detection. In this design, nodes are interpretable time series states, and the transitions of these states are explicitly modeled by a dynamic graph neural network. This design allows the system to represent the time-varying relations among time series. However, one major limitation of the method is that it uses a single event node to represent segments across all time series. This design brings the combination explosion problem in pattern representation (mentioned in Section 1) and makes it hard to tackle even a moderate-scale problem. Our model is built upon the advances in dynamic graph neural networks. Most of the literature on dynamic graph networks assumes discrete-time graph dynamics where the graphs are represented as a sequence of snapshots [14, 15, 16]. Recent works have started exploring more flexible designs and assume edges can appear at any time [10, 17, 9, 18, 11]. Our solution is derived from a recent solution named Temporal Graph Networks (TGN) proposed by Rossi et al. [11]. The model is built upon the temporal graph attention network (TGAT) [10] but extended with a unique node-wise memory, which attempts to model the temporal dependency on the node level instead of the system level as in [5]. It allows us to model the dynamic inter-dependency in our bipartite event graph accurately, and provides us the flexibility to incorporate new nodes that have not been seen in training.

3. Problem definition

A multivariate time series is defined as X ∈ R^{T×d}, where T is the maximum number of time stamps in the time series and d is the dimension. The anomaly detection problem is to detect specific time stamps ṫ ∈ O in a multivariate time series where the time series behavior deviates from the normal patterns of the time series. Note that O contains timestamps that are marked as anomalies by domain experts. In case T is large, a common way to handle long time series is to use a sliding window approach. Let τ denote the length of the sliding window. We hereby formulate the problem as a binary classification problem with the objective of identifying time windows X^{[ṫ−τ:ṫ]×d} that contain anomaly time stamps. One solution to the multivariate time series anomaly detection problem is to convert time series into a homogeneous graph [7, 8], where G_e is defined on the representative patterns of a multivariate time series. A node of the graph v_e corresponds to a human-interpretable time series pattern p ∈ R^{τ×d}. The pattern should be representative enough so that the whole time series can be approximated by transitions of symbolic events (states) in the graph. The transitional relation between two sequential nodes (ending with stamps t and t+1) is defined as an edge in the graph. Many existing works (e.g., Bag of Patterns, Shapelets, sequence clustering) can be used to distill representative patterns (events) from time series, but most of them are only effective on univariate time series where d = 1.

4. Bipartite Event Stream

The core intuition behind the dynamic bipartite design is to decouple three key concepts in the traditional event graph, namely: where (which time series), when (at what time), and which (the event category). This practice helps us avoid the problem of exponential pattern combinations.

Specifically, the proposed algorithm is built upon a dynamic bipartite graph. We define it as a sequence of undirected bipartite event graphs G^t_B = {(v^t_m, v^t_e, A(v^t_m, v^t_e))}. Here t represents the time window in which the event graph is formulated. If the window size is set to 1, then the time window t is essentially the same as the time step ṫ. A time sequence index node v^t_m indicates "where" an event v^t_e is happening. An attributed edge A(v^t_m, v^t_e) that connects event node v^t_e and sequence index node v^t_m indicates an event v^t_e happened on time series m at time window t. For simplicity, we also denote the edge as A_{m,e} in the following sections. We employ an edge stream representation, so an edge A(v^t_m, v^t_e) is only constructed to represent a relation that actually existed. A major benefit of using the edge stream representation over an adjacency matrix is that it allows the graph structure to be more scalable and flexible, which provides us the generality to incorporate new events that have not appeared in training. Based on this bipartite graph structure, we convert the multivariate anomaly detection problem into a self-supervised edge prediction problem in a dynamic graph. Given the historical sequence of G^{(1:t)}_B and event nodes v^{t+1}_e in time window t+1, we predict the edges Â(v^{t+1}_m, v^{t+1}_e) in G^{t+1}_B. The anomaly is derived as a mismatching score between the predicted edge set Â(v^{t+1}_m, v^{t+1}_e) and the observed edge set A(v^{t+1}_m, v^{t+1}_e) with a read-out function r(·).

So the procedure of solving the anomaly detection problem becomes:

1. Given a normal sequence X_tr, identify representative patterns on each of the time sequences respectively in the multivariate time series.
2. With the identified representative pattern events, merge similar events across time series to indicate affinity relations between time series.
3. Build a sequence of bipartite event graphs for the multivariate time series (with a stride β) based on event matching.
4. Analyze the sequence of bipartite event graphs and derive a model Ψ to describe the intra- and inter-dependency relations.
5. Given a testing sequence X_te, repeat (3), and apply the model learned in (4) to predict Â at time window t*.
6. Derive anomaly scores based on the predicted Â and the original time series with a proposed readout function r(·).
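Steps (3) and (6) of the procedure above can be sketched in a few lines. This is an illustrative toy, not the paper's released code: the function names are hypothetical, nearest-centroid matching in Euclidean distance stands in for DTW-based event matching, and the read-out is a simple mismatch fraction rather than the learned r(·).

```python
# Illustrative sketch of steps (3) and (6): build per-window bipartite edges by
# nearest-event matching, then score a window by comparing predicted vs. observed
# edges. All names here are hypothetical placeholders.
import numpy as np

def build_edges(window, events):
    """window: (tau, D) multivariate segment; events: (K, tau) event centroids.
    Returns, for each series m, the index of its best-matching event node."""
    dists = np.linalg.norm(events[None, :, :] - window.T[:, None, :], axis=2)
    return dists.argmin(axis=1)  # shape (D,): one edge (m -> event) per series

def anomaly_score(pred_edges, obs_edges):
    """Toy read-out r(.): fraction of series whose predicted event mismatches."""
    return float(np.mean(pred_edges != obs_edges))

tau, D, K = 8, 3, 4
events = np.stack([np.full(tau, k, dtype=float) for k in range(K)])
window = np.tile(np.array([0.0, 1.0, 3.0]), (tau, 1))  # series resemble events 0, 1, 3
obs = build_edges(window, events)
assert obs.tolist() == [0, 1, 3]
assert anomaly_score(np.array([0, 1, 2]), obs) == 1/3  # one of three series mismatches
```

In the full system the predicted edges come from the dynamic graph model Ψ rather than from a heuristic, but the edge-stream bookkeeping has this shape.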
Figure 2: Events are represented by colorful grids in the event bars. In each event bar, the X axis corresponds to time and each row corresponds to a time series. One event bar summarizes all the events that happened on an individual time series. For each group of event bars, the two types of event bars correspond to the events generated by pattern matching and by residual computing. The predicted event bars are generated by the proposed system via forecasting the next connected event node for each time series. The anomaly score is generated based on the mismatch between the predicted events and the actual events.

5. Event Node Detection And Matching

We defined the event nodes on each time series separately and employed an unsupervised algorithm to identify the events. As we have no prior knowledge about what an anomaly pattern would look like in most real-world applications, the system is designed to learn the representative patterns and their correlations from the observed data.

Representative segment detection: We adopted Matrix Profile [19], a state-of-the-art unsupervised algorithm, to identify representative patterns from time sequences. Matrix Profile is able to identify the top-K repeated patterns in a time series with high accuracy and low computation time. We employed a recent implementation of matrix profile called SCRIMP++ [20] on each of the time series in X_tr, and the algorithm yields a list of single-dimensional representative patterns p_{m,k} (m ∈ {1:D}, k ∈ {1:K}), each with size τ.

Event generation: Many time series share similar representative patterns with each other. It is reasonable to merge similar segment patterns together across different time series to create the event nodes. To measure the similarity between time series, we employed dynamic time warping [21], which provides us with a robust distance matrix that is insensitive to small misalignments between two time series segments. After that, we employed H-DBSCAN [22] to cluster the patterns p_{m,k}. We select the centroid of each cluster to be an event, which is a representative segment across all time series within the cluster. Finally, we obtain an event set E_τ which contains all events, each with length τ.

Event matching: After extracting K events from the time series segments, we match each event against the original time series to identify where and when each event is taking place. We employed a highly efficient C language-based dynamic time warping algorithm [23] to conduct the event matching, which provides a (T − τ)∖β × D × K similarity tensor L_τ to indicate the similarity of each event to all the sliding windows with size τ and stride β in the input.

Event nodes created by pattern matching: For a single time stamp t, given L_τ, the event graph is created by selecting the best-matched event for each time series. The event node is then connected with the corresponding time series node v^t_m in the event graph G^t_B with an attributed edge A(v^t_e, v^t_m). By finding and linking the best-matched event for each time series, we add d event nodes into the graph.

Event nodes created by residual error: We noticed that even for the best-matched time series, the pattern matching still leaves small residual errors. We hereby define two general residual nodes which indicate whether the residual error of a time series is larger than a threshold θ. One residual event node, denoted as v^t_{e+}, indicates the residual error is larger than θ; the other, denoted as v^t_{e−}, indicates the residual error is equal to or smaller than θ. The parameter θ is automatically generated using the SPOT [24] algorithm, where we employ the whole training dataset for initialization and the testing data for on-going adaptation. All time series share these two residual nodes, as shown in Fig. 3.

After generating a sequence of bipartite event graphs G^{(1:t)}_B for the time series X_tr, a model Ψ is trained on G^{(1:t)}_B so that, given a testing sequence X_te, it is able to predict the connectivity of the event graph at any specific time stamp ṫ by observing the historical sequence of X_te. In Fig. 2, we visualize the events forecasted/detected on ten time series. The difference between the top and the middle groups results in the bottom group, where we can easily identify the anomaly locations.

Figure 3: Overview of the proposed approach. The time series nodes are represented by blue circles; the two types of event nodes (results from time series matching and residual computation) are represented by yellow and orange circles, respectively. The whole system is trained in an end-to-end manner.
5.1. Network Design With Node-wise Memory

For each pair of v^t_m and v^t_e in graph G^{(t)}_B, their node features are defined as v_m and v_e, respectively. We also define ε_{m,e} as the edge features between v_m and v_e. In order to support the dynamic inter-dependency graph, we avoid explicitly defining a static adjacency matrix to describe the relations between nodes. Instead, we adopt a state-message framework to model the node interactions. For each node v_m at time stamp t (here we use time series node m as an example; the same rule applies to the event node e), we define a state vector s_m(t) to represent its interaction history with other v_e nodes before t in a compressed format. By initiating s_m(0) as an all-zero vector, the interaction at time t is encoded with a message vector ϱ_m(t):

    ϱ_m(t) = [ε_{m,e}(t) || Δt || s_m(t⁻) || s_e(t⁻)]    (1)

where Δt is the time elapsed between the previous time stamp and t, and the symbol || denotes the concatenation operation. After aggregating all the messages from neighbors, the state vector of v_m is updated as:

    s_m(t) = mem(agg{ϱ_m(t_1), ..., ϱ_m(t_b)}, s_m(t⁻))    (2)

Here agg(·) is a general aggregation operation (which may carry learnable parameters); for the sake of simplicity, we only compute the mean of the most recent messages in aggregation. We use mem(·) to represent a trainable update function (e.g., a GRU).

Built upon the updated state vectors s_m(t) and s_e(t), a time-aware node embedding can be generated at any time t as follows:

    z_m(t) = Σ_{e ∈ n^k_m([0,t])} TGA(s_m(t), s_e(t), ε_{m,e}, v_m(t), v_e(t))    (3)

TGA represents the temporal graph attention module [11], where L graph attention layers compute m's embedding by aggregating information from its L-hop temporal neighbors, and n^k_m([0,t]) denotes the k temporal neighbors of node m before time t.

5.2. Optimization and inference

The aforementioned model is trained in a self-supervised fashion. Our goal is to predict the events that might happen in the time sequences in the next time step, which corresponds to the edges linking the time sequence nodes and event nodes. Therefore, we train a simple MLP model on the edge prediction task based on z_m(t). We use cross entropy to supervise the forecasting of which original and residual events would happen at a given time stamp t. It is a two-hot classification problem where the two activated nodes correspond to an event node and a residual event node.

Event Forecasting Score: To convert the predicted event edges Â into an anomaly score, for each time series node v^t_{m,τ} with window size τ, the event v^t_{e,τ} that has the highest probability of connecting to v^t_{m,τ} is retrieved, and its pattern in the original signal space is denoted as s^t_{e,τ}. After projecting the event's pattern back to the original signal space, we compute an anomaly score based on the dynamic time warping distance as follows:

    ω^t_{1,m} = DTW(X^t_{m,τ}, s^t_{e,τ})    (4)

Residual Score: For a positive residual event v^t_{e+} at time stamp t where the forecasted result is not a positive residual v̂^t_{e−}, we calculate a changing point score to quantify the surprisal level as follows:

    ω^t_{2,m} = ψ_NLG(||X^t_{m,τ} − X^{t−τ}_{m,τ}||)    (5)

where ψ_NLG is a standard function that maps a scalar into a negative log likelihood, which indicates the sparsity of this changing point in the training data. A frequently changing signal may result in a small ω^t_{2,m} after the mapping. The function is learned in a data-driven manner based on the training data of time series m. The final anomaly score at time stamp t is calculated with the following read-out function r(·):

    ω^t = Σ_m (ω^t_{1,m} · ω^t_{2,m})    (6)
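The read-out r(·) of Eqs. (4)-(6) can be sketched as follows. This is a hedged illustration: `psi_nlg` below is a stand-in Gaussian negative log likelihood rather than the exact data-driven mapping learned per series, and the forecasting scores `w1` are assumed to be precomputed DTW distances from Eq. (4).

```python
# Sketch of the read-out r(.) in Eqs. (4)-(6): combine, per series, a DTW-based
# forecasting score w1 with a changing-point score w2, then sum over series.
# psi_nlg is a hypothetical Gaussian NLL stand-in for the learned psi_NLG.
import numpy as np

def psi_nlg(x, mu, sigma):
    # negative log likelihood of a change magnitude under a fitted Gaussian:
    # rare (large) changes map to large scores, frequent ones to small scores
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))

def readout(w1, changes, mu, sigma):
    """w1: per-series forecasting scores (Eq. 4); changes: per-series change
    magnitudes ||X_t - X_{t-tau}|| fed to psi_NLG (Eq. 5). Returns Eq. (6)."""
    w2 = psi_nlg(np.asarray(changes), mu, sigma)
    return float(np.sum(np.asarray(w1) * w2))

w1 = [0.2, 0.1, 0.9]  # the third series matches its forecasted event poorly
score = readout(w1, changes=[0.1, 0.1, 2.0], mu=0.1, sigma=0.5)
# a large, rare change on the badly-forecasted series dominates the score:
assert score > readout(w1, changes=[0.1, 0.1, 0.1], mu=0.1, sigma=0.5)
```

The multiplicative combination means a series only contributes strongly when both the event forecast mismatches (large ω1) and the local change is statistically surprising (large ω2).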
6. Experiment

We performed experiments on two datasets to demonstrate the effectiveness of our model on multivariate anomaly detection. We adopted two public datasets: SMAP (Soil Moisture Active Passive satellite) [2] and SMD (Server Machine Dataset) [4]. We follow the evaluation protocol of [4]: an anomaly is classified as a good catch if it is triggered within any subset of a ground truth segment.

6.1. Settings

The proposed bipartite event graph contains D + K nodes, where D is the number of time series and K is the number of events (D equals 25 and 38 on the SMAP and SMD datasets; K equals 130 and 64 on the SMAP and SMD datasets). The detailed numbers of nodes are shown in Table 1. The length (τ) and stride (β) of the time window for each dataset are also displayed in Table 1. For each time series, we set the maximum number of motifs detected to three, and the minimum cluster size of H-DBSCAN to three. For the temporal attention model, we set the number of attention heads to 2. A GRU is employed to model the time encoding. The dimensions of the node embedding and the message vector are both set to 64. Each model is trained for 10 epochs with a learning rate of 0.0001. The node features of both time series and event nodes are randomly initialized, and the edge features are the one-hot embeddings of the numerical IDs of the nodes on both sides of the edge.

Table 1
Dataset statistics and hyper-parameters.
                              SMAP    SMD
Number of time series nodes     25     38
Number of event nodes          130     64
Time window length (τ)          20     50
Time window stride (β)           5     10

6.2. Compare with state-of-the-art

We compared our solution with multiple baselines on the SMAP and SMD datasets: DAGMM [25], an autoencoder-based anomaly detection model that does not take temporal information into account; LSTM-VAE [26] and LSTM-NDT [2], two LSTM-based anomaly detection solutions; and the most recent stochastic VAE-based approach (OmniAnomaly [4]) and graph attention-based method (MTAD-GAT [5]). We selected these baselines mainly because: (i) they are self-supervised algorithms that do not need any training labels (different from [8]); (ii) they rely on a single scale of time window (instead of multi-scale [27]) so that the performances are directly comparable.

Table 2
Comparison with current state-of-the-art approaches.
Dataset  Method        Precision  Recall  F1
SMD      DAGMM         59.51      88.82   70.94
         LSTM-VAE      79.22      70.75   78.42
         LSTM-NDT      56.84      64.38   60.37
         OmniAnomaly   83.34      94.49   88.57
         MTAD-GAT      -          -       -
         Event2Graph   88.61      83.38   84.93
SMAP     DAGMM         58.45      90.58   71.05
         LSTM-VAE      85.51      63.66   72.98
         LSTM-NDT      89.65      88.46   89.05
         OmniAnomaly   74.16      97.76   84.34
         MTAD-GAT      89.06      91.23   90.13
         Event2Graph   86.54      94.33   90.27

The results are reported in Table 2. From the results we observe that the proposed Event2Graph achieves competitive performance on the SMAP dataset and ranks 2nd on the SMD dataset. On the SMAP dataset, we observe that some of the data suffers from significant regime change during both training and testing, and our dynamic graph-based solution helps the algorithm adapt to the regime change faster than MTAD-GAT [5] and OmniAnomaly [4]. We also observe that our algorithm significantly outperforms simple LSTM-based solutions (LSTM-VAE and LSTM-NDT), which assume a complete inter-dependency graph. They did not model the dynamic inter-dependency among time series, while our node-level model explicitly encodes the temporal information along with the attention, which helps to reduce false alarms in anomaly detection. Furthermore, since DAGMM assumes a completely static relationship between time series, the algorithm lacks the capability to adapt to any temporally evolving pattern.

6.3. Ablation study

The objective of this experiment is to provide a detailed analysis of the effectiveness of each proposed module. By removing each critical module of our model in turn, we obtain the performances reported in Table 3.

Table 3
Ablation study on the SMD dataset.
Model                   Precision  Recall  F1
Event2Graph             88.61      83.38   84.93
w/o TGA                 84.81      82.94   82.51
w/o score ω1            82.83      80.29   79.45
w/o score ω2            81.51      71.76   73.81
w/o TGA and w/o ω1      80.56      81.09   79.47
w/o TGA and w/o ω2      75.84      64.33   64.51
ψ_NLG only              85.89      75.22   78.48

6.3.1. Effectiveness of temporal graph attention

By replacing the temporal graph attention with a simple MLP module, the model's F1 score is reduced by 2.42%. This demonstrates that the temporal attention plays an important role in aggregating information from neighboring nodes.

6.3.2. Effectiveness of event forecasting score ω1

We remove the event matching score (Eq. 4) and just use the residual score (Eq. 5) for anomaly detection. We observe that the model performance drops by 5.48%. This experiment indicates that the event forecasting module in Event2Graph is essential for accurate anomaly detection. From the experiment, we also observe that TGA plays a critical role in guaranteeing the quality of the forecasting score ω1: by removing TGA, the forecasting-based model performance (w/o score ω2) degrades from a 73.81% F1 score to 64.51%.
6.3.3. Effectiveness of residual score ω2

We remove the residual score (Eq. 5) and only use the forecasting score (Eq. 4) to detect anomalies. The model suffers more than 10% performance degradation. This result shows that modeling the residual score is essential to allow the system to identify anomaly regions that may not be well characterized by the event matching scores. We also observe that the TGA may not contribute much to residual event modeling: an ω2-only model performs competitively well with or without TGA. This phenomenon tells us that a residual event is something unexpected, and an accurate changing point detection is better than learning simple, repeatable patterns.

6.3.4. Effectiveness of log likelihood function ψ_NLG

To test this assumption, we compare the results with a baseline that uses a naive changing point detection score ψ_NLG, without adopting any forecasting or event modeling. Surprisingly, the changing-point-only model is able to achieve a 78.48% F1-score, outperforming DAGMM, LSTM-VAE, as well as LSTM-NDT. These results further confirm that it is hard to learn any static pattern using an LSTM or autoencoder for anomaly detection on a challenging dataset; a dynamic model that is able to adapt to the changing of time series patterns is preferred.

7. Conclusion

We proposed an event-driven bipartite graph solution for multivariate time series anomaly detection. The solution does not assume any fixed inter-dependency between time series, and all the relations are learned in a dynamic, data-driven manner. Our design is based on edge streams, so no adjacency matrix of the graph is required as input. As the system's memory is defined on the node level, our design leaves plenty of space for future extensions such as inductive learning and parallel computation. Our solution achieved very competitive results on two anomaly detection datasets, and we encourage future works to further explore bipartite event graphs for multivariate anomaly detection.

References

[1] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, G. Shroff, LSTM-based encoder-decoder for multi-sensor anomaly detection, CoRR abs/1607.00148.
[2] K. Hundman, V. Constantinou, C. Laporte, I. Colwell, T. Söderström, Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding, KDD 2018.
[3] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. (1997).
[4] Y. Su, Y. Zhao, C. Niu, R. Liu, W. Sun, D. Pei, Robust anomaly detection for multivariate time series through stochastic recurrent neural network, KDD 2019.
[5] H. Zhao, Y. Wang, J. Duan, C. Huang, D. Cao, Y. Tong, B. Xu, J. Bai, J. Tong, Q. Zhang, Multivariate time-series anomaly detection via graph attention network, ICDM 2020.
[6] J. Chung, Ç. Gülçehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, CoRR.
[7] Z. Cheng, Y. Yang, W. Wang, W. Hu, Y. Zhuang, G. Song, Time2Graph: Revisiting time series modeling with dynamic shapelets, AAAI 2020.
[8] W. Hu, Y. Yang, Z. Cheng, C. Yang, X. Ren, Time-series event prediction with evolutionary state graph, WSDM 2021.
Li, J. T. Kwok, Timeseries anomaly detec- works for deep learning on dynamic graphs, CoRR, tion using temporal hierarchical one-class network, abs/2006.10637 (2020). NIPS’ 2020. [12] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. LiΓ², Y. Bengio, Graph attention networks, ICLR’ 2018. [13] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time series, AAAI’ 2021. [14] W. Yu, W. Cheng, C. C. Aggarwal, H. Chen, W. Wang, Link prediction with spatial and temporal consistency in dynamic networks, IJCAI’ 2017. [15] W. Yu, W. Cheng, C. C. Aggarwal, K. Zhang, H. Chen, W. Wang, Netwalk: A flexible deep embed- ding approach for anomaly detection in dynamic networks, KDD’ 2018. [16] A. Sankar, Y. Wu, L. Gou, W. Zhang, H. Yang, Dysat: Deep neural representation learning on dynamic graphs via self-attention networks, WSDM’, 2020. [17] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, S. Kim, Dynamic network embeddings: From random walks to temporal random walks, IEEE International Conference on Big Data, 2018. [18] R. Trivedi, M. Farajtabar, P. Biswal, H. Zha, Dyrep: Learning representations over dynamic graphs, ICLR’ 2019. [19] C. M. Yeh, Y. Zhu, L. Ulanova, N. Begum, Y. Ding, H. A. Dau, D. F. Silva, A. Mueen, E. J. Keogh, Matrix profile I: all pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets, ICDM’ 2016. [20] Y. Zhu, C. M. Yeh, Z. Zimmerman, K. Kamgar, E. J. Keogh, Matrix profile XI: SCRIMP++: time series motif discovery at interactive speeds, ICDM’ 2018. [21] D. J. Berndt, J. Clifford, Using dynamic time warp- ing to find patterns in time series, AAAI’ (Work- shop) 1994. [22] L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchi- cal density based clustering, J. Open Source Softw, 2017 (????). [23] A. Mueen, Y. Zhu, M. Yeh, K. Kamgar, K. Viswanathan, C. Gupta, E. Keogh, The fastest similarity search algorithm for time series subsequences under euclidean distance, 2017. [24] A. 
Siffer, P.-A. Fouque, A. Termier, C. Largouet, Anomaly detection in streams with extreme value