SpoT-Mamba: Learning Long-Range Dependency on Spatio-Temporal Graphs with Selective State Spaces. CEUR-WS Vol-3827, paper 6. PDF: https://ceur-ws.org/Vol-3827/paper6.pdf; DBLP: https://dblp.org/rec/conf/strl/ChoiKAW24
                         SpoT-Mamba: Learning Long-Range Dependency on
                         Spatio-Temporal Graphs with Selective State Spaces
                         Jinhyeok Choi1 , Heehyeon Kim1 , Minhyeong An1 and Joyce Jiyoung Whang1,*
1 School of Computing, KAIST, Daejeon, Republic of Korea


                                             Abstract
                                             Spatio-temporal graph (STG) forecasting is a critical task with extensive applications in the real world, including traffic and weather
                                             forecasting. Although several recent methods have been proposed to model complex dynamics in STGs, addressing long-range spatio-
                                             temporal dependencies remains a significant challenge, leading to limited performance gains. Inspired by a recently proposed state space
                                             model named Mamba, which has shown remarkable capability of capturing long-range dependency, we propose a new STG forecasting
                                             framework named SpoT-Mamba. SpoT-Mamba generates node embeddings by scanning various node-specific walk sequences. Based on
the node embeddings, it conducts temporal scans to capture long-range spatio-temporal dependencies. Experimental results on a
                                             real-world traffic forecasting dataset demonstrate the effectiveness of SpoT-Mamba.

                                             Keywords
                                             Spatio-Temporal Graphs, Traffic Forecasting, Selective State Spaces, Random Walks



                         1. Introduction                                                                                                     including language, audio, and genomics. In addition, there
                                                                                                                                             have been several studies towards replacing the transformer
Predictive learning methods on time series play a cru-                                                          with Mamba in graph transformer frameworks [20, 21, 22].
                         cial role in diverse applications, such as traffic and weather                                                         In this paper, our focus lies on the predictive learning task
                         forecasting. The intricate relationships and dynamic na-                                                            on STGs, specifically STG forecasting. For STG forecasting,
                         ture of time series are often represented as graphs, specifi-                                                       it is vital to capture the evolving behavior of individual
                         cally spatio-temporal graphs (STGs), where node attributes                                                          nodes over time and how these changes propagate through-
                         evolve over time. Recently, spatio-temporal graph neu-                                                              out the entire graph. Furthermore, leveraging these dynam-
                         ral networks (STGNNs) have emerged as a powerful tool                                                               ics over long spatial and temporal ranges plays a crucial
                         for capturing both spatial and temporal dependencies in                                                             role in dealing with the intricate spatio-temporal correla-
                         STGs [1, 2, 3]. Many of those methods employ graph neu-                                                             tions in STGs [1, 3, 23]. Building upon these insights and
                         ral networks (GNNs) to exploit spatial dependencies inher-                                                          recent advances, we introduce SpoT-Mamba, a new Spatio-
                         ent in the graph structures, integrating them with recur-                                                           Temporal graph forecasting framework with a Mamba-
                         rent units or convolutions to capture temporal dependen-                                                            based sequence modeling architecture. With Mamba blocks,
                         cies [2, 4, 5, 6, 7, 8, 9]. These approaches have facilitated                                                       SpoT-Mamba extracts structural information of each node by
capturing spatio-temporal dependencies within STGs.                                                          scanning multi-way walk sequences and effectively captures
                         Despite their remarkable performance in predictive learn-                                                           long-range temporal dependencies with temporal scans. Ex-
                         ing tasks, they often face challenges in handling long-range                                                        periments on the real-world dataset demonstrate that SpoT-
                         temporal dependencies among different time steps [10, 11].                                                          Mamba achieves promising performance in STG forecasting.
                            STGs often exhibit repetitive patterns over both short and                                                       The official implementations of SpoT-Mamba are available
                         long periods, which is critical for precise predictions [12, 13].                                                   at https://github.com/bdi-lab/SpoT-Mamba.
                         Therefore, several methods have adopted self-attention
                         mechanisms of transformer layers [14] rather than recurrent
                         units to enhance their capability in exploiting global tempo-                                                       2. Preliminaries
                         ral information [10, 11, 15]. However, the significant compu-
                         tational overhead and complexity of attention mechanisms                                                            State Space Model (SSM) The state space model (SSM)
                         are being highlighted as major concerns [10, 12, 16, 17].                                                           assumes that dynamic systems can be represented by their
   Meanwhile, structured state space sequence (S4) models           states at time step t [18]. SSM defines the evolution of
have emerged as a promising approach for sequence mod-              a dynamic system’s state with two equations: h′(t) =
eling with linear scaling in sequence length [18]. Those            Ah(t) + Bx(t) and y(t) = Ch(t) + Dx(t), where
models combine the advantages of recurrent neural networks          h(t) ∈ R^D denotes the latent state, x(t) ∈ R represents
and convolutional neural networks, enabling them to han-            the input signal, y(t) ∈ R denotes the output signal, and
dle long-range dependencies without relying on attention.           A ∈ R^{D×D}, B ∈ R^{D×1}, C ∈ R^{1×D}, and D ∈ R are
However, due to their inability to select information depend-       learnable parameters. SSM learns how to transform the in-
ing on input data, they have shown limited performance.             put signal x(t) into the latent state h(t), which is used to
   A recent study has introduced a new S4 model overcom-            model the system dynamics and predict its output y(t).
ing this issue, named Mamba, which introduces a selection
                         mechanism to filter information in an input-dependent man-                                                          Discretized SSM To adapt SSM for discrete input se-
                         ner [19]. Mamba has demonstrated notable performance                                                                quences instead of continuous signals, discretization is ap-
over transformers across various types of sequence data,            plied with a step size Δ. The discretized SSM is defined in a
                                                                    recurrent form: h_t = Ah_{t−1} + Bx_t and y_t = Ch_t, where
                         STRL’24: Third International Workshop on Spatio-Temporal Reasoning                                                  A and B are approximated learnable parameters using a
and Learning, 5 August 2024, Jeju, South Korea                      bilinear method with a step size Δ [18]. The term D is
* Corresponding author.
                                                                                                                                             omitted from the equations as it can be considered as a skip
cjh0507@kaist.ac.kr (J. Choi); heehyeon@kaist.ac.kr (H. Kim);
                         amh0360@kaist.ac.kr (M. An); jjwhang@kaist.ac.kr (J. J. Whang)                                                      connection. This formulation allows for capturing temporal
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
dependencies efficiently, resulting in similar computations         3. Spatio-Temporal Graph
to those in recurrent neural networks.
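The discretized recurrence above can be sketched in a few lines of NumPy; the parameter values below are illustrative toy values, not anything trained or taken from the paper:

```python
import numpy as np

def ssm_recurrent_scan(A_bar, B_bar, C, x):
    """Recurrent form of the discretized SSM:
    h_t = A_bar @ h_{t-1} + B_bar * x_t,  y_t = C @ h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                      # one recurrent step per input token
        h = A_bar @ h + (B_bar * x_t).ravel()
        ys.append(float(C @ h))
    return np.array(ys)

# Toy, untrained parameters for a state of size D = 4.
rng = np.random.default_rng(0)
D = 4
A_bar = 0.9 * np.eye(D)               # stable transition matrix
B_bar = rng.standard_normal((D, 1))
C = rng.standard_normal((1, D))
x = rng.standard_normal(16)           # scalar input sequence, length 16
y = ssm_recurrent_scan(A_bar, B_bar, C, x)
assert y.shape == (16,)
```

Because the scan is linear in the input, doubling x doubles y, which is a quick sanity check for any implementation of this form.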
   Meanwhile, due to its linear time-invariant (LTI) property,
                                                                       Forecasting with SpoT-Mamba
SSM can be reformulated as a discrete convolution: y = K ∗ x        We propose SpoT-Mamba (Figure 1), which captures spa-
and K = (CB, CAB, . . . , CA^{L−1}B) ∈ R^L, where x ∈               tial dependencies from node-specific walk sequences and
R^L denotes the input sequence, y ∈ R^L denotes the output          learns temporal dependencies across time steps leveraging
sequence, ∗ indicates the convolution operation, and L is           Mamba-based sequence modeling. By utilizing the selection
the sequence length. This representation facilitates parallel       mechanisms of Mamba blocks, SpoT-Mamba can selectively
training for SSM, thereby enhancing training efficiency.            propagate or forget information in an input-dependent man-
   The recurrent and convolutional representations of SSM           ner on both the temporal and spatial domains.
for sequence modeling enable parallel training and linear
scaling in sequence length. To further enhance the com-             Multi-way Walk Sequence In STGs, the temporal se-
putational complexity of SSM, the structured state space            quences for nodes are naturally defined by the time-series
sequence (S4) models have been proposed [18]. S4 models             data. On the other hand, since the topological structure
address the fundamental bottleneck of SSM, which involves           does not have a specific order, a tailored method is required
repeated matrix multiplications, by employing a low-rank            to define the spatial sequences of nodes in graphs. Hence,
correction to stably diagonalize the transition matrix A.           we employ three well-known walk algorithms: depth-first
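The equivalence between the recurrent form and the convolutional form y = K ∗ x can be checked numerically. A minimal sketch with toy parameters (scalar inputs, a diagonal transition matrix for simplicity):

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, L):
    """Unrolled SSM kernel K = (CB, CAB, ..., CA^{L-1}B)."""
    taps, AkB = [], B_bar.copy()
    for _ in range(L):
        taps.append(float(C @ AkB))   # k-th tap: C A^k B
        AkB = A_bar @ AkB
    return np.array(taps)

def ssm_conv(A_bar, B_bar, C, x):
    """Convolutional form y = K * x, truncated to the input length (causal)."""
    K = ssm_kernel(A_bar, B_bar, C, len(x))
    return np.convolve(x, K)[: len(x)]

def ssm_recur(A_bar, B_bar, C, x):
    """Reference recurrent form, used here to verify the equivalence."""
    h, ys = np.zeros(A_bar.shape[0]), []
    for x_t in x:
        h = A_bar @ h + (B_bar * x_t).ravel()
        ys.append(float(C @ h))
    return np.array(ys)

rng = np.random.default_rng(1)
D, L = 4, 32
A_bar = 0.8 * np.eye(D)               # diagonal, stable; toy values only
B_bar = rng.standard_normal((D, 1))
C = rng.standard_normal((1, D))
x = rng.standard_normal(L)
assert np.allclose(ssm_conv(A_bar, B_bar, C, x), ssm_recur(A_bar, B_bar, C, x))
```

The convolutional view is what enables parallel training, while the recurrent view gives constant-memory inference; both compute the same outputs for an LTI model.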
                                                                    search (DFS), breadth-first search (BFS), and random walks
Mamba S4 models have demonstrated remarkable per-                   (RW), to extract diverse local and global structural informa-
formance in handling long-range dependencies in contin-             tion from each node’s neighborhood. The walk sequences
uous signal data, such as audio and time series. However,           of length K for node v_i using these walk algorithms are
S4 models struggle with effectively handling discrete and           defined as W_BFS(i), W_DFS(i), and W_RW(i), respectively.
information-dense data such as text [19]. This limitation           These node-specific walk sequences are extracted M times
arises from the LTI property inherent in the convolutional          to exploit more comprehensive structural information.
form of SSMs. While the LTI property enables linear time se-
quence modeling for S4 models, it requires that the learnable       Walk Sequence Embedding SpoT-Mamba generates em-
matrices A, B, and C, as well as the step size Δ, remain            through Mamba blocks, which makes the model robust to
unchanged across all time steps. Consequently, S4 models            sequence. Here, SpoT-Mamba performs bi-directional scans
cannot selectively recall previous tokens or combine the            through Mamba blocks, which makes the model robust to
current token, treating each token in the input sequence            permutations and captures the long-range spatial depen-
uniformly. In contrast, Transformers dynamically adjust at-         dency of the sequence more effectively [21]. Then, SpoT-
tention scores based on the input sequence, allowing them           Mamba aggregates the representations of node-specific walk
to effectively focus on different parts of the sequence [14].       sequences with pointwise convolution. This allows for incor-
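The three walk generators used for the multi-way walk sequences (BFS, DFS, and uniform random walks of length K) can be sketched over a plain adjacency list as follows; padding short sequences by repeating the start node is our assumption here, since the truncation/padding scheme is not specified:

```python
import random
from collections import deque

def bfs_walk(adj, start, K):
    """Breadth-first visiting order from `start`, truncated/padded to length K."""
    seen, order, q = {start}, [], deque([start])
    while q and len(order) < K:
        v = q.popleft()
        order.append(v)
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                q.append(u)
    return (order + [start] * K)[:K]

def dfs_walk(adj, start, K):
    """Depth-first visiting order from `start`, truncated/padded to length K."""
    seen, order, stack = set(), [], [start]
    while stack and len(order) < K:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(reversed(adj[v]))
    return (order + [start] * K)[:K]

def random_walk(adj, start, K, rng=random):
    """Uniform random walk of length K from `start`."""
    walk = [start]
    while len(walk) < K:
        nbrs = adj[walk[-1]]
        walk.append(rng.choice(nbrs) if nbrs else walk[-1])
    return walk

# Each of W_BFS(i), W_DFS(i), W_RW(i) would be sampled M times per node.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
assert bfs_walk(adj, 0, 5) == [0, 1, 2, 3, 0]
assert dfs_walk(adj, 0, 5) == [0, 1, 3, 2, 0]
```

BFS captures local neighborhood structure, DFS and random walks reach further into the graph, which is why combining the three yields both local and global structural views.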
   To address both the lack of selectivity in S4 models and         porating representations of neighboring nodes to generate
the efficiency bottleneck in sequence modeling, a recent            walk sequence embedding for each type of walk sequence.
study introduced a new S4 model called Mamba, which                 embeddings into a node embedding w_i ∈ R^D using a Multi-
removes the LTI constraints [19]. Mamba incorporates a              Layer Perceptron (MLP), where i represents the node index
selection mechanism that allows its learnable parameters to         and D denotes the embedding dimension. Rather than sim-
dynamically interact with the input sequence. This mech-            and 𝐷 denotes the embedding dimension. Rather than sim-
anism is achieved by modifying the learnable parameters             ply stacking GNN layers, we employ Mamba-based sequence
B and C, as well as the step size Δ, to functions of the            of walk sequences. Therefore, our approach effectively cap-
input sequence. Therefore, Mamba can selectively recall or          of walk sequences. Therefore, our approach effectively cap-
ignore information in an input-dependent manner, while              tures local and long-range dependencies within the graph
maintaining linear scalability in sequence length.                  by scanning the neighborhood structure of each node.
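The selection mechanism can be illustrated with a toy scalar-input scan in which the step size Δ_t and the projections B_t, C_t depend on the current input x_t; in Mamba these are learned linear projections, so the plain products below are a deliberate simplification:

```python
import numpy as np

def selective_scan(x, a_diag, W_B, W_C, W_delta):
    """Toy selective SSM: Delta_t, B_t, C_t are functions of x_t, so the
    recurrence is no longer time-invariant and can amplify or suppress
    individual tokens instead of treating them uniformly."""
    h = np.zeros(a_diag.shape[0])
    ys = []
    for x_t in x:
        delta_t = np.logaddexp(0.0, W_delta * x_t)   # softplus keeps Delta_t > 0
        A_bar = np.exp(delta_t * a_diag)             # diagonal discretization
        B_t = delta_t * (W_B * x_t)                  # input-dependent B
        C_t = W_C * x_t                              # input-dependent C
        h = A_bar * h + B_t * x_t
        ys.append(float(C_t @ h))
    return np.array(ys)

# Toy, untrained parameters (our illustrative values, not from the paper).
rng = np.random.default_rng(2)
D = 8
a_diag = -np.abs(rng.standard_normal(D))   # negative entries for stability
W_B, W_C = rng.standard_normal(D), rng.standard_normal(D)
x = rng.standard_normal(20)
y = selective_scan(x, a_diag, W_B, W_C, 0.5)
assert y.shape == (20,)
```

Unlike the LTI scan, this map is not linear in x, which is exactly the input-dependence that lets the model recall or ignore tokens.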
   Inspired by the recent advancements in Mamba, we pro-
pose a Mamba-based sequence modeling architecture for
                                                                    Temporal Scan with Mamba Blocks We adopt the learn-
predictive learning tasks on STGs. Our approach employs
                                                                    able day-of-week and timestamps-of-day embeddings to
a selective mechanism to handle the dynamical changes
                                                                    capture the repetitive patterns over both short and long
in STGs, capturing long-range spatio-temporal dependen-
                                                                    periods in STGs [24]. SpoT-Mamba concatenates these em-
cies. In addition, this allows for addressing computational
                                                                    beddings with the node embedding for each time step to
inefficiencies in transformer-based STGNNs [10, 11, 15].
                                                                    effectively model temporal dynamics from historical obser-
                                                                    vations. Then, it performs selective recurrent scans with
Spatio-Temporal Graph Forecasting A spatio-temporal                 Mamba blocks across the sequence of embeddings along
graph (STG) is defined as G = (V, E, X), where V is                 the time axis, observing changes in temporal dynamics over
a set of N nodes, E ⊂ V × V is a set of edges, X =                  time. This process helps identify critical portions of the
[X_1, . . . , X_τ] is a sequence of observed data for all nodes     sequence for forecasting and captures periodic patterns in
at each historical time step, and τ is the length of the se-        both short- and long-term intervals. Consequently, the re-
quence. Here, X_t ∈ R^{N×D_in} denotes the observed data            sults effectively encompass spatio-temporal dependencies,
at time step t, where D_in denotes the dimension of the in-         thereby enriching the predictive capabilities.
put node attributes. STG forecasting aims to predict future
observations for T′ time steps, given historical observa-
                                                                    STG forecasting of SpoT-Mamba Finally, SpoT-Mamba
tions for the previous T time steps. This is formulated as
                                                                    enhances the representations of nodes scanned along the
[X_{t−T+1}, . . . , X_t] −f(·)→ [X_{t+1}, . . . , X_{t+T′}], where f(·)   temporal axis by incorporating global information from the
represents the STG forecasting model.                               entire graph at each time step through transformer layers.
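To make the shapes of f(·) concrete, here is a deliberately naive stand-in forecaster (repeat the last observation, in the spirit of the HI baseline); a trained model such as SpoT-Mamba would replace it, and D_out = D_in is assumed for simplicity:

```python
import numpy as np

def naive_forecaster(X_hist, T_prime):
    """Illustrative f(.): maps [X_{t-T+1}, ..., X_t] of shape (T, N, D_in)
    to predictions of shape (T', N, D_in) by repeating the last frame."""
    last = X_hist[-1]                          # (N, D_in)
    return np.repeat(last[None], T_prime, axis=0)

# PEMS04-like shapes: T = T' = 12 steps, N = 307 sensors, one feature.
T, T_prime, N, D_in = 12, 12, 307, 1
X_hist = np.random.rand(T, N, D_in)
Y_pred = naive_forecaster(X_hist, T_prime)
assert Y_pred.shape == (T_prime, N, D_in)
```

Any STG forecaster must respect this (T, N, D_in) → (T′, N, D_out) contract; the baselines and SpoT-Mamba differ only in how the mapping is parameterized.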
[Figure 1 diagram. Top row: Graph Structure → Multi-way Walk Sequence Generation (W_BFS(i), W_DFS(i), W_RW(i); walks of length K, ×M) → Bi-directional Scan with Mamba Blocks → Walk Sequence Embedding (∈ R^{K×3D} → Pointwise Conv → ∈ R^{3D} → MLP → w_i ∈ R^D). Bottom row: Ground-Truth Time Series (∈ R^{T×N×D_in}) → Periodicity and Feature Embedding (Embedding Layer) → Temporal Scan with Mamba Blocks (W, Z_{t′}) → Spatial Self-Attention and Regression (Spatial Transformer, Regression Layer) → Predicted Time Series (∈ R^{T′×N×D_out}).]
         Figure 1: Overview of SpoT-Mamba. The first row represents the node-specific walk sequence embedding. The second row
         represents the overall procedure of STG forecasting. W ∈ R^{N×D} denotes the node embeddings for all nodes in the graph
         and Z_{t′} ∈ R^{N×4D} denotes one of the outcomes from the temporal scan, corresponding to the time step t′.



Then, MLP is applied to forecast the attributes of each node              Table 1
for future time steps. To accurately predict the temporal                 Statistics of the PEMS04 dataset.
trajectory while ensuring robustness to outliers that deviate                |V|: 307   |E|: 338   #Time Steps: 16,992   Time Interval: 5 min.   Time Range: 01/2018 - 02/2018
significantly from the expected trajectory, we train SpoT-
Mamba utilizing the Huber loss, which is less sensitive to
outliers while maintaining the smoothness of the squared
error for small errors [24].
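The Huber loss mentioned above, in a short NumPy form (delta is the standard threshold between the quadratic and linear regimes; the value used in training is not restated here):

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Huber loss: squared error for small residuals (|r| <= delta),
    linear for large ones, so outliers contribute less than under MSE."""
    r = np.abs(pred - target)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear).mean()
```

For a residual of 0.5 the loss is 0.5 · 0.5² = 0.125 (quadratic regime); for a residual of 3 it is 1 · (3 − 0.5) = 2.5 (linear regime), illustrating the reduced sensitivity to outliers.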
                                                                          differences between predictions and ground truth values.
                                                                          RMSE is calculated as the square root of the average of the
4. Experiments                                                            squared differences. MAPE is similar to MAE but normalizes
                                                                          each error by the corresponding ground truth value and
We compare the performance of SpoT-Mamba with state-of-                   expresses it as a percentage. For all three metrics, lower
the-art baselines. Additionally, we conduct ablation studies              values indicate better performance. Additionally, we rank all
to further demonstrate the effectiveness of SpoT-Mamba.                   the methods used in our experiments across these evaluation
                                                                          metrics and compute their average ranks.
4.1. Dataset and Experimental Setup
                                                                          Implementation Details SpoT-Mamba is implemented
Dataset We evaluate SpoT-Mamba on PEMS04 [25], a real-
                                                                          using the Deep Graph Library [34] and PyTorch [35]. For the
world traffic flow forecasting benchmark, following the ex-
                                                                          transformer, we utilize the off-the-shelf transformer encoder
perimental setup in [24]. PEMS04 contains highway traffic
                                                                          available in PyTorch, and for Mamba, we employ the official
flow data collected from the California Department of Trans-
                                                                          implementation [19] and apply pre-normalization. We train
portation’s Performance Measurement System (PEMS). The
                                                                          SpoT-Mamba for 300 epochs using the Adam optimizer [36],
nodes represent the sensors, and edges are created when
                                                                          with early stopping if there is no improvement over 20
two sensors are on the same road. Traffic data in PEMS04 is
                                                                          epochs. Additionally, we apply the learning rate decay,
collected every 5 minutes. We set the input and prediction
                                                                          reducing the learning rate at the 20th, 40th, and 60th epochs.
intervals to 1 hour, corresponding to T = T′ = 12. The
                                                                          To determine the optimal hyperparameters for SpoT-Mamba,
statistics of PEMS04 are shown in Table 1. In the experiments,
                                                                           we conduct a grid search. The grid search covers M ∈
PEMS04 is divided into training, validation, and test sets in
                                                                           {2, 4}, learning rates of {0.001, 0.0005}, weight decays of
a 6:2:2 ratio in temporal order.
                                                                          {0.001, 0.0001}, and learning rate decay rates of {0.1, 0.5}.
                                                                           We fix the feed-forward dimension to 256, D = 32, K =
Baselines We compare SpoT-Mamba with several base-                        20, dropout probability to 0.1, batch size to 32, and the
lines using various methods, including GNNs and Trans-                    number of layers for Mamba and the transformer to 3. All
formers. For STGNNs, we consider GWNet [9], DCRNN [26],                   experiments are conducted using GeForce RTX 3090 24GB.
AGCRN [27], GTS [28], and MTGNN [29]. For attention-
based methods, we include STAEformer [24], GMAN [30],
and PDformer [31]. Additionally, other methods such as                    4.2. Traffic Forecasting Performance
HI [16], STNorm [32], and STID [33] are also considered.                  Table 2 shows the traffic forecasting performance of the
                                                                          baselines and SpoT-Mamba on PEMS04 in terms of MAE,
Evaluation Metrics We use three standard metrics for                      RMSE, and MAPE. Note that we report the baseline results
traffic flow prediction: Mean Absolute Error (MAE), Root                  from [24] since we strictly followed the experimental set-
Mean Squared Error (RMSE), and Mean Absolute Percentage                   tings described in [24]. For each evaluation metric, the best
Error (MAPE). MAE is the unweighted mean of the absolute                  results are boldfaced, and the second-best results are under-
[Figure 2 plots: four panels of Traffic Flow versus Time Step (0–500).]
                       Figure 2: Traffic flow forecasting results for four randomly selected nodes in PEMS04. The blue line represents the ground
                       truth, and the orange line denotes the predictions made by SpoT-Mamba.



Table 2                                                                                      Table 3
Traffic forecasting performance on PEMS04. β€˜Avg.’ denotes the                                Ablation studies of SpoT-Mamba on PEMS04, varying the types
average rank across the three evaluation metrics.                                            of the walk scan and temporal scan modules.
   Method     MAE  Rank  RMSE  Rank  MAPE  Rank   Avg.                    Walk Scan   Temporal Scan      MAE     RMSE   MAPE
      HI    42.35 13 61.66 13 29.92 13                                 13                          Transformer Transformer        18.41   30.32 12.12
   GWNet    18.53 5  29.92 1   12.89  6                                 4                          Transformer   Mamba            18.69   30.17 12.28
   DCRNN    19.63 11 31.26  8  13.59 11                                10                            Mamba     Transformer        18.29   30.06 11.93
   AGCRN    19.38 9  31.25  7  13.40  9                               8.33                           Mamba       Mamba            18.31   30.11 11.86
   STGCN    19.57 10 31.38  9  13.44 10                               9.67
     GTS    20.96 12 32.95 12 14.66 12                                 12
  MTGNN     19.17 8  31.70 11 13.37   8                                 9                    module is replaced. Specifically, when the Walk Scan is con-
   STNorm   18.96 6  30.98  6  12.69  5                               5.67                   ducted by a transformer encoder, the overall performance
   GMAN     19.14 7  31.60 10 13.19   7                                 8                    of SpoT-Mamba decreases (first and second rows). On the
  PDformer 18.36  3  30.03  3  12.00  3                                 3                    other hand, replacing only the Mamba blocks for the Tem-
    STID    18.38 4  29.95  2  12.04  4                               3.33                   poral Scan with a transformer encoder shows negligible
 STAEformer 18.22 1  30.18  5  11.98  2                               2.67                   performance differences (third row).
SpoT-Mamba 18.31 2   30.11  4  11.86 1                                2.33                      This disparity can be attributed to the inherent differ-
                                                                                             ences between Mamba and Transformer, along with the
                                                                                             application of learnable embeddings that impose biases on
lined. It is observed that SpoT-Mamba consistently achieves                                  the sequence. Mamba blocks scan inputs recurrently, inher-
high rankings across all metrics: MAE, RMSE, and MAPE,                                       ently considering the sequence order. In contrast, the trans-
recording the highest average rank among all methods. This                                   former encoder does not recognize input sequence order
suggests the effectiveness of Mamba’s selective recurrent                                    by itself. Furthermore, while SpoT-Mamba utilizes learn-
scan in modeling spatio-temporal dependency. Compared                                        able embeddings for temporal sequences, i.e., day-of-week
to other metrics, SpoT-Mamba demonstrates its best per-                                      and timestamps-of-day, it does not apply such embeddings
formance in MAPE, achieving the highest ranking. Among                                       for walk sequences. As a result, the Transformer encoder
the baselines, STAEformer [24] shows the most comparable                                     struggles to perceive the order in walk sequences, despite
performance to SpoT-Mamba.                                                                   performing well with temporal sequences.
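The order-sensitivity argument above can be made concrete with a small sketch. This is an illustration under toy assumptions, not the paper's implementation: a single self-attention layer with random fixed weights stands in for a transformer encoder without positional encodings, and a generic tanh recurrence stands in for Mamba's selective scan. Attention without positional information is permutation-equivariant, so a mean-pooled sequence representation is unchanged when the walk sequence is reordered, whereas the recurrent scan's final state is not.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                # toy walk sequence: 5 node tokens
perm = np.array([4, 0, 3, 1, 2])           # same tokens, different order

Wq, Wk, Wv = rng.normal(size=(3, d, d))    # fixed random attention weights
Wh, Wx = 0.1 * rng.normal(size=(2, d, d))  # fixed random recurrence weights

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(X):
    # Single-head self-attention with no positional encoding, mean-pooled.
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    return (A @ (X @ Wv)).mean(axis=0)

def recurrent_scan(X):
    # Generic recurrent update: the final state depends on token order.
    h = np.zeros(d)
    for x in X:
        h = np.tanh(Wh @ h + Wx @ x)
    return h

print(np.allclose(attention_pool(X), attention_pool(X[perm])))  # True
print(np.allclose(recurrent_scan(X), recurrent_scan(X[perm])))  # False
```

This is why adding learnable order embeddings (as SpoT-Mamba does for temporal sequences) matters for the transformer variant but is unnecessary for the recurrent scan.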

4.3. Qualitative Analysis & Ablation Studies
                                                                                             5. Conclusion and Future Work
In Figure 2, we visualize the predictions of SpoT-Mamba
and the ground-truth time series on PEMS04. For this visu-                                   In this paper, we explore STG forecasting and introduce a
alization, predictions are made for four randomly selected                                   new Mamba-based STG forecasting model, SpoT-Mamba.
nodes, starting from an arbitrary time step within a test                                    SpoT-Mamba utilizes Mamba blocks to scan multi-way
split, covering 576 consecutive time steps (equivalent to two                                walk sequences and temporal sequences. This approach
days). Since we set 𝑇 = 𝑇 β€² = 12, multiple predictions                                       allows the model to effectively capture the long-range spatio-
are concatenated to represent the duration. When multiple                                    temporal dependencies in STG, enhancing forecasting ac-
predictions exist at a single time step, we average them. It is                              curacy on complex graph structures. SpoT-Mamba shows
observed that the predicted time series closely aligns with                                  promising results on the real-world traffic forecasting bench-
the ground-truth data.                                                                       mark PEMS04. For future work, we will extend SpoT-Mamba
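The concatenate-and-average step described above can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes a hypothetical array `preds` of shape (num_windows, horizon) where `preds[i, j]` is the forecast for time step i + j, so consecutive 12-step windows overlap and their overlapping predictions are averaged.

```python
import numpy as np

def stitch_predictions(preds: np.ndarray) -> np.ndarray:
    """Merge overlapping fixed-horizon forecasts into one series.

    preds[i] is the horizon-step forecast issued at time step i, so
    preds[i, j] targets time step i + j. Where several windows cover
    the same time step, their predictions are averaged.
    """
    num_windows, horizon = preds.shape
    length = num_windows + horizon - 1
    total = np.zeros(length)
    count = np.zeros(length)
    for i in range(num_windows):
        total[i : i + horizon] += preds[i]
        count[i : i + horizon] += 1
    return total / count

# Toy check: two 12-step windows predicting constant 1.0 and 3.0
# overlap on 11 steps, where the stitched series averages to 2.0.
preds = np.stack([np.full(12, 1.0), np.full(12, 3.0)])
series = stitch_predictions(preds)
print(series[0], series[5], series[-1])  # 1.0 2.0 3.0
```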
   Additionally, we conduct the ablation studies of SpoT-                                    to handle graphs with complex relations [37, 38] and evolv-
Mamba on PEMS04. We replace the Mamba blocks used                                            ing graphs [39, 40].
for the walk sequence embedding (indicated as Walk Scan)
and for scanning along the time axis (indicated as Temporal
Scan) with transformer encoders. Note that when replacing
                                                                                             Acknowledgments
the Mamba blocks for the Temporal Scan with a transformer                                    This research was supported by an NRF grant funded by
encoder, we reduce the batch size from 32 to 8 due to Out-of-                                MSIT 2022R1A2C4001594 (Extendable Graph Representa-
Memory issues. Results are shown in Table 3. We observed                                     tion Learning) and an IITP grant funded by MSIT 2022-0-
differences in performance depending on which type of scan
00369 (Development of AI Technology to support Expert                  chine Learning, 2022, pp. 27268–27286. URL: https:
Decision-making that can Explain the Reasons/Grounds for               //proceedings.mlr.press/v162/zhou22g.html.
Judgment Results based on Expert Knowledge).                      [12] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dust-
References

 [1] Z. Diao, X. Wang, D. Zhang, Y. Liu, K. Xie, S. He, Dynamic spatial-temporal graph convolutional neural networks for traffic forecasting, in: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, pp. 890–897. doi:10.1609/aaai.v33i01.3301890.
 [2] D. Cao, Y. Wang, J. Duan, C. Zhang, X. Zhu, C. Huang, Y. Tong, B. Xu, J. Bai, J. Tong, Q. Zhang, Spectral temporal graph neural network for multivariate time-series forecasting, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, volume 33, 2020, pp. 17766–17778. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/cdf6581cb7aca4b7e19ef136c6e601a5-Paper.pdf.
 [3] M. Li, Z. Zhu, Spatial-temporal fusion graph neural networks for traffic flow forecasting, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 4189–4196. doi:10.1609/aaai.v35i5.16542.
 [4] Y. Chen, I. Segovia, Y. R. Gel, Z-gcnets: Time zigzags at graph convolutional networks for time series forecasting, in: Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 1684–1694. URL: https://proceedings.mlr.press/v139/chen21o.html.
 [5] K. Guo, Y. Hu, Y. Sun, S. Qian, J. Gao, B. Yin, Hierarchical graph convolution network for traffic forecasting, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 151–159. doi:10.1609/aaai.v35i1.16088.
 [6] B. Lu, X. Gan, H. Jin, L. Fu, H. Zhang, Spatiotemporal adaptive gated graph convolution network for urban traffic flow forecasting, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1025–1034. doi:10.1145/3340531.3411894.
 [7] J. Ye, L. Sun, B. Du, Y. Fu, H. Xiong, Coupled layer-wise graph convolution for transportation demand prediction, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 4617–4625. doi:10.1609/aaai.v35i5.16591.
 [8] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, H. Li, T-gcn: A temporal graph convolutional network for traffic prediction, IEEE Transactions on Intelligent Transportation Systems 21 (2020) 3848–3858. doi:10.1109/TITS.2019.2935152.
 [9] Z. Wu, S. Pan, G. Long, J. Jiang, C. Zhang, Graph wavenet for deep spatial-temporal graph modeling, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 1907–1913. URL: https://dl.acm.org/doi/abs/10.5555/3367243.3367303.
[10] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 11106–11115. doi:10.1609/aaai.v35i12.17325.
[11] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 27268–27286. URL: https://proceedings.mlr.press/v162/zhou22g.html.
[12] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting, in: Proceedings of the 10th International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=0EXmFzUn5I.
[13] G. Lai, W.-C. Chang, Y. Yang, H. Liu, Modeling long- and short-term temporal patterns with deep neural networks, in: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018, pp. 95–104. doi:10.1145/3209978.3210006.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
[15] S. Guo, Y. Lin, N. Feng, C. Song, H. Wan, Attention based spatial-temporal graph convolutional networks for traffic flow forecasting, in: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, pp. 922–929. doi:10.1609/aaai.v33i01.3301922.
[16] Y. Cui, J. Xie, K. Zheng, Historical inertia: A neglected but powerful baseline for long sequence time-series forecasting, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 2965–2969. doi:10.1145/3459637.3482120.
[17] R.-G. Cirstea, B. Yang, C. Guo, T. Kieu, S. Pan, Towards spatio-temporal aware traffic time series forecasting, in: Proceedings of the IEEE 38th International Conference on Data Engineering, 2022, pp. 2900–2913. doi:10.1109/ICDE53745.2022.00262.
[18] A. Gu, K. Goel, C. Re, Efficiently modeling long sequences with structured state spaces, arXiv preprint arXiv:2111.00396 (2022). doi:10.48550/arXiv.2111.00396.
[19] A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, arXiv preprint arXiv:2312.00752 (2023). doi:10.48550/arXiv.2312.00752.
[20] C. Wang, O. Tsepa, J. Ma, B. Wang, Graph-mamba: Towards long-range graph sequence modeling with selective state spaces, arXiv preprint arXiv:2402.00789 (2024). doi:10.48550/arXiv.2402.00789.
[21] A. Behrouz, F. Hashemi, Graph mamba: Towards learning on graphs with state space models, arXiv preprint arXiv:2402.08678 (2024). doi:10.48550/arXiv.2402.08678.
[22] L. Li, H. Wang, W. Zhang, A. Coster, Stg-mamba: Spatial-temporal graph learning via selective state space model, arXiv preprint arXiv:2403.12418 (2024). doi:10.48550/arXiv.2403.12418.
[23] Z. Fang, Q. Long, G. Song, K. Xie, Spatial-temporal graph ode networks for traffic flow forecasting, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 364–373. doi:10.1145/3447548.3467430.
[24] H. Liu, Z. Dong, R. Jiang, J. Deng, J. Deng, Q. Chen, X. Song, Spatio-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 4125–4129. doi:10.1145/3583780.3615160.
[25] C. Song, Y. Lin, S. Guo, H. Wan, Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020, pp. 914–921. doi:10.1609/aaai.v34i01.5438.
[26] Y. Li, R. Yu, C. Shahabi, Y. Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, in: Proceedings of the 6th International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=SJiHXGWAZ.
[27] L. Bai, L. Yao, C. Li, X. Wang, C. Wang, Adaptive graph convolutional recurrent network for traffic forecasting, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 17804–17815. URL: https://dl.acm.org/doi/abs/10.5555/3495724.3497218.
[28] C. Shang, J. Chen, J. Bi, Discrete graph structure learning for forecasting multiple time series, in: Proceedings of the 9th International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=WEHSlH5mOk.
[29] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, C. Zhang, Connecting the dots: Multivariate time series forecasting with graph neural networks, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 753–763. doi:10.1145/3394486.3403118.
[30] C. Zheng, X. Fan, C. Wang, J. Qi, Gman: A graph multi-attention network for traffic prediction, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020, pp. 1234–1241. doi:10.1609/aaai.v34i01.5477.
[31] J. Jiang, C. Han, W. X. Zhao, J. Wang, Pdformer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction, in: Proceedings of the 37th AAAI Conference on Artificial Intelligence, 2023, pp. 4365–4373. doi:10.1609/aaai.v37i4.25556.
[32] J. Deng, X. Chen, R. Jiang, X. Song, I. W. Tsang, St-norm: Spatial and temporal normalization for multivariate time series forecasting, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 269–278. doi:10.1145/3447548.3467330.
[33] Z. Shao, Z. Zhang, F. Wang, W. Wei, Y. Xu, Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4454–4458. doi:10.1145/3511808.3557702.
[34] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, Z. Zhang, Deep graph library: A graph-centric, highly-performant package for graph neural networks, arXiv preprint arXiv:1909.01315 (2020). doi:10.48550/arXiv.1909.01315.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 8024–8035. URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
[36] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations, 2015. URL: https://doi.org/10.48550/arXiv.1412.6980.
[37] H. Kim, J. Choi, J. J. Whang, Dynamic relation-attentive graph neural networks for fraud detection, in: Proceedings of the 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 2023, pp. 1092–1096. doi:10.1109/ICDMW60847.2023.00143.
[38] C. Chung, J. Lee, J. J. Whang, Representation learning on hyper-relational and numeric knowledge graphs with transformers, in: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 310–322.
[39] X. Huang, Y. Yang, Y. Wang, C. Wang, Z. Zhang, J. Xu, L. Chen, M. Vazirgiannis, Dgraph: A large-scale financial dataset for graph anomaly detection, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 22765–22777. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/8f1918f71972789db39ec0d85bb31110-Paper-Datasets_and_Benchmarks.pdf.
[40] J. Lee, C. Chung, J. J. Whang, InGram: Inductive knowledge graph embedding via relation graphs, in: Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 18796–18809.