SpoT-Mamba: Learning Long-Range Dependency on Spatio-Temporal Graphs with Selective State Spaces

Jinhyeok Choi¹, Heehyeon Kim¹, Minhyeong An¹ and Joyce Jiyoung Whang¹,*
¹School of Computing, KAIST, Daejeon, Republic of Korea

STRL'24: Third International Workshop on Spatio-Temporal Reasoning and Learning, 5 August 2024, Jeju, South Korea
*Corresponding author.
cjh0507@kaist.ac.kr (J. Choi); heehyeon@kaist.ac.kr (H. Kim); amh0360@kaist.ac.kr (M. An); jjwhang@kaist.ac.kr (J. J. Whang)

Abstract
Spatio-temporal graph (STG) forecasting is a critical task with extensive applications in the real world, including traffic and weather forecasting. Although several recent methods have been proposed to model complex dynamics in STGs, capturing long-range spatio-temporal dependencies remains a significant challenge, leading to limited performance gains. Inspired by a recently proposed state space model named Mamba, which has shown remarkable capability in capturing long-range dependencies, we propose a new STG forecasting framework named SpoT-Mamba. SpoT-Mamba generates node embeddings by scanning various node-specific walk sequences. Based on the node embeddings, it conducts temporal scans to capture long-range spatio-temporal dependencies. Experimental results on a real-world traffic forecasting dataset demonstrate the effectiveness of SpoT-Mamba.

Keywords
Spatio-Temporal Graphs, Traffic Forecasting, Selective State Spaces, Random Walks

1. Introduction

Predictive learning methods on time series play a crucial role in diverse applications, such as traffic and weather forecasting. The intricate relationships and dynamic nature of time series are often represented as graphs, specifically spatio-temporal graphs (STGs), where node attributes evolve over time. Recently, spatio-temporal graph neural networks (STGNNs) have emerged as a powerful tool for capturing both spatial and temporal dependencies in STGs [1, 2, 3]. Many of these methods employ graph neural networks (GNNs) to exploit the spatial dependencies inherent in the graph structure, integrating them with recurrent units or convolutions to capture temporal dependencies [2, 4, 5, 6, 7, 8, 9]. These approaches have facilitated the capture of spatio-temporal dependencies within STGs. Despite their remarkable performance in predictive learning tasks, they often face challenges in handling long-range temporal dependencies across time steps [10, 11]. STGs often exhibit repetitive patterns over both short and long periods, and capturing these patterns is critical for precise predictions [12, 13].

Therefore, several methods have adopted the self-attention mechanism of transformer layers [14] rather than recurrent units to enhance their capability to exploit global temporal information [10, 11, 15]. However, the significant computational overhead and complexity of attention mechanisms have been highlighted as major concerns [10, 12, 16, 17]. Meanwhile, structured state space sequence (S4) models have emerged as a promising approach for sequence modeling with linear scaling in sequence length [18]. These models combine the advantages of recurrent neural networks and convolutional neural networks, enabling them to handle long-range dependencies without relying on attention. However, due to their inability to select information depending on the input data, they have shown limited performance. A recent study introduced a new S4 model overcoming this issue, named Mamba, which incorporates a selection mechanism to filter information in an input-dependent manner [19]. Mamba has demonstrated notable performance over transformers across various types of sequence data, including language, audio, and genomics. In addition, there have been several studies towards replacing the transformer with Mamba in graph transformer frameworks [20, 21, 22].

In this paper, our focus lies on the predictive learning task on STGs, specifically STG forecasting. For STG forecasting, it is vital to capture the evolving behavior of individual nodes over time and how these changes propagate throughout the entire graph. Furthermore, leveraging these dynamics over long spatial and temporal ranges plays a crucial role in dealing with the intricate spatio-temporal correlations in STGs [1, 3, 23]. Building upon these insights and recent advances, we introduce SpoT-Mamba, a new Spatio-Temporal graph forecasting framework with a Mamba-based sequence modeling architecture. With Mamba blocks, SpoT-Mamba extracts the structural information of each node by scanning multi-way walk sequences and effectively captures long-range temporal dependencies with temporal scans. Experiments on a real-world dataset demonstrate that SpoT-Mamba achieves promising performance in STG forecasting. The official implementation of SpoT-Mamba is available at https://github.com/bdi-lab/SpoT-Mamba.
2. Preliminaries

State Space Model (SSM)
The state space model (SSM) assumes that a dynamic system can be represented by its state at time step $t$ [18]. SSM defines the evolution of the system's state with two equations: $\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t)$ and $y(t) = \mathbf{C}\mathbf{h}(t) + \mathbf{D}x(t)$, where $\mathbf{h}(t) \in \mathbb{R}^{D}$ denotes the latent state, $x(t) \in \mathbb{R}$ represents the input signal, $y(t) \in \mathbb{R}$ denotes the output signal, and $\mathbf{A} \in \mathbb{R}^{D \times D}$, $\mathbf{B} \in \mathbb{R}^{D \times 1}$, $\mathbf{C} \in \mathbb{R}^{1 \times D}$, and $\mathbf{D} \in \mathbb{R}$ are learnable parameters. SSM learns how to transform the input signal $x(t)$ into the latent state $\mathbf{h}(t)$, which is used to model the system dynamics and predict its output $y(t)$.

Discretized SSM
To adapt SSM to discrete input sequences instead of continuous signals, discretization is applied with a step size $\Delta$. The discretized SSM is defined in a recurrent form: $\mathbf{h}_t = \bar{\mathbf{A}}\mathbf{h}_{t-1} + \bar{\mathbf{B}}x_t$ and $y_t = \mathbf{C}\mathbf{h}_t$, where $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ are learnable parameters approximated from $\mathbf{A}$ and $\mathbf{B}$ using a bilinear method with the step size $\Delta$ [18]. The term $\mathbf{D}$ is omitted from the equations since it can be considered a skip connection. This formulation captures temporal dependencies efficiently, with computations similar to those of recurrent neural networks.

Meanwhile, due to its linear time-invariant (LTI) property, the discretized SSM can be reformulated as a discrete convolution: $\mathbf{y} = \bar{\mathbf{K}} * \mathbf{x}$ with $\bar{\mathbf{K}} = (\mathbf{C}\bar{\mathbf{B}}, \mathbf{C}\bar{\mathbf{A}}\bar{\mathbf{B}}, \ldots, \mathbf{C}\bar{\mathbf{A}}^{L-1}\bar{\mathbf{B}}) \in \mathbb{R}^{L}$, where $\mathbf{x} \in \mathbb{R}^{L}$ denotes the input sequence, $\mathbf{y} \in \mathbb{R}^{L}$ denotes the output sequence, $*$ indicates the convolution operation, and $L$ is the sequence length. This representation facilitates parallel training for SSM, thereby enhancing training efficiency.
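To make the two equivalent views concrete, the following minimal NumPy sketch (our illustration, not code from the paper) builds a toy discretized SSM with the bilinear method and computes the output once as a recurrent scan and once as a convolution with the kernel $\bar{\mathbf{K}}$; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, delta = 4, 16, 0.1                      # state size, sequence length, step size

A = -np.eye(D) + 0.1 * rng.standard_normal((D, D))   # continuous-time parameters
B = rng.standard_normal((D, 1))
C = rng.standard_normal((1, D))

# Bilinear discretization with step size delta [18].
I = np.eye(D)
A_bar = np.linalg.solve(I - delta / 2 * A, I + delta / 2 * A)
B_bar = np.linalg.solve(I - delta / 2 * A, delta * B)

x = rng.standard_normal(L)

# (1) Recurrent view: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
h = np.zeros((D, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# (2) Convolutional view: y = K_bar * x with K_bar = (C B_bar, C A_bar B_bar, ...).
K_bar = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.convolve(x, K_bar)[:L]            # causal convolution, truncated to length L

assert np.allclose(y_rec, y_conv)             # the two views agree
```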
The recurrent and convolutional representations of SSM for sequence modeling enable parallel training and linear scaling in sequence length. To further reduce the computational cost of SSM, structured state space sequence (S4) models have been proposed [18]. S4 models address the fundamental bottleneck of SSM, the repeated multiplications by the transition matrix $\mathbf{A}$, by employing a low-rank correction to stably diagonalize $\mathbf{A}$.

Mamba
S4 models have demonstrated remarkable performance in handling long-range dependencies in continuous signal data, such as audio and time series. However, S4 models struggle to effectively handle discrete and information-dense data such as text [19]. This limitation arises from the LTI property inherent in the convolutional form of SSMs. While the LTI property enables linear-time sequence modeling for S4 models, it requires that the learnable matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, as well as the step size $\Delta$, remain unchanged across all time steps. Consequently, S4 models cannot selectively recall previous tokens or combine them with the current token, treating every token in the input sequence uniformly. In contrast, transformers dynamically adjust attention scores based on the input sequence, allowing them to effectively focus on different parts of the sequence [14].

To address both the lack of selectivity in S4 models and the efficiency bottleneck in sequence modeling, a recent study introduced a new S4 model called Mamba, which removes the LTI constraints [19]. Mamba incorporates a selection mechanism that allows its learnable parameters to dynamically interact with the input sequence. This mechanism is achieved by turning the learnable parameters $\mathbf{B}$ and $\mathbf{C}$, as well as the step size $\Delta$, into functions of the input sequence. Therefore, Mamba can selectively recall or ignore information in an input-dependent manner while maintaining linear scalability in sequence length.
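As an illustration of the selection mechanism, the sketch below implements a single-channel selective scan in the spirit of Mamba [19]: the step size $\Delta_t$ and the projections $\mathbf{B}_t$ and $\mathbf{C}_t$ become functions of the current input. The diagonal transition matrix, the per-step zero-order-hold discretization, and all weight values are assumptions for this toy example, not the official implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 8, 32                                   # state size, sequence length

A = -np.exp(rng.standard_normal(N))            # fixed diagonal transition (negative for stability)
w_delta = 0.5                                  # projections that make delta_t, B_t, C_t
w_B = rng.standard_normal(N) * 0.1             # functions of the current input x_t
w_C = rng.standard_normal(N) * 0.1

def selective_scan(x):
    """Scan a scalar sequence with input-dependent delta_t, B_t, C_t (cf. Mamba [19]).

    A large delta_t lets the state absorb the current token; a small delta_t
    preserves the previous state, i.e., the token is effectively ignored.
    """
    h = np.zeros(N)
    ys = []
    for x_t in x:
        delta_t = np.log1p(np.exp(w_delta * x_t))        # softplus keeps the step size positive
        B_t, C_t = w_B * x_t, w_C * x_t
        A_bar = np.exp(delta_t * A)                      # zero-order-hold discretization per step
        B_bar = (A_bar - 1.0) / A * B_t
        h = A_bar * h + B_bar * x_t                      # selective state update
        ys.append(C_t @ h)
    return np.array(ys)

y = selective_scan(rng.standard_normal(L))
print(y.shape)                                           # (32,)
```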
Inspired by the recent advancements in Mamba, we propose a Mamba-based sequence modeling architecture for predictive learning tasks on STGs. Our approach employs a selection mechanism to handle the dynamic changes in STGs, capturing long-range spatio-temporal dependencies. In addition, it avoids the computational inefficiencies of transformer-based STGNNs [10, 11, 15].

Spatio-Temporal Graph Forecasting
A spatio-temporal graph (STG) is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{X})$, where $\mathcal{V}$ is a set of $N$ nodes, $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ is a set of edges, $\mathcal{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_{\tau}]$ is a sequence of observed data for all nodes at each historical time step, and $\tau$ is the length of the sequence. Here, $\mathbf{X}_t \in \mathbb{R}^{N \times D_{\text{in}}}$ denotes the observed data at time step $t$, where $D_{\text{in}}$ denotes the dimension of the input node attributes. STG forecasting aims to predict future observations for $T'$ time steps, given historical observations for the previous $T$ time steps. This is formulated as $[\mathbf{X}_{t-T+1}, \ldots, \mathbf{X}_{t}] \xrightarrow{f(\cdot)} [\mathbf{X}_{t+1}, \ldots, \mathbf{X}_{t+T'}]$, where $f(\cdot)$ represents the STG forecasting model.

3. Spatio-Temporal Graph Forecasting with SpoT-Mamba

We propose SpoT-Mamba (Figure 1), which captures spatial dependencies from node-specific walk sequences and learns temporal dependencies across time steps by leveraging Mamba-based sequence modeling. By utilizing the selection mechanism of Mamba blocks, SpoT-Mamba can selectively propagate or forget information in an input-dependent manner in both the temporal and spatial domains.

Figure 1: Overview of SpoT-Mamba. The first row represents the node-specific walk sequence embedding. The second row represents the overall procedure of STG forecasting. $\mathbf{W} \in \mathbb{R}^{N \times D}$ denotes the node embeddings for all nodes in the graph, and $\mathbf{Z}_{t'} \in \mathbb{R}^{N \times 4D}$ denotes one of the outcomes of the temporal scan, corresponding to time step $t'$.

Multi-way Walk Sequence
In STGs, the temporal sequences of nodes are naturally defined by the time-series data. On the other hand, since the topological structure does not impose a specific order, a tailored method is required to define the spatial sequences of nodes in graphs. Hence, we employ three well-known walk algorithms, depth-first search (DFS), breadth-first search (BFS), and random walks (RW), to extract diverse local and global structural information from each node's neighborhood. The walk sequences of length $K$ for node $v_i$ obtained with these walk algorithms are denoted by $\mathcal{W}_{\text{BFS}}(i)$, $\mathcal{W}_{\text{DFS}}(i)$, and $\mathcal{W}_{\text{RW}}(i)$, respectively. These node-specific walk sequences are extracted $M$ times to exploit more comprehensive structural information, as sketched below.
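The following sketch illustrates how the three walk types could be extracted from an adjacency list. The padding policy for walks that exhaust a small component and the uniform transition probabilities of the random walk are our assumptions, as the paper does not specify them.

```python
import random
from collections import deque

def bfs_walk(adj, i, K):
    """BFS-ordered walk of length K from node i."""
    seen, order, queue = {i}, [], deque([i])
    while queue and len(order) < K:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return (order + [i] * K)[:K]               # pad with the root if the component is small

def dfs_walk(adj, i, K):
    """DFS-ordered walk of length K from node i."""
    seen, order, stack = set(), [], [i]
    while stack and len(order) < K:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        stack.extend(v for v in adj[u] if v not in seen)
    return (order + [i] * K)[:K]

def random_walk(adj, i, K):
    """Uniform random walk of length K from node i."""
    walk = [i]
    while len(walk) < K:
        walk.append(random.choice(adj[walk[-1]]))
    return walk

# Toy graph with 5 nodes; each walk type is sampled M times per node.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
K, M = 6, 2
walks = {name: [fn(adj, 0, K) for _ in range(M)]
         for name, fn in [("BFS", bfs_walk), ("DFS", dfs_walk), ("RW", random_walk)]}
print(walks["RW"])
```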
Walk Sequence Embedding
SpoT-Mamba generates embeddings for the node-specific walk sequences by scanning each sequence. Here, SpoT-Mamba performs bi-directional scans through Mamba blocks, which makes the model robust to permutations and captures the long-range spatial dependencies of the sequence more effectively [21]. Then, SpoT-Mamba aggregates the representations of the node-specific walk sequences with a pointwise convolution. This allows representations of neighboring nodes to be incorporated when generating the walk sequence embedding for each type of walk sequence. Subsequently, SpoT-Mamba integrates the walk sequence embeddings into a node embedding $\mathbf{w}_i \in \mathbb{R}^{D}$ using a multi-layer perceptron (MLP), where $i$ is the node index and $D$ denotes the embedding dimension. Rather than simply stacking GNN layers, we employ Mamba-based sequence modeling to generate node embeddings from the diverse types of walk sequences. Therefore, our approach effectively captures local and long-range dependencies within the graph by scanning the neighborhood structure of each node, as sketched below.
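The sketch below mirrors this stage at the shape level: a bi-directional scan over each walk type, a pointwise convolution, and an MLP fusing the three walk-type embeddings into $\mathbf{w}_i \in \mathbb{R}^{D}$. A GRU stands in for the Mamba blocks so the example runs anywhere, and the pooling choices (mean over walks and positions) are assumptions; the actual SpoT-Mamba layers differ.

```python
import torch
import torch.nn as nn

class BiDirectionalScan(nn.Module):
    """Bi-directional sequence scan over (batch, length, dim) inputs. A GRU is a
    stand-in scanner here; SpoT-Mamba uses Mamba blocks with the same interface."""
    def __init__(self, d):
        super().__init__()
        self.fwd = nn.GRU(d, d, batch_first=True)
        self.bwd = nn.GRU(d, d, batch_first=True)

    def forward(self, x):
        h_f, _ = self.fwd(x)                           # forward scan
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))     # backward scan
        return h_f + torch.flip(h_b, dims=[1])

class WalkSequenceEmbedding(nn.Module):
    """Scan each walk type, apply a pointwise (kernel-size-1) convolution over
    walk positions, pool, and fuse the three walk-type embeddings with an MLP."""
    def __init__(self, d):
        super().__init__()
        self.scan = BiDirectionalScan(d)
        self.pointwise = nn.Conv1d(d, d, kernel_size=1)   # mixes channels at each position
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, walks):                      # walks: (3, M, K, D) features along walks
        embs = []
        for w in walks:                            # one walk type at a time: (M, K, D)
            h = self.scan(w)                       # (M, K, D)
            h = self.pointwise(h.transpose(1, 2))  # (M, D, K), pointwise over positions
            embs.append(h.mean(dim=(0, 2)))        # pool over M walks and K positions -> (D,)
        return self.mlp(torch.cat(embs))           # node embedding w_i in R^D

M, K, D = 2, 6, 32
w_i = WalkSequenceEmbedding(D)(torch.randn(3, M, K, D))
print(w_i.shape)                                   # torch.Size([32])
```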
Temporal Scan with Mamba Blocks
We adopt learnable day-of-week and timestamp-of-day embeddings to capture the repetitive patterns over both short and long periods in STGs [24]. SpoT-Mamba concatenates these embeddings with the node embedding at each time step to effectively model the temporal dynamics of the historical observations. Then, it performs selective recurrent scans with Mamba blocks across the sequence of embeddings along the time axis, observing changes in temporal dynamics over time. This process helps identify the portions of the sequence that are critical for forecasting and captures periodic patterns over both short- and long-term intervals. Consequently, the results effectively encompass spatio-temporal dependencies, thereby enriching the predictive capabilities.

STG Forecasting of SpoT-Mamba
Finally, SpoT-Mamba enhances the representations of nodes scanned along the temporal axis by incorporating global information from the entire graph at each time step through transformer layers. Then, an MLP is applied to forecast the attributes of each node for future time steps. To accurately predict the temporal trajectory while ensuring robustness to outliers that deviate significantly from the expected trajectory, we train SpoT-Mamba using the Huber loss, which is less sensitive to outliers while maintaining the smoothness of the squared error for small errors [24]. A rough sketch of this temporal pipeline follows.
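This shape-level sketch again uses a GRU standing in for the Mamba blocks; the projection details, the use of only the final scan state, and all dimensions beyond Figure 1 are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporalPipeline(nn.Module):
    """Per time step, concatenate input features, the node embedding w_i, and
    learnable day-of-week / timestamp-of-day embeddings; scan along the time
    axis; apply spatial self-attention with a transformer encoder; regress T'
    future steps with an MLP trained under the Huber loss."""
    def __init__(self, d_in, d, t_out):
        super().__init__()
        self.dow = nn.Embedding(7, d)                      # day-of-week
        self.tod = nn.Embedding(288, d)                    # 5-minute slots per day
        self.proj = nn.Linear(d_in, d)
        self.temporal = nn.GRU(4 * d, 4 * d, batch_first=True)   # stand-in for Mamba blocks
        self.spatial = nn.TransformerEncoderLayer(4 * d, nhead=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(4 * d, 4 * d), nn.ReLU(), nn.Linear(4 * d, t_out))

    def forward(self, x, w, dow_idx, tod_idx):             # x: (N, T, d_in), w: (N, D)
        N, T, _ = x.shape
        z = torch.cat([self.proj(x), w.unsqueeze(1).expand(-1, T, -1),
                       self.dow(dow_idx).unsqueeze(0).expand(N, -1, -1),
                       self.tod(tod_idx).unsqueeze(0).expand(N, -1, -1)], dim=-1)
        z, _ = self.temporal(z)                            # temporal scan: (N, T, 4D)
        z = self.spatial(z[:, -1].unsqueeze(0)).squeeze(0) # attention across the N nodes
        return self.head(z)                                # (N, T'): predicted traffic flow

N, T, T_out, D = 307, 12, 12, 32
model = TemporalPipeline(d_in=1, d=D, t_out=T_out)
x, w = torch.randn(N, T, 1), torch.randn(N, D)
dow, tod = torch.randint(0, 7, (T,)), torch.randint(0, 288, (T,))
pred = model(x, w, dow, tod)
loss = nn.HuberLoss()(pred, torch.randn(N, T_out))        # robust regression objective [24]
loss.backward()
```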
4. Experiments

We compare the performance of SpoT-Mamba with state-of-the-art baselines. Additionally, we conduct ablation studies to further demonstrate the effectiveness of SpoT-Mamba.

4.1. Dataset and Experimental Setup

Dataset
We evaluate SpoT-Mamba on PEMS04 [25], a real-world traffic flow forecasting benchmark, following the experimental setup in [24]. PEMS04 contains highway traffic flow data collected from the California Department of Transportation's Performance Measurement System (PEMS). The nodes represent sensors, and edges are created when two sensors are on the same road. Traffic data in PEMS04 is collected every 5 minutes. We set the input and prediction intervals to 1 hour, corresponding to $T = T' = 12$. The statistics of PEMS04 are shown in Table 1. In the experiments, PEMS04 is divided into training, validation, and test sets in a 6:2:2 ratio in temporal order.

Table 1
Statistics of the PEMS04 dataset.

$|\mathcal{V}|$   $|\mathcal{E}|$   #Time Steps   Time Interval   Time Range
307               338               16,992        5 min.          01/2018 - 02/2018

Baselines
We compare SpoT-Mamba with several baselines covering various methods, including GNNs and transformers. For STGNNs, we consider GWNet [9], DCRNN [26], AGCRN [27], GTS [28], and MTGNN [29]. For attention-based methods, we include STAEformer [24], GMAN [30], and PDFormer [31]. Additionally, other methods such as HI [16], STNorm [32], and STID [33] are also considered.

Evaluation Metrics
We use three standard metrics for traffic flow prediction: mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). MAE is the unweighted mean of the absolute differences between predictions and ground-truth values. RMSE is the square root of the average of the squared differences. MAPE is similar to MAE but normalizes each error by the corresponding ground-truth value and expresses it as a percentage. For all three metrics, lower values indicate better performance. Additionally, we rank all the methods used in our experiments on these evaluation metrics and compute their average ranks. The three metrics can be computed as in the sketch below.
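For reference, standard implementations of the three metrics follow; guarding MAPE against zero ground-truth values with a small epsilon (or masking zeros) is a common convention in traffic benchmarks and is our choice here, not necessarily the evaluation code of [24].

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error."""
    return np.abs(pred - true).mean()

def rmse(pred, true):
    """Root mean squared error."""
    return np.sqrt(((pred - true) ** 2).mean())

def mape(pred, true, eps=1e-8):
    """Mean absolute percentage error, with eps guarding against zero flow values."""
    return (np.abs(pred - true) / np.maximum(np.abs(true), eps)).mean() * 100.0

pred, true = np.array([18.0, 21.0, 30.0]), np.array([20.0, 20.0, 29.0])
print(mae(pred, true), rmse(pred, true), mape(pred, true))
```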
Implementation Details
SpoT-Mamba is implemented using the Deep Graph Library [34] and PyTorch [35]. For the transformer, we utilize the off-the-shelf transformer encoder available in PyTorch, and for Mamba, we employ the official implementation [19] and apply pre-normalization. We train SpoT-Mamba for 300 epochs using the Adam optimizer [36], with early stopping if there is no improvement over 20 epochs. Additionally, we apply learning rate decay, reducing the learning rate at the 20th, 40th, and 60th epochs. To determine the optimal hyperparameters for SpoT-Mamba, we conduct a grid search covering $M \in \{2, 4\}$, learning rates in $\{0.001, 0.0005\}$, weight decays in $\{0.001, 0.0001\}$, and learning rate decay rates in $\{0.1, 0.5\}$. We fix the feed-forward dimension to 256, $D = 32$, $K = 20$, the dropout probability to 0.1, the batch size to 32, and the number of layers for both Mamba and the transformer to 3. All experiments are conducted on a GeForce RTX 3090 GPU with 24GB of memory.

4.2. Traffic Forecasting Performance

Table 2 shows the traffic forecasting performance of the baselines and SpoT-Mamba on PEMS04 in terms of MAE, RMSE, and MAPE. Note that we report the baseline results from [24] since we strictly followed the experimental settings described in [24]. For each evaluation metric, the best results are boldfaced, and the second-best results are underlined. It is observed that SpoT-Mamba consistently achieves high rankings across all metrics (MAE, RMSE, and MAPE), recording the highest average rank among all methods. This suggests the effectiveness of Mamba's selective recurrent scan in modeling spatio-temporal dependencies. Compared to the other metrics, SpoT-Mamba demonstrates its best performance on MAPE, achieving the highest ranking. Among the baselines, STAEformer [24] shows the performance most comparable to SpoT-Mamba.

Table 2
Traffic forecasting performance on PEMS04. 'Avg.' denotes the average rank across the three evaluation metrics.

Method        MAE    Rank   RMSE   Rank   MAPE   Rank   Avg.
HI            42.35  13     61.66  13     29.92  13     13
GWNet         18.53  5      29.92  1      12.89  6      4
DCRNN         19.63  11     31.26  8      13.59  11     10
AGCRN         19.38  9      31.25  7      13.40  9      8.33
STGCN         19.57  10     31.38  9      13.44  10     9.67
GTS           20.96  12     32.95  12     14.66  12     12
MTGNN         19.17  8      31.70  11     13.37  8      9
STNorm        18.96  6      30.98  6      12.69  5      5.67
GMAN          19.14  7      31.60  10     13.19  7      8
PDFormer      18.36  3      30.03  3      12.00  3      3
STID          18.38  4      29.95  2      12.04  4      3.33
STAEformer    18.22  1      30.18  5      11.98  2      2.67
SpoT-Mamba    18.31  2      30.11  4      11.86  1      2.33

4.3. Qualitative Analysis & Ablation Studies

In Figure 2, we visualize the predictions of SpoT-Mamba and the ground-truth time series on PEMS04. For this visualization, predictions are made for four randomly selected nodes, starting from an arbitrary time step within the test split and covering 576 consecutive time steps (equivalent to two days). Since we set $T = T' = 12$, multiple predictions are concatenated to cover this duration. When multiple predictions exist at a single time step, we average them. We observe that the predicted time series closely aligns with the ground-truth data.

Figure 2: Traffic flow forecasting results for four randomly selected nodes in PEMS04. The blue line represents the ground truth, and the orange line denotes the predictions made by SpoT-Mamba.

Additionally, we conduct ablation studies of SpoT-Mamba on PEMS04. We replace the Mamba blocks used for the walk sequence embedding (indicated as Walk Scan) and for scanning along the time axis (indicated as Temporal Scan) with transformer encoders. Note that when replacing the Mamba blocks for the Temporal Scan with a transformer encoder, we reduce the batch size from 32 to 8 due to out-of-memory issues. The results are shown in Table 3. We observe differences in performance depending on which type of scan module is replaced. Specifically, when the Walk Scan is conducted by a transformer encoder, the overall performance of SpoT-Mamba decreases (first and second rows). On the other hand, replacing only the Mamba blocks for the Temporal Scan with a transformer encoder shows negligible performance differences (third row).

Table 3
Ablation studies of SpoT-Mamba on PEMS04, varying the types of the walk scan and temporal scan modules.

Walk Scan     Temporal Scan   MAE    RMSE   MAPE
Transformer   Transformer     18.41  30.32  12.12
Transformer   Mamba           18.69  30.17  12.28
Mamba         Transformer     18.29  30.06  11.93
Mamba         Mamba           18.31  30.11  11.86

This disparity can be attributed to the inherent differences between Mamba and the transformer, along with the application of learnable embeddings that impose biases on the sequence. Mamba blocks scan inputs recurrently, inherently taking the sequence order into account. In contrast, the transformer encoder does not recognize the order of the input sequence by itself. Furthermore, while SpoT-Mamba utilizes learnable embeddings for temporal sequences, i.e., day-of-week and timestamp-of-day, it does not apply such embeddings to walk sequences. As a result, the transformer encoder struggles to perceive the order of walk sequences, despite performing well with temporal sequences.

5. Conclusion and Future Work

In this paper, we explore STG forecasting and introduce a new Mamba-based STG forecasting model, SpoT-Mamba. SpoT-Mamba utilizes Mamba blocks to scan multi-way walk sequences and temporal sequences. This approach allows the model to effectively capture long-range spatio-temporal dependencies in STGs, enhancing forecasting accuracy on complex graph structures. SpoT-Mamba shows promising results on the real-world traffic forecasting benchmark PEMS04. For future work, we will extend SpoT-Mamba to handle graphs with complex relations [37, 38] and evolving graphs [39, 40].

Acknowledgments

This research was supported by an NRF grant funded by MSIT 2022R1A2C4001594 (Extendable Graph Representation Learning) and an IITP grant funded by MSIT 2022-0-00369 (Development of AI Technology to support Expert Decision-making that can Explain the Reasons/Grounds for Judgment Results based on Expert Knowledge).
References

[1] Z. Diao, X. Wang, D. Zhang, Y. Liu, K. Xie, S. He, Dynamic spatial-temporal graph convolutional neural networks for traffic forecasting, in: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, pp. 890–897. doi:10.1609/aaai.v33i01.3301890.
[2] D. Cao, Y. Wang, J. Duan, C. Zhang, X. Zhu, C. Huang, Y. Tong, B. Xu, J. Bai, J. Tong, Q. Zhang, Spectral temporal graph neural network for multivariate time-series forecasting, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 17766–17778. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/cdf6581cb7aca4b7e19ef136c6e601a5-Paper.pdf.
[3] M. Li, Z. Zhu, Spatial-temporal fusion graph neural networks for traffic flow forecasting, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 4189–4196. doi:10.1609/aaai.v35i5.16542.
[4] Y. Chen, I. Segovia, Y. R. Gel, Z-GCNETs: Time zigzags at graph convolutional networks for time series forecasting, in: Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 1684–1694. URL: https://proceedings.mlr.press/v139/chen21o.html.
[5] K. Guo, Y. Hu, Y. Sun, S. Qian, J. Gao, B. Yin, Hierarchical graph convolution network for traffic forecasting, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 151–159. doi:10.1609/aaai.v35i1.16088.
[6] B. Lu, X. Gan, H. Jin, L. Fu, H. Zhang, Spatiotemporal adaptive gated graph convolution network for urban traffic flow forecasting, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1025–1034. doi:10.1145/3340531.3411894.
[7] J. Ye, L. Sun, B. Du, Y. Fu, H. Xiong, Coupled layer-wise graph convolution for transportation demand prediction, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 4617–4625. doi:10.1609/aaai.v35i5.16591.
[8] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, H. Li, T-GCN: A temporal graph convolutional network for traffic prediction, IEEE Transactions on Intelligent Transportation Systems 21 (2020) 3848–3858. doi:10.1109/TITS.2019.2935152.
[9] Z. Wu, S. Pan, G. Long, J. Jiang, C. Zhang, Graph WaveNet for deep spatial-temporal graph modeling, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 1907–1913. URL: https://dl.acm.org/doi/abs/10.5555/3367243.3367303.
[10] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, in: Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021, pp. 11106–11115. doi:10.1609/aaai.v35i12.17325.
[11] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, R. Jin, FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting, in: Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 27268–27286. URL: https://proceedings.mlr.press/v162/zhou22g.html.
[12] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting, in: Proceedings of the 10th International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=0EXmFzUn5I.
[13] G. Lai, W.-C. Chang, Y. Yang, H. Liu, Modeling long- and short-term temporal patterns with deep neural networks, in: Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018, pp. 95–104. doi:10.1145/3209978.3210006.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.
[15] S. Guo, Y. Lin, N. Feng, C. Song, H. Wan, Attention based spatial-temporal graph convolutional networks for traffic flow forecasting, in: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019, pp. 922–929. doi:10.1609/aaai.v33i01.3301922.
[16] Y. Cui, J. Xie, K. Zheng, Historical inertia: A neglected but powerful baseline for long sequence time-series forecasting, in: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 2021, pp. 2965–2969. doi:10.1145/3459637.3482120.
[17] R.-G. Cirstea, B. Yang, C. Guo, T. Kieu, S. Pan, Towards spatio-temporal aware traffic time series forecasting, in: Proceedings of the IEEE 38th International Conference on Data Engineering, 2022, pp. 2900–2913. doi:10.1109/ICDE53745.2022.00262.
[18] A. Gu, K. Goel, C. Ré, Efficiently modeling long sequences with structured state spaces, arXiv preprint arXiv:2111.00396 (2022). doi:10.48550/arXiv.2111.00396.
[19] A. Gu, T. Dao, Mamba: Linear-time sequence modeling with selective state spaces, arXiv preprint arXiv:2312.00752 (2023). doi:10.48550/arXiv.2312.00752.
[20] C. Wang, O. Tsepa, J. Ma, B. Wang, Graph-Mamba: Towards long-range graph sequence modeling with selective state spaces, arXiv preprint arXiv:2402.00789 (2024). doi:10.48550/arXiv.2402.00789.
[21] A. Behrouz, F. Hashemi, Graph Mamba: Towards learning on graphs with state space models, arXiv preprint arXiv:2402.08678 (2024). doi:10.48550/arXiv.2402.08678.
[22] L. Li, H. Wang, W. Zhang, A. Coster, STG-Mamba: Spatial-temporal graph learning via selective state space model, arXiv preprint arXiv:2403.12418 (2024). doi:10.48550/arXiv.2403.12418.
[23] Z. Fang, Q. Long, G. Song, K. Xie, Spatial-temporal graph ODE networks for traffic flow forecasting, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 364–373. doi:10.1145/3447548.3467430.
[24] H. Liu, Z. Dong, R. Jiang, J. Deng, J. Deng, Q. Chen, X. Song, Spatio-temporal adaptive embedding makes vanilla transformer SOTA for traffic forecasting, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 2023, pp. 4125–4129. doi:10.1145/3583780.3615160.
[25] C. Song, Y. Lin, S. Guo, H. Wan, Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020, pp. 914–921. doi:10.1609/aaai.v34i01.5438.
[26] Y. Li, R. Yu, C. Shahabi, Y. Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, in: Proceedings of the 6th International Conference on Learning Representations, 2018. URL: https://openreview.net/forum?id=SJiHXGWAZ.
[27] L. Bai, L. Yao, C. Li, X. Wang, C. Wang, Adaptive graph convolutional recurrent network for traffic forecasting, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 17804–17815. URL: https://dl.acm.org/doi/abs/10.5555/3495724.3497218.
[28] C. Shang, J. Chen, J. Bi, Discrete graph structure learning for forecasting multiple time series, in: Proceedings of the 9th International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=WEHSlH5mOk.
[29] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, C. Zhang, Connecting the dots: Multivariate time series forecasting with graph neural networks, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 753–763. doi:10.1145/3394486.3403118.
[30] C. Zheng, X. Fan, C. Wang, J. Qi, GMAN: A graph multi-attention network for traffic prediction, in: Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020, pp. 1234–1241. doi:10.1609/aaai.v34i01.5477.
[31] J. Jiang, C. Han, W. X. Zhao, J. Wang, PDFormer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction, in: Proceedings of the 37th AAAI Conference on Artificial Intelligence, 2023, pp. 4365–4373. doi:10.1609/aaai.v37i4.25556.
[32] J. Deng, X. Chen, R. Jiang, X. Song, I. W. Tsang, ST-Norm: Spatial and temporal normalization for multivariate time series forecasting, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 269–278. doi:10.1145/3447548.3467330.
[33] Z. Shao, Z. Zhang, F. Wang, W. Wei, Y. Xu, Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022, pp. 4454–4458. doi:10.1145/3511808.3557702.
[34] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, Z. Zhang, Deep Graph Library: A graph-centric, highly-performant package for graph neural networks, arXiv preprint arXiv:1909.01315 (2020). doi:10.48550/arXiv.1909.01315.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 8024–8035. URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
[36] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations, 2015. URL: https://doi.org/10.48550/arXiv.1412.6980.
[37] H. Kim, J. Choi, J. J. Whang, Dynamic relation-attentive graph neural networks for fraud detection, in: Proceedings of the 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 2023, pp. 1092–1096. doi:10.1109/ICDMW60847.2023.00143.
[38] C. Chung, J. Lee, J. J. Whang, Representation learning on hyper-relational and numeric knowledge graphs with transformers, in: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 310–322.
[39] X. Huang, Y. Yang, Y. Wang, C. Wang, Z. Zhang, J. Xu, L. Chen, M. Vazirgiannis, DGraph: A large-scale financial dataset for graph anomaly detection, in: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 22765–22777. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/8f1918f71972789db39ec0d85bb31110-Paper-Datasets_and_Benchmarks.pdf.
[40] J. Lee, C. Chung, J. J. Whang, InGram: Inductive knowledge graph embedding via relation graphs, in: Proceedings of the 40th International Conference on Machine Learning, 2023, pp. 18796–18809.