Dynamic Gated Spatial Temporal Graph Neural Networks for Traffic Forecasting

Ziyan Gui¹, Changhui Liu¹*, Li Xiong¹, Zuoquan Xie² and Liang Wu³
¹ Wuhan Institute of Technology, Wuhan, China
² Central China Normal University, Wuhan, China
³ Southwestern University of Finance and Economics, Chengdu, China

ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21-23, 2022, Guangzhou, China
EMAIL: 2927990583@qq.com (Ziyan Gui); lch52012@163.com (Changhui Liu)*; 2273167975@qq.com (Li Xiong); 1584103174@qq.com (Zuoquan Xie)
ORCID: 0000-0002-4530-5604 (Ziyan Gui)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Traffic forecasting is crucial to intelligent transportation systems and is very challenging due to the uncertainty and complexity of spatial-temporal dependencies in real-world traffic networks. Many existing approaches use a pre-defined graph to model spatial correlations, but they fail to capture the latent spatial evolution. Dynamic graph-based methods have been proposed to address this issue; however, they model spatial and temporal dependencies separately, without internal connection. In this paper, we propose a novel Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN) for traffic forecasting, which captures time-varying spatial correlations and temporal dependencies jointly. In addition, we apply a gate mechanism to the residual connections between the extracted spatial and temporal features. Experimental results on two real-world traffic datasets demonstrate the effectiveness of DSTGNN and show that it can compete with state-of-the-art baselines.

Keywords
Traffic forecasting, DSTGNN, Spatial-Temporal correlation

1. Introduction
As a core component of Intelligent Transportation Systems (ITS) [4], real-time and accurate traffic forecasting is crucial for road resource planning and public traffic safety. The key to traffic forecasting is to capture dynamic and uncertain spatial-temporal dependencies from historical data. In early deep learning approaches, convolutional neural networks (CNNs) were used to extract spatial correlations and recurrent neural networks (RNNs) were used to model temporal dependencies. However, CNNs are only suitable for capturing spatial features in grid data and perform poorly in non-Euclidean space. Recently, graph neural networks (GNNs) [1,12,13] have generalized convolution to graph-structured data and can extract the intrinsic spatial topological information of graphs. To model both temporal and spatial dependencies, many early approaches [3] combined GNNs with RNN-based sequence models and achieved improvements. However, most of them model spatial correlations based on a predefined static graph and cannot capture spatial dynamics. With the advent of self-attention in the Transformer [13], spatial and temporal attention have been adopted [6,9,11] to model dynamic spatial-temporal correlations, with considerable gains. However, these methods do not capture spatial and temporal correlations interactively, which leads to learning irrelevant or redundant information.
In this paper, we propose an end-to-end model named Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN), which models spatial-temporal dependencies jointly to address the above issues. Specifically, we stack multiple dynamic gated spatial-temporal (DGST) blocks to extract spatial-temporal features from historical traffic data. Each DGST block consists of a dynamic graph convolution module for extracting dynamic and static spatial correlations, a temporal attention module for modeling time dynamics, and a gated residual connection for interaction between the extracted spatial and temporal features.
The contributions of this paper can be summarized as follows:
• We propose the dynamic gated spatial-temporal block, which jointly models the spatial and temporal dependencies of traffic data and then makes predictions in a non-autoregressive way.
• Experiments on two real-world traffic datasets are conducted to evaluate the effectiveness of our model, and the results show that our model can compete with the state of the art.

2. Related works

2.1. Graph Convolutional Network
As an efficient variant of CNNs on graph-structured data, graph convolutional networks (GCNs) have been applied to various areas and achieved state-of-the-art results. From the spectral perspective [1], GCNs need to compute the eigen-decomposition of the Laplacian matrix, which consumes substantial computational resources. Subsequently, methods [12,14] based on Chebyshev polynomial approximation were proposed to improve computational efficiency. In addition, GCNs based on the spatial perspective [15,16] not only avoid the eigen-decomposition of the Laplacian matrix but can also learn vertex representations inductively. STGCN [5] combines GCN with standard 1D temporal convolution to tackle traffic time-series prediction. Graph WaveNet [10] learns static adjacency matrices for spatial-temporal modeling but fails to capture dynamic spatial correlations.

2.2. Attention based Traffic forecasting
The attention mechanism has been extensively utilized in many domains such as natural language processing and graph learning (GAT [16]). Recently, researchers have applied attention mechanisms to model spatial-temporal dependencies for traffic forecasting. GMAN [6] proposes the ST-Attention block, adds a transform attention layer between the encoder and decoder, and models dynamic spatial-temporal correlations in both the encoder and decoder. ASTGCN [9] employs attention mechanisms in the spatial and temporal dimensions, respectively, to learn spatial-temporal features. STTN [11] builds a general dynamic graph neural network to model spatial dependencies that change over time. LSGCN [2] combines GCN with graph attention for long short-term traffic prediction.

3. Methodology

3.1. Problem Definition
In this study, a traffic network is denoted as a directed weighted graph $G = (V, E, A)$, where $V$ is a set of $N = |V|$ nodes representing sensors in the traffic network; $E$ is a set of edges indicating the connectivity among the nodes; and $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix of graph $G$, representing the proximity measured by the Euclidean distances between sensors via a Gaussian kernel, as sketched below.
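The paper only states that $A$ is obtained from sensor distances via a Gaussian kernel. The following minimal NumPy sketch shows the thresholded Gaussian-kernel construction commonly used for such sensor graphs (e.g., in STGCN and DCRNN); the function name, the bandwidth `sigma`, and the sparsity threshold `epsilon` are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def gaussian_kernel_adjacency(dist: np.ndarray, sigma: float, epsilon: float = 0.1) -> np.ndarray:
    """Build a weighted adjacency matrix A from pairwise sensor distances.

    A_ij = exp(-dist_ij^2 / sigma^2) if the weight exceeds the sparsity
    threshold epsilon, and 0 otherwise (assumed construction, not from the paper).
    """
    A = np.exp(-np.square(dist) / (sigma ** 2))
    A[A < epsilon] = 0.0          # drop weak connections to keep A sparse
    np.fill_diagonal(A, 0.0)      # self-connections are handled later by the model
    return A
```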
At each time step $t$, the traffic conditions can be represented as a graph signal $X_t \in \mathbb{R}^{N \times C}$ on graph $G$, where $C$ is the number of observed traffic conditions, such as traffic speed, traffic density, and so on. Given a traffic network graph $G$ and the traffic conditions of the historical $P$ time steps $(X_{t-P+1}, \ldots, X_t) \in \mathbb{R}^{P \times N \times C}$ observed by the $N$ nodes, we aim to learn a function $f$ to forecast the traffic conditions of the next $F$ time steps over all nodes. The process can be formulated as:

$(\hat{X}_{t+1}, \ldots, \hat{X}_{t+F}) = f(X_{t-P+1}, \ldots, X_t; G)$,    (1)

where $(\hat{X}_{t+1}, \ldots, \hat{X}_{t+F}) \in \mathbb{R}^{F \times N \times C}$.

3.2. Overall Architecture
Figure 1: The overview of DSTGNN
The overall architecture of the proposed DSTGNN is shown in Figure 1. It consists of three main components: an input layer, multiple stacked dynamic gated spatial-temporal (DGST) blocks, and an output layer. We first use a two-layer fully-connected network to project the traffic data into a high-dimensional space before it enters the model; the stacked DGST blocks then extract spatial and temporal dependencies jointly from the input. Finally, the output layer transforms the features carrying spatial-temporal information from the high-dimensional space back to traffic speed. DSTGNN predicts future traffic conditions in a multi-step manner. The individual modules are detailed below.

3.3. Dynamic Graph Convolution Module
Figure 2: Illustration of the proposed dynamic graph convolution (DGC) module
Traffic conditions at a location are influenced not only by the predefined static graph relations but also by dynamic spatial correlations at that time step, and the degree of influence varies across time steps. As argued in CDGNet [12], it is vital to reflect the sparsity of spatial correlations in a real traffic network, instead of calculating a dense adjacency matrix at each time step via the dot-product operation. In the proposed dynamic graph convolution (DGC) module, the sparse dynamic adjacency matrix is combined with the predefined graph to extract dynamic and static spatial correlations jointly at each time slice.
As shown in Figure 2, the output tensor $X_S \in \mathbb{R}^{P \times N \times d}$ of the input layer is processed by a spatial-temporal positional embedding module, following STTN, to obtain $X'_S \in \mathbb{R}^{P \times N \times d}$ before it enters the first DGC module (for the DGC module in the first DGST block, $X_{ST}^{(0)} = X'_S$). We first view the temporal dimension of $X_{ST}^{(l-1)}$ as a batch dimension, and then transform it to the query, key, and value subspaces by three different linear projections, obtaining $Q_S, K_S, V_S \in \mathbb{R}^{P \times N \times d}$. Formally:

$Q_S = X_{ST}^{(l-1)} W_q, \quad K_S = X_{ST}^{(l-1)} W_k, \quad V_S = X_{ST}^{(l-1)} W_v$,    (2)

where $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are learnable weight matrices, $X_{ST}^{(l-1)} \in \mathbb{R}^{P \times N \times d}$ is the input of the DGC module in the $l$-th DGST block, and $d$ is the dimension of the feature space. Next, we use $\mathrm{ReLU}(\cdot)$ to sparsify the dense adjacency matrix derived from the dot product:

$\tilde{A} = \mathrm{ReLU}(Q_S K_S^\top + I_N)$,    (3)

where $\tilde{A} \in \mathbb{R}^{N \times N}$ is the sparse spatial correlation matrix and $I_N$ is an identity matrix, added to enhance self-connections. Furthermore, to capture both dynamic and static spatial correlations, we combine $\tilde{A}$ and the predefined adjacency matrix $A$ by element-wise product with the broadcasting mechanism. A $\mathrm{softmax}(\cdot)$ function is then adopted to normalize the combined adjacency matrix to avoid gradient vanishing or explosion. The final graph convolution operation can be expressed as:

$H_S^{(l)} = \mathrm{softmax}(\tilde{A} \odot A)\, V_S$,    (4)

where $H_S^{(l)} \in \mathbb{R}^{P \times N \times d}$ denotes the output of the DGC module in the $l$-th DGST block, and $\odot$ stands for element-wise product.
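To make Equations (2)–(4) concrete, the PyTorch-style sketch below shows one possible implementation of the DGC module. The class and variable names (`DGCModule`, `A_static`) are ours, and applying the softmax row-wise over the last axis is an assumption; the paper does not specify the normalization axis.

```python
import torch
import torch.nn as nn

class DGCModule(nn.Module):
    """Sketch of the dynamic graph convolution described by Eqs. (2)-(4)."""

    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)

    def forward(self, x: torch.Tensor, A_static: torch.Tensor) -> torch.Tensor:
        # x: (P, N, d) features; the time axis P is treated as a batch axis,
        # so a separate dynamic graph is built for every time slice.
        # A_static: (N, N) predefined weighted adjacency matrix.
        _, N, _ = x.shape
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)            # Eq. (2)
        eye = torch.eye(N, device=x.device)
        A_dyn = torch.relu(Q @ K.transpose(-1, -2) + eye)          # Eq. (3): sparse dynamic graph
        A_comb = torch.softmax(A_dyn * A_static, dim=-1)           # Eq. (4): fuse with static graph
        return A_comb @ V                                          # (P, N, d)
```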
3.4. Gated Residual Connection
As shown in Figure 1, we apply a gate function to regulate the flow of residual information, drawing inspiration from the gate mechanism in the Gated Recurrent Unit (GRU). This gate regulates how much the previous residual information influences the following module. Specifically, a gated residual shortcut path is added to the DGC module, fusing the spatial correlations in the output $H_S^{(l)}$ with the spatial-temporal features of the input $X_{ST}^{(l-1)} \in \mathbb{R}^{P \times N \times d}$ of the $l$-th DGST block (with $X_{ST}^{(0)} = X'_S$), which can be formulated as:

$X_T^{(l)} = g_s \odot H_S^{(l)} + X_{ST}^{(l-1)} W_{res}$,
$g_s = \sigma\!\left(X_{ST}^{(l-1)} U_g\right)$,    (5)

where $X_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ is the input of the following temporal attention module, $g_s \in \mathbb{R}^{P \times N \times d}$ denotes the gate, $W_{res} \in \mathbb{R}^{d \times d}$ denotes the linear projection weight, $U_g \in \mathbb{R}^{d \times d}$ denotes the state-to-state weight matrix, $\odot$ is the element-wise product, and $\sigma$ stands for the $\mathrm{sigmoid}(\cdot)$ non-linear activation function.
Similar to the DGC module, we also add a gated residual shortcut path to the following temporal attention module, which combines the spatial correlations of the input $X_T^{(l)}$ with the temporal dependencies of the output $H_T^{(l)}$, building an interaction between the extracted spatial and temporal information that learns the spatial-temporal representation $X_{ST}^{(l)} \in \mathbb{R}^{P \times N \times d}$. This process can be written as:

$X_{ST}^{(l)} = g_t \odot H_T^{(l)} + X_T^{(l)} W_{res}$,
$g_t = \sigma\!\left(X_T^{(l)} U_g\right)$,    (6)

where $H_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ is the output of the following temporal attention module and $g_t$ is the gate.

3.5. Temporal Attention Module
The temporal attention module takes the output $X_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ of the gated residual connection in the DGC module as input and views the spatial dimension of $X_T^{(l)}$ as a batch dimension. It then uses a Transformer-based encoder to model temporal dependencies from $X_T^{(l)}$ with spatial correlations. It mainly consists of attention aggregation and FFN refinement, with the former highlighting relative temporal cues and the latter updating the refined features. The output tensor $H_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ of the temporal attention module can be computed as follows:

$Q_T = X_T^{(l)} W'_q, \quad K_T = X_T^{(l)} W'_k, \quad V_T = X_T^{(l)} W'_v$,
$V'_T = \mathrm{LN}\!\left(\mathrm{softmax}\!\left(\frac{Q_T K_T^\top}{\sqrt{d}}\right) V_T + X_T^{(l)}\right)$,
$H_T^{(l)} = \mathrm{LN}\!\left(\mathrm{FFN}(V'_T) + V'_T\right)$,    (7)

where $W'_q, W'_k, W'_v \in \mathbb{R}^{d \times d}$ are weight matrices, $\mathrm{LN}$ indicates layer normalization, and $\mathrm{FFN}$ is the feed-forward network. The above calculation can also be extended in a multi-head manner.

3.6. Loss Function
The output layer takes the last DGST block's output $X_{ST}^{(l)} \in \mathbb{R}^{P \times N \times d}$ as input and transforms it into the final prediction $\hat{Y} \in \mathbb{R}^{F \times N \times C}$. The proposed DSTGNN can be trained in an end-to-end style via back-propagation by minimizing the mean absolute error between the predicted values and the ground truth:

$\mathcal{L}(\Theta) = \frac{1}{F \times N} \sum_{t=1}^{F} \sum_{n=1}^{N} \left| \hat{Y}_{t,n} - Y_{t,n} \right|$,    (8)

where $\Theta$ denotes all the learnable parameters of our model and $Y$ denotes the ground truth. A sketch of how Equations (5)–(8) fit together inside one DGST block is given below.
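The following sketch assembles one DGST block from the pieces above, reusing the `DGCModule` sketch from Section 3.3. It is a minimal illustration under several assumptions: `nn.MultiheadAttention` stands in for the attention of Eq. (7), the FFN width of $4d$ is assumed, and separate `W_res`/`U_g` parameters are used for the two gates even though the paper reuses the same symbols.

```python
import torch
import torch.nn as nn

class DGSTBlock(nn.Module):
    """Sketch of one DGST block: DGC (Eqs. 2-4) -> gated residual (Eq. 5)
    -> temporal attention (Eq. 7) -> gated residual (Eq. 6)."""

    def __init__(self, d: int, n_heads: int = 1):
        super().__init__()
        self.dgc = DGCModule(d)                      # spatial part, sketched in Section 3.3
        self.W_res_s = nn.Linear(d, d, bias=False)   # W_res in Eq. (5)
        self.U_g_s = nn.Linear(d, d, bias=False)     # U_g in Eq. (5)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.W_res_t = nn.Linear(d, d, bias=False)   # W_res in Eq. (6)
        self.U_g_t = nn.Linear(d, d, bias=False)     # U_g in Eq. (6)

    def forward(self, x_st: torch.Tensor, A_static: torch.Tensor) -> torch.Tensor:
        # x_st: (P, N, d) output of the previous block (or the embedded input).
        h_s = self.dgc(x_st, A_static)                              # Eqs. (2)-(4)
        g_s = torch.sigmoid(self.U_g_s(x_st))                       # gate of Eq. (5)
        x_t = g_s * h_s + self.W_res_s(x_st)                        # Eq. (5)

        # Temporal attention: treat the spatial axis as the batch axis.
        x_tn = x_t.transpose(0, 1)                                  # (N, P, d)
        v, _ = self.attn(x_tn, x_tn, x_tn)
        v = self.ln1(v + x_tn)                                      # Eq. (7), attention + residual
        h_t = self.ln2(self.ffn(v) + v).transpose(0, 1)             # Eq. (7), back to (P, N, d)

        g_t = torch.sigmoid(self.U_g_t(x_t))                        # gate of Eq. (6)
        return g_t * h_t + self.W_res_t(x_t)                        # Eq. (6)
```

Stacking several such blocks and adding the output layer, training then amounts to minimizing the MAE of Eq. (8), e.g. `loss = torch.mean(torch.abs(y_hat - y))`.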
4. Experiments

4.1. Experimental Settings

4.1.1. Datasets
We evaluate our DSTGNN on two public real-world traffic datasets, PeMSD7 and PEMS-BAY. PeMSD7 contains two months of traffic data collected from 228 sensors during the weekdays of May and June 2012 in District 7 of California. PEMS-BAY contains six months of traffic data from 325 sensors in the Bay Area, from January 1st to May 31st, 2017. Following STGCN [5], PeMSD7 uses the first 34 days as the training set and the remaining days as the validation and test sets. As with DCRNN [3], PEMS-BAY is divided into three sets: training (70%), validation (10%), and test (20%).

4.1.2. Evaluation Metric
Metrics. To evaluate the performance of our model, we employ three metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE); a sketch of these metrics follows this subsection.
Baselines. Our DSTGNN is compared with the following baselines: ARIMA [7], FC-LSTM [8], DCRNN [3], STGCN [5], GMAN [6], ASTGCN [9], STTN [11], Graph WaveNet [10], and LSGCN [2].
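For reference, a minimal NumPy sketch of the three metrics is shown below. Masking of missing or zero values, which is common practice on PeMS-style data, is not described in the paper; the small `eps` guard in MAPE is our assumption.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_pred - y_true)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.square(y_pred - y_true))))

def mape(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-5) -> float:
    # eps guards against division by zero for near-zero ground-truth values.
    return float(np.mean(np.abs((y_pred - y_true) / (y_true + eps))) * 100.0)
```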
4.2. Experimental Results and Analysis
Table 1: Traffic forecasting performance comparison of DSTGNN and other baselines.
Table 1 shows the performance of DSTGNN and the baselines for 15-, 30-, and 60-minute-ahead prediction on the PeMSD7 and PEMS-BAY datasets. The comparison shows that DSTGNN can compete with the state-of-the-art methods in both long-term and short-term prediction on both datasets, while outperforming the predefined-graph-based spatial-temporal models, namely STGCN [5] and DCRNN [3]. In terms of short-term prediction (<= 30 min), DSTGNN outperforms STTN [11] and GMAN [6] on PEMS-BAY but falls short of Graph WaveNet [10]. It is superior to Graph WaveNet in long-term prediction (60 min), competitive with STTN, and weaker than GMAN. On PeMSD7, DSTGNN displays similar results. This indicates that DSTGNN performs better in long-term forecasting due to the gated residual connections in the DGST block, which sufficiently incorporate related spatial-temporal information and alleviate the accumulation of errors over time.

5. Conclusion
In this paper, we propose a novel framework named Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN) for long- and short-term traffic forecasting. In DSTGNN, we adopt the DGC module to precisely integrate both static and dynamic spatial correlations and, at the same time, use the temporal attention module to capture evolution cues in time series. In addition, we add gated residual connections to the proposed DGST block to fuse the extracted spatial and temporal features. Experiments on two real traffic datasets verify the effectiveness of DSTGNN in modeling spatial-temporal correlations from time series. In the future, we will consider applying DSTGNN to more general spatial-temporal graph sequence forecasting tasks, such as preference prediction in recommendation systems.

6. Acknowledgements
This work was supported by Wuhan Institute of Technology under Grant No. CX2021277.

7. References
[1] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. Computer Science, 2013.
[2] Rongzhou Huang, Chuyin Huang, Yubao Liu, Genan Dai, and Weiyang Kong. LSGCN: Long short-term traffic prediction with graph convolutional networks. In IJCAI, pages 2355–2361, 2020.
[3] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
[4] Usue Mori, Alexander Mendiburu, Maite Álvarez, and Jose A. Lozano. A review of travel time estimation and forecasting for advanced traveller information systems. Transportmetrica A: Transport Science, 11(2):119–157, 2015.
[5] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[6] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1234–1241, 2020.
[7] Spyros Makridakis and Michele Hibon. ARMA models and the Box–Jenkins methodology. Journal of Forecasting, 16(3):147–163, 1997.
[8] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
[9] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929, 2019.
[10] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.
[11] Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
[12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29, 2016.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[14] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.
[16] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.