Dynamic Gated Spatial Temporal Graph Neural Networks for Traffic Forecasting

Ziyan Gui¹, Changhui Liu¹*, Li Xiong¹, Zuoquan Xie² and Liang Wu³
¹ Wuhan Institute of Technology, Wuhan, China
² Central China Normal University, Wuhan, China
³ Southwestern University of Finance and Economics, Chengdu, China

ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21-23, 2022, Guangzhou, China
EMAIL: 2927990583@qq.com (Ziyan Gui); lch52012@163.com (Changhui Liu)*; 2273167975@qq.com (Li Xiong); 1584103174@qq.com (Zuoquan Xie)
ORCID: 0000-0002-4530-5604 (Ziyan Gui)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Traffic forecasting is crucial to intelligent transportation systems and is very challenging due to the uncertainty and complexity of spatial-temporal dependencies in real-world traffic networks. Many existing approaches use a pre-defined graph to model spatial correlations, but they fail to capture the latent spatial evolution. Dynamic graph-based methods have been proposed to address this issue; however, they model spatial and temporal dependencies separately, without internal connection. In this paper, we propose a novel Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN) for traffic forecasting, which captures time-varying spatial correlations and temporal dependencies jointly. In addition, we apply a gate mechanism to the residual connections between the extracted spatial and temporal features. Experimental results on two real-world traffic datasets demonstrate the effectiveness of DSTGNN and show that it can compete with state-of-the-art baselines.

Keywords
Traffic forecasting, DSTGNN, Spatial-Temporal correlation

1. Introduction
As a core component of Intelligent Transportation Systems (ITS) [4], real-time and accurate traffic forecasting is crucial for road resource planning and public traffic safety. The key to traffic forecasting is to capture dynamic and uncertain spatial-temporal dependencies from historical data. In early deep learning approaches, convolutional neural networks (CNNs) were used to extract spatial correlations and recurrent neural networks (RNNs) were used to model temporal dependencies. However, CNNs are only suitable for capturing spatial features in grid data and perform poorly in non-Euclidean space. Recently, graph neural networks (GNNs) [1,12,13] have generalized convolution to graph-structured data and can extract the intrinsic spatial topological information of graphs. To model both temporal and spatial dependencies, many early approaches [3] combined GNNs with RNN-based sequence models and achieved improvements. However, most of them model spatial correlations based on a predefined static graph and cannot capture spatial dynamics. With the advent of self-attention in the Transformer [13], spatial and temporal attention have been adopted [6,9,11] to model dynamic spatial-temporal correlations, with considerable gains. However, these methods do not capture spatial and temporal correlations interactively, which leads to learning irrelevant or redundant information.
In this paper, we propose an end-to-end model named Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN), which models spatial-temporal dependencies jointly to address the above issues. Specifically, we stack multiple dynamic gated spatial-temporal (DGST) blocks to extract spatial-temporal features from historical traffic data. Each DGST block consists of a dynamic graph convolution module for extracting dynamic and static spatial correlations, a temporal attention module for modeling time dynamics, and a gated residual connection for interaction between the extracted spatial and temporal features.
The contributions of this paper can be summarized as follows:
• We propose the dynamic gated spatial-temporal block, which jointly models the spatial and temporal dependencies of traffic data and then makes predictions in a non-autoregressive way.
• Experiments on two real-world traffic datasets are conducted to evaluate the effectiveness of our model, and the results show that our model can compete with the state of the art.

2. Related works

2.1. Graph Convolutional Network
As an efficient variant of CNNs on graph-structured data, graph convolutional networks (GCNs) have been applied to various areas and achieved state-of-the-art results. From the spectral perspective [1], GCNs need to compute the eigen-decomposition of the Laplacian matrix, which consumes substantial computational resources. Subsequently, methods [12,14] based on Chebyshev polynomial approximation were proposed to improve computational efficiency. In addition, GCNs based on the spatial perspective [15,16] not only avoid the eigen-decomposition of the Laplacian matrix but can also learn vertex representations inductively. STGCN [5] combines GCN with standard 1D temporal convolution to tackle traffic time-series prediction. Graph WaveNet [10] learns static adjacency matrices for spatial-temporal modeling but fails to capture dynamic spatial correlations.

2.2. Attention based Traffic forecasting
The attention mechanism has been extensively utilized in many domains such as natural language processing and graph learning (GAT [16]). Recently, researchers have applied attention mechanisms to model spatial-temporal dependencies for traffic forecasting. GMAN [6] proposes the ST-Attention block, adds a transform attention layer between the encoder and decoder, and models dynamic spatial-temporal correlations in both the encoder and decoder. ASTGCN [9] employs attention mechanisms in the spatial and temporal dimensions, respectively, to learn spatial-temporal features. STTN [11] builds a general dynamic graph neural network to model spatial dependencies that change over time. LSGCN [2] combines GCN with graph attention for long short-term traffic prediction.

3. Methodology

3.1. Problem Definition
In this study, a traffic network is denoted as a directed weighted graph $G = (V, E, A)$, where $V$ is a set of $N = |V|$ nodes representing sensors in the traffic network; $E$ is a set of edges indicating the connectivity among the nodes; and $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix of graph $G$, representing the proximity measured by the Euclidean distances between sensors via a Gaussian kernel, as sketched below.
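The paper only states that $A$ is obtained from sensor distances via a Gaussian kernel. The following minimal NumPy sketch shows the thresholded Gaussian-kernel construction commonly used for such sensor graphs (e.g., in STGCN and DCRNN); the function name, the bandwidth `sigma`, and the sparsity threshold `epsilon` are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def gaussian_kernel_adjacency(dist: np.ndarray, sigma: float, epsilon: float = 0.1) -> np.ndarray:
    """Build a weighted adjacency matrix A from pairwise sensor distances.

    A_ij = exp(-dist_ij^2 / sigma^2) if the weight exceeds the sparsity
    threshold epsilon, and 0 otherwise (assumed construction, not from the paper).
    """
    A = np.exp(-np.square(dist) / (sigma ** 2))
    A[A < epsilon] = 0.0          # drop weak connections to keep A sparse
    np.fill_diagonal(A, 0.0)      # self-connections are handled later by the model
    return A
```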
At each time step $t$, the traffic conditions can be represented as a graph signal $X_t \in \mathbb{R}^{N \times C}$ on graph $G$, where $C$ is the number of observed traffic conditions, such as traffic speed, traffic density, and so on. Given a traffic network graph $G$ and the traffic conditions of the historical $P$ time steps $(X_{t-P+1}, \ldots, X_t) \in \mathbb{R}^{P \times N \times C}$ observed by the $N$ nodes, we aim to learn a function $f$ to forecast the traffic conditions of the next $F$ time steps over all nodes. The process can be formulated as:

$(\hat{X}_{t+1}, \ldots, \hat{X}_{t+F}) = f(X_{t-P+1}, \ldots, X_t; G)$,    (1)

where $(\hat{X}_{t+1}, \ldots, \hat{X}_{t+F}) \in \mathbb{R}^{F \times N \times C}$.

3.2. Overall Architecture
Figure 1: The overview of DSTGNN
The overall architecture of the proposed DSTGNN is shown in Figure 1. It consists of three main components: an input layer, multiple stacked dynamic gated spatial-temporal (DGST) blocks, and an output layer. We first use a two-layer fully-connected network to project the traffic data into a high-dimensional space before it enters the model; the stacked DGST blocks then extract spatial and temporal dependencies jointly from the input. Finally, the output layer transforms the features carrying spatial-temporal information from the high-dimensional space back to traffic speed. DSTGNN predicts future traffic conditions in a multi-step manner. The individual modules are detailed below.

3.3. Dynamic Graph Convolution Module
Figure 2: Illustration of the proposed dynamic graph convolution (DGC) module
Traffic conditions at a location are influenced not only by the predefined static graph relations but also by dynamic spatial correlations at that time step, and the degree of influence varies across time steps. As argued in CDGNet [12], it is vital to reflect the sparsity of spatial correlations in a real traffic network, instead of calculating a dense adjacency matrix at each time step via the dot-product operation. In the proposed dynamic graph convolution (DGC) module, the sparse dynamic adjacency matrix is combined with the predefined graph to extract dynamic and static spatial correlations jointly at each time slice.
As shown in Figure 2, the output tensor $X_S \in \mathbb{R}^{P \times N \times d}$ of the input layer is processed by a spatial-temporal positional embedding module, following STTN, to obtain $X'_S \in \mathbb{R}^{P \times N \times d}$ before it enters the first DGC module (for the DGC module in the first DGST block, $X_{ST}^{(0)} = X'_S$). We first view the temporal dimension of $X_{ST}^{(l-1)}$ as a batch dimension, and then transform it to the query, key, and value subspaces by three different linear projections, obtaining $Q_S, K_S, V_S \in \mathbb{R}^{P \times N \times d}$. Formally:

$Q_S = X_{ST}^{(l-1)} W_q, \quad K_S = X_{ST}^{(l-1)} W_k, \quad V_S = X_{ST}^{(l-1)} W_v$,    (2)

where $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are learnable weight matrices, $X_{ST}^{(l-1)} \in \mathbb{R}^{P \times N \times d}$ is the input of the DGC module in the $l$-th DGST block, and $d$ is the dimension of the feature space. Next, we use $\mathrm{ReLU}(\cdot)$ to sparsify the dense adjacency matrix derived from the dot product:

$\tilde{A} = \mathrm{ReLU}(Q_S K_S^\top + I_N)$,    (3)

where $\tilde{A} \in \mathbb{R}^{N \times N}$ is the sparse spatial correlation matrix and $I_N$ is an identity matrix, added to enhance self-connections. Furthermore, to capture both dynamic and static spatial correlations, we combine $\tilde{A}$ and the predefined adjacency matrix $A$ by element-wise product with the broadcasting mechanism. A $\mathrm{softmax}(\cdot)$ function is then adopted to normalize the combined adjacency matrix to avoid gradient vanishing or explosion. The final graph convolution operation can be expressed as:

$H_S^{(l)} = \mathrm{softmax}(\tilde{A} \odot A)\, V_S$,    (4)

where $H_S^{(l)} \in \mathbb{R}^{P \times N \times d}$ denotes the output of the DGC module in the $l$-th DGST block, and $\odot$ stands for element-wise product.
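To make Equations (2)–(4) concrete, the PyTorch-style sketch below shows one possible implementation of the DGC module. The class and variable names (`DGCModule`, `A_static`) are ours, and applying the softmax row-wise over the last axis is an assumption; the paper does not specify the normalization axis.

```python
import torch
import torch.nn as nn

class DGCModule(nn.Module):
    """Sketch of the dynamic graph convolution described by Eqs. (2)-(4)."""

    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)

    def forward(self, x: torch.Tensor, A_static: torch.Tensor) -> torch.Tensor:
        # x: (P, N, d) features; the time axis P is treated as a batch axis,
        # so a separate dynamic graph is built for every time slice.
        # A_static: (N, N) predefined weighted adjacency matrix.
        _, N, _ = x.shape
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)            # Eq. (2)
        eye = torch.eye(N, device=x.device)
        A_dyn = torch.relu(Q @ K.transpose(-1, -2) + eye)          # Eq. (3): sparse dynamic graph
        A_comb = torch.softmax(A_dyn * A_static, dim=-1)           # Eq. (4): fuse with static graph
        return A_comb @ V                                          # (P, N, d)
```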
3.4. Gated Residual Connection
As shown in Figure 1, we apply a gate function to regulate the flow of residual information, drawing inspiration from the gate mechanism in the Gated Recurrent Unit (GRU). This gate regulates how much the previous residual information influences the following module. Specifically, a gated residual shortcut path is added to the DGC module, fusing the spatial correlations in the output $H_S^{(l)}$ with the spatial-temporal features of the input $X_{ST}^{(l-1)} \in \mathbb{R}^{P \times N \times d}$ of the $l$-th DGST block (with $X_{ST}^{(0)} = X'_S$), which can be formulated as:

$X_T^{(l)} = g_s \odot H_S^{(l)} + X_{ST}^{(l-1)} W_{res}$,
$g_s = \sigma\!\left(X_{ST}^{(l-1)} U_g\right)$,    (5)

where $X_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ is the input of the following temporal attention module, $g_s \in \mathbb{R}^{P \times N \times d}$ denotes the gate, $W_{res} \in \mathbb{R}^{d \times d}$ denotes the linear projection weight, $U_g \in \mathbb{R}^{d \times d}$ denotes the state-to-state weight matrix, $\odot$ is the element-wise product, and $\sigma$ stands for the $\mathrm{sigmoid}(\cdot)$ non-linear activation function.
Similar to the DGC module, we also add a gated residual shortcut path to the following temporal attention module, which combines the spatial correlations of the input $X_T^{(l)}$ with the temporal dependencies of the output $H_T^{(l)}$, building an interaction between the extracted spatial and temporal information that learns the spatial-temporal representation $X_{ST}^{(l)} \in \mathbb{R}^{P \times N \times d}$. This process can be written as:

$X_{ST}^{(l)} = g_t \odot H_T^{(l)} + X_T^{(l)} W_{res}$,
$g_t = \sigma\!\left(X_T^{(l)} U_g\right)$,    (6)

where $H_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ is the output of the following temporal attention module and $g_t$ is the gate.

3.5. Temporal Attention Module
The temporal attention module takes the output $X_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ of the gated residual connection in the DGC module as input and views the spatial dimension of $X_T^{(l)}$ as a batch dimension. It then uses a Transformer-based encoder to model temporal dependencies from $X_T^{(l)}$ with spatial correlations. It mainly consists of attention aggregation and FFN refinement, with the former highlighting relative temporal cues and the latter updating the refined features. The output tensor $H_T^{(l)} \in \mathbb{R}^{P \times N \times d}$ of the temporal attention module can be computed as follows:

$Q_T = X_T^{(l)} W'_q, \quad K_T = X_T^{(l)} W'_k, \quad V_T = X_T^{(l)} W'_v$,
$V'_T = \mathrm{LN}\!\left(\mathrm{softmax}\!\left(\frac{Q_T K_T^\top}{\sqrt{d}}\right) V_T + X_T^{(l)}\right)$,
$H_T^{(l)} = \mathrm{LN}\!\left(\mathrm{FFN}(V'_T) + V'_T\right)$,    (7)

where $W'_q, W'_k, W'_v \in \mathbb{R}^{d \times d}$ are weight matrices, $\mathrm{LN}$ indicates layer normalization, and $\mathrm{FFN}$ is the feed-forward network. The above calculation can also be extended in a multi-head manner.

3.6. Loss Function
The output layer takes the last DGST block's output $X_{ST}^{(l)} \in \mathbb{R}^{P \times N \times d}$ as input and transforms it into the final prediction $\hat{Y} \in \mathbb{R}^{F \times N \times C}$. The proposed DSTGNN can be trained in an end-to-end style via back-propagation by minimizing the mean absolute error between the predicted values and the ground truth:

$\mathcal{L}(\Theta) = \frac{1}{F \times N} \sum_{t=1}^{F} \sum_{n=1}^{N} \left| \hat{Y}_{t,n} - Y_{t,n} \right|$,    (8)

where $\Theta$ denotes all the learnable parameters of our model and $Y$ denotes the ground truth. A sketch of how Equations (5)–(8) fit together inside one DGST block is given below.
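The following sketch assembles one DGST block from the pieces above, reusing the `DGCModule` sketch from Section 3.3. It is a minimal illustration under several assumptions: `nn.MultiheadAttention` stands in for the attention of Eq. (7), the FFN width of $4d$ is assumed, and separate `W_res`/`U_g` parameters are used for the two gates even though the paper reuses the same symbols.

```python
import torch
import torch.nn as nn

class DGSTBlock(nn.Module):
    """Sketch of one DGST block: DGC (Eqs. 2-4) -> gated residual (Eq. 5)
    -> temporal attention (Eq. 7) -> gated residual (Eq. 6)."""

    def __init__(self, d: int, n_heads: int = 1):
        super().__init__()
        self.dgc = DGCModule(d)                      # spatial part, sketched in Section 3.3
        self.W_res_s = nn.Linear(d, d, bias=False)   # W_res in Eq. (5)
        self.U_g_s = nn.Linear(d, d, bias=False)     # U_g in Eq. (5)
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.W_res_t = nn.Linear(d, d, bias=False)   # W_res in Eq. (6)
        self.U_g_t = nn.Linear(d, d, bias=False)     # U_g in Eq. (6)

    def forward(self, x_st: torch.Tensor, A_static: torch.Tensor) -> torch.Tensor:
        # x_st: (P, N, d) output of the previous block (or the embedded input).
        h_s = self.dgc(x_st, A_static)                              # Eqs. (2)-(4)
        g_s = torch.sigmoid(self.U_g_s(x_st))                       # gate of Eq. (5)
        x_t = g_s * h_s + self.W_res_s(x_st)                        # Eq. (5)

        # Temporal attention: treat the spatial axis as the batch axis.
        x_tn = x_t.transpose(0, 1)                                  # (N, P, d)
        v, _ = self.attn(x_tn, x_tn, x_tn)
        v = self.ln1(v + x_tn)                                      # Eq. (7), attention + residual
        h_t = self.ln2(self.ffn(v) + v).transpose(0, 1)             # Eq. (7), back to (P, N, d)

        g_t = torch.sigmoid(self.U_g_t(x_t))                        # gate of Eq. (6)
        return g_t * h_t + self.W_res_t(x_t)                        # Eq. (6)
```

Stacking several such blocks and adding the output layer, training then amounts to minimizing the MAE of Eq. (8), e.g. `loss = torch.mean(torch.abs(y_hat - y))`.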
4. Experiments

4.1. Experimental Settings

4.1.1. Datasets
We evaluate our DSTGNN on two public real-world traffic datasets, PeMSD7 and PEMS-BAY. PeMSD7 contains two months of traffic data collected from 228 sensors during the weekdays of May and June 2012 in District 7 of California. PEMS-BAY contains six months of traffic data from 325 sensors in the Bay Area, from January 1st to May 31st, 2017. Following STGCN [5], PeMSD7 uses the first 34 days as the training set and the remaining days as the validation and test sets. As with DCRNN [3], PEMS-BAY is divided into three sets: training (70%), validation (10%), and test (20%).

4.1.2. Evaluation Metric
Metrics. To evaluate the performance of our model, we employ three metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE); a sketch of these metrics follows this subsection.
Baselines. Our DSTGNN is compared with the following baselines: ARIMA [7], FC-LSTM [8], DCRNN [3], STGCN [5], GMAN [6], ASTGCN [9], STTN [11], Graph WaveNet [10], and LSGCN [2].
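For reference, a minimal NumPy sketch of the three metrics is shown below. Masking of missing or zero values, which is common practice on PeMS-style data, is not described in the paper; the small `eps` guard in MAPE is our assumption.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_pred - y_true)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.square(y_pred - y_true))))

def mape(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-5) -> float:
    # eps guards against division by zero for near-zero ground-truth values.
    return float(np.mean(np.abs((y_pred - y_true) / (y_true + eps))) * 100.0)
```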
4.2. Experimental Results and Analysis
Table 1: Traffic forecasting performance comparison of DSTGNN and other baselines.
Table 1 shows the performance of DSTGNN and the baselines for 15-, 30-, and 60-minute-ahead prediction on the PeMSD7 and PEMS-BAY datasets. The comparison shows that DSTGNN can compete with the state-of-the-art methods in both long-term and short-term prediction on both datasets, while outperforming the predefined-graph-based spatial-temporal models, namely STGCN [5] and DCRNN [3]. In terms of short-term prediction (<= 30 min), DSTGNN outperforms STTN [11] and GMAN [6] on PEMS-BAY but falls short of Graph WaveNet [10]. It is superior to Graph WaveNet in long-term prediction (60 min), competitive with STTN, and weaker than GMAN. On PeMSD7, DSTGNN displays similar results. This indicates that DSTGNN performs better in long-term forecasting due to the gated residual connections in the DGST block, which sufficiently incorporate related spatial-temporal information and alleviate the accumulation of errors over time.

5. Conclusion
In this paper, we propose a novel framework named Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN) for long- and short-term traffic forecasting. In DSTGNN, we adopt the DGC module to precisely integrate both static and dynamic spatial correlations and, at the same time, use the temporal attention module to capture evolution cues in time series. In addition, we add gated residual connections to the proposed DGST block to fuse the extracted spatial and temporal features. Experiments on two real traffic datasets verify the effectiveness of DSTGNN in modeling spatial-temporal correlations from time series. In the future, we will consider applying DSTGNN to more general spatial-temporal graph sequence forecasting tasks, such as preference prediction in recommendation systems.

6. Acknowledgements
This work was supported by Wuhan Institute of Technology under Grant No. CX2021277.

7. References
[1] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. Computer Science, 2013.
[2] Rongzhou Huang, Chuyin Huang, Yubao Liu, Genan Dai, and Weiyang Kong. LSGCN: Long short-term traffic prediction with graph convolutional networks. In IJCAI, pages 2355–2361, 2020.
[3] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
[4] Usue Mori, Alexander Mendiburu, Maite Álvarez, and Jose A. Lozano. A review of travel time estimation and forecasting for advanced traveller information systems. Transportmetrica A: Transport Science, 11(2):119–157, 2015.
[5] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.
[6] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. GMAN: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1234–1241, 2020.
[7] Spyros Makridakis and Michele Hibon. ARMA models and the Box–Jenkins methodology. Journal of Forecasting, 16(3):147–163, 1997.
[8] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.
[9] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 922–929, 2019.
[10] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph WaveNet for deep spatial-temporal graph modeling. arXiv preprint arXiv:1906.00121, 2019.
[11] Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
[12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29, 2016.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[14] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.
[16] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.