<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Gated Spatial Temporal Graph Neural Networks for Traffic Forecasting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ziyan Gui</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changhui Liu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Li Xiong</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zuoquan Xie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Central China Normal University</institution>
          ,
          <addr-line>Wuhan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Southwestern University of Finance and Economics</institution>
          ,
          <addr-line>Chengdu</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Wuhan Institute of Technology</institution>
          ,
          <addr-line>Wuhan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>187</fpage>
      <lpage>193</lpage>
      <abstract>
        <p>Traffic forecasting is crucial to intelligent transportation systems, and very challenging due to the uncertainty and complexity of spatial-temporal dependencies in real-world traffic networks. Many existing approaches use a pre-defined graph to model spatial correlations, but they fail to capture the latent spatial evolution. Dynamic graph-based methods have been proposed to address this issue; however, they model spatial and temporal dependencies separately, without internal connection. In this paper, we propose a novel Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN) for traffic forecasting, which can capture time-varying spatial correlations and temporal dependencies jointly. Besides, we apply a gate mechanism to the residual connections between extracted spatial and temporal features. Experimental results on two real-world traffic datasets demonstrate the effectiveness of DSTGNN and show that it can compete with state-of-the-art baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Traffic forecasting</kwd>
        <kwd>DSTGNN</kwd>
        <kwd>Spatial-Temporal correlation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As a core component of Intelligent Transportation System (ITS)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], real-time and accurate traffic forecasting is crucial for road resource planning and public traffic safety. The key to traffic forecasting is to capture dynamic and uncertain spatial-temporal dependencies from historical data.
      </p>
      <p>
        In early deep learning approaches, convolutional neural networks (CNNs) were used to extract spatial correlations and recurrent neural networks (RNNs) to model temporal dependencies. However, CNNs are only suitable for capturing spatial features in grid data and perform poorly in non-Euclidean space. Recently, graph neural networks (GNNs) [
        <xref ref-type="bibr" rid="ref1">1,12,13</xref>
        ] have generalized convolution to graph-structured data and can extract the intrinsic spatial topological information of graphs.
      </p>
      <p>
        To model both temporal and spatial dependencies, many early approaches [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] combined GNNs with RNN-based sequence models and achieved improvements. However, most of them model spatial correlations based on a predefined static graph and cannot capture spatial dynamics. With the advent of self-attention in the Transformer [13], spatial and temporal attention are adopted by these methods [
        <xref ref-type="bibr" rid="ref11 ref6 ref9">6,9,11</xref>
        ] to model dynamic spatial-temporal correlations, yielding notable improvements. However, spatial-temporal correlations are not captured interactively, resulting in irrelevant or redundant information being learned.
      </p>
      <p>In this paper, we propose an end-to-end model named Dynamic gated Spatial Temporal Graph Neural Network (DSTGNN), which models spatial-temporal dependencies jointly to address the above issues. Specifically, we stack multiple proposed dynamic gated spatial-temporal (DGST) blocks to extract spatial-temporal features from historical traffic data. Each DGST block consists of a dynamic graph convolution module for extracting dynamic and static spatial correlations, a temporal attention module for modeling time dynamics, and a gated residual connection for interaction between the extracted spatial and temporal features. The contributions of this paper can be summarized as follows:
• We propose the dynamic gated spatial-temporal block that jointly models the spatial and temporal dependencies of traffic data, and then makes predictions in a non-autoregressive way.
• Experiments on two real-world traffic datasets are conducted to evaluate the effectiveness of our model, and the results show that our model can compete with the state-of-the-art methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Graph Convolutional Network</title>
      <p>
        As an efficient variant of CNNs on graph-structured data, graph convolutional networks (GCNs) have been applied in various areas and achieved state-of-the-art results. In the spectral perspective [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], GCNs need to compute the eigen-decomposition of the Laplacian matrix, which leads to a huge consumption of computational resources. Subsequently, methods [12,14] based on Chebyshev polynomial approximation were proposed to improve computing efficiency. In addition, GCNs based on the spatial perspective [15,16] not only avoid the eigen-decomposition of the Laplacian matrix but also learn vertex representations inductively. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes STGCN, which combines GCN with a standard 1D temporal CNN to tackle traffic time-series prediction. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposes Graph WaveNet, which learns static adjacency matrices for spatial-temporal modeling but fails to capture dynamic spatial correlations.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Attention-based Traffic Forecasting</title>
      <p>
        The attention mechanism has been extensively utilized in many domains such as natural language processing and graph learning (GAT [16]). Recently, researchers have applied attention mechanisms to model spatial-temporal dependencies for traffic forecasting. GMAN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposes the ST-Attention block, which adds a transform attention layer between the encoder and decoder and models dynamic spatial-temporal correlations in both the encoder and decoder. For learning spatial-temporal features, ASTGCN [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] employs attention mechanisms in the spatial and temporal dimensions, respectively. STTN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] builds a general dynamic graph neural network to model the spatial dependencies that change over time. LSGCN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] combines GCN with graph attention for long- and short-term traffic prediction.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-6">
      <title>3.1. Problem Definition</title>
      <p>In this study, a traffic network is denoted as a directed weighted graph G = (V, E, A), where V is a set of N = |V| nodes representing sensors in the traffic network; E is a set of edges indicating the connectivity among the nodes; and A ∈ ℝ^{N×N} is the weighted adjacency matrix of graph G, representing the proximity measured by the Euclidean distances between sensors via a Gaussian kernel.</p>
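      <p>As a minimal sketch of this construction, A can be built from pairwise Euclidean distances passed through a Gaussian kernel; the sensor coordinates and bandwidth below are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np

# Hypothetical 2-D sensor coordinates (N = 3 sensors); real datasets would
# supply measured positions or road-network distances instead.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
sigma = 2.0  # kernel bandwidth (assumed hyperparameter)

# Pairwise Euclidean distances: dist[i, j] = ||coords[i] - coords[j]||
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Gaussian kernel: nearby sensors get weights near 1, distant ones near 0
A = np.exp(-(dist ** 2) / (sigma ** 2))
```
      </p>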
      <p>At each time step t, the traffic conditions can be represented as a graph signal X_t ∈ ℝ^{N×C} on graph G, where C is the number of observed traffic conditions, such as traffic speed, traffic density and so on.</p>
      <p>Given a traffic network graph G and the traffic conditions of the historical P time steps [X_{t−P+1}, …, X_t] ∈ ℝ^{P×N×C} observed by the N nodes, we aim to learn a function f to forecast the traffic conditions of the next F time steps over all nodes. The process can be formulated as:</p>
      <p>[X̂_{t+1}, …, X̂_{t+F}] = f(X_{t−P+1}, …, X_t; G), (1)
where [X̂_{t+1}, …, X̂_{t+F}] ∈ ℝ^{F×N×C}.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2. Overall Architecture</title>
      <p>The overall architecture of the proposed DSTGNN is shown in Figure 1; it consists of three main components: an input layer, multiple stacked dynamic gated spatial-temporal (DGST) blocks, and an output layer. We first use a two-layer fully-connected network to project the traffic data into a high-dimensional space before it enters the model; the stacked DGST blocks then jointly extract spatial and temporal dependencies from the input. Finally, the output layer transforms the features carrying spatial-temporal information from the high-dimensional space back to traffic speed. Besides, DSTGNN predicts future traffic conditions in a multi-step manner. The detailed modules are introduced below.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3. Dynamic Graph Convolution Module</title>
      <p>The DGC module in the lth DGST block takes X_ST^(l−1) ∈ ℝ^{P×N×d} as input (for the DGC module in the first DGST block, X_ST^(0) = X′_S). We first view the temporal dimension of the input as a batch dimension, and then transform it to the query, key and value subspaces by three different linear projections, obtaining Q_S, K_S, V_S ∈ ℝ^{P×N×d}, formally:</p>
      <p>Q_S = X_ST^(l−1) W_q, K_S = X_ST^(l−1) W_k, V_S = X_ST^(l−1) W_v, (2)
where W_q, W_k, W_v ∈ ℝ^{d×d} are all learnable weight matrices and d is the dimension of the feature space.</p>
      <p>Next, we use ReLU(·) to sparsify the dense adjacency matrix derived from the dot product:
Ã = ReLU(Q_S K_S^T) + I_N, (3)
where Ã ∈ ℝ^{N×N} is the sparse spatial correlation matrix and I_N is an identity matrix added to enhance self-connections. Furthermore, to capture both dynamic and static spatial correlations, we combine Ã with the predefined adjacency matrix A by element-wise product with the broadcasting mechanism. A softmax(·) function is then adopted to normalize the combined adjacency matrix to avoid gradient vanishing or explosion. The final graph convolution operation can be expressed formally as:</p>
      <p>H_S^(l) = softmax(Ã ⊙ A) V_S,
where H_S^(l) ∈ ℝ^{P×N×d} denotes the output of the DGC module in the lth DGST block, and ⊙ stands for the element-wise product.</p>
    </sec>
    <sec id="sec-9">
      <title>3.4. Gated Residual Connection</title>
      <p>As shown in Figure 1, we apply a gate function to regulate the flow of residual information, drawing inspiration from the gate mechanism in the Gated Recurrent Unit (GRU). This gate regulates how much the previous residual information can influence the following module. Specifically, a gated residual shortcut path is added to the DGC module, fusing the spatial correlations of its output H_S^(l) with the input's spatial-temporal features X_ST^(l−1) ∈ ℝ^{P×N×d} in the lth DGST block (with X_ST^(0) = X′_S), which can be formulated as:</p>
      <p>X_T^(l) = g_s ⊙ H_S^(l) + X_ST^(l−1) W_res, (4)
g_s = σ(X_ST^(l−1) U_g), (5)
where X_T^(l) ∈ ℝ^{P×N×d} is the following temporal attention module's input, g_s ∈ ℝ^{P×N×d} denotes the gate, W_res ∈ ℝ^{d×d} denotes the linear projection weight, U_g ∈ ℝ^{d×d} denotes the state-to-state weight matrix, ⊙ is the element-wise product, and σ(·) stands for the sigmoid non-linear activation function.</p>
      <p>Similar to the DGC module, we also add a gated residual shortcut path to the following temporal attention module, which combines the spatial correlations of its input X_T^(l) with the temporal dependencies of its output H_T^(l), building an interaction between the extracted spatial and temporal information from which the spatial-temporal representations X_ST^(l) ∈ ℝ^{P×N×d} are learned. This process can be written as:</p>
      <p>X_ST^(l) = g_t ⊙ H_T^(l) + X_T^(l) W_res, (6)
g_t = σ(X_T^(l) U_g), (7)
where H_T^(l) ∈ ℝ^{P×N×d} is the output of the following temporal attention module, and g_t is the gate.</p>
    </sec>
    <sec id="sec-10">
      <title>3.5. Temporal Attention Module</title>
      <p>The temporal attention module takes the output X_T^(l) ∈ ℝ^{P×N×d} of the gated residual connection in the DGC module as input and views the spatial dimension of X_T^(l) as a batch dimension. It then uses a Transformer-based encoder to model temporal dependencies from X_T^(l) with spatial correlations. The module mainly consists of attention aggregation and FFN refinement, with the former highlighting relevant temporal cues and the latter updating the refined features. The output tensor H_T^(l) ∈ ℝ^{P×N×d} of the temporal attention module can be computed as follows:
Q_T = X_T^(l) W′_q, K_T = X_T^(l) W′_k, V_T = X_T^(l) W′_v,
V′_T = LN(softmax(Q_T K_T^T / √d) V_T + X_T^(l)),
H_T^(l) = LN(FFN(V′_T) + V′_T),
where W′_q, W′_k, W′_v ∈ ℝ^{d×d} are all weight matrices, LN indicates layer normalization, and FFN is the feed-forward network. The above calculation can also be extended in a multi-head manner.</p>
    </sec>
    <sec id="sec-11">
      <title>3.6. Loss Function</title>
      <p>The output layer takes the last DGST block's output X_ST^(l) ∈ ℝ^{P×N×d} as input, then transforms it to the final prediction results Ŷ ∈ ℝ^{F×N×C}. The proposed DSTGNN can be trained in an end-to-end style via backpropagation by minimizing the mean absolute error between predicted values and ground truths:</p>
      <p>L(Θ) = (1 / (F × N)) Σ_{t=1}^{F} Σ_{n=1}^{N} |Ŷ_{t,n} − Y_{t,n}|, (8)
where Θ denotes all the learnable parameters in our model, and Y denotes the ground truth.</p>
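      <p>The training objective is the standard mean absolute error; a minimal sketch:

```python
import numpy as np

def mae_loss(Y_hat, Y):
    """Mean absolute error averaged over all predicted entries."""
    return np.abs(Y_hat - Y).mean()
```
      </p>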
    </sec>
    <sec id="sec-12">
      <title>4. Experiments</title>
    </sec>
    <sec id="sec-13">
      <title>4.1. Experimental Settings</title>
    </sec>
    <sec id="sec-14">
      <title>4.1.1. Datasets</title>
      <p>
        We evaluate our DSTGNN on two public real-world traffic datasets, PeMSD7 and PEMS-BAY, detailed below. PeMSD7 contains two months of traffic data collected from 228 sensors during the weekdays of May and June 2012 in California's District 7. PEMS-BAY contains six months of traffic data from 325 sensors in the Bay Area from January 1st to May 31st, 2017. Following STGCN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], PeMSD7 uses the first 34 days as the training set and the remaining days as the validation and test sets. As with DCRNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
PEMS-BAY is divided into three sets: training (70%), validation (10%), and test (20%).
      </p>
    </sec>
    <sec id="sec-15">
      <title>4.1.2. Evaluation Metric</title>
      <p>Metrics. To evaluate the performance of our model, we employ three metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE).</p>
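      <p>The three metrics can be computed as follows (a minimal sketch; MAPE is reported in percent and assumes non-zero ground truths):

```python
import numpy as np

def evaluate(y_hat, y):
    """Return (MAE, RMSE, MAPE) for predictions y_hat against ground truth y."""
    err = y_hat - y
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = (np.abs(err) / np.abs(y)).mean() * 100.0  # percent; y must be non-zero
    return mae, rmse, mape
```
      </p>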
      <p>
        Baselines. Our DSTGNN is compared to the following baselines: ARIMA[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], FC-LSTM[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
DCRNN[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], STGCN[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], GMAN[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], ASTGCN[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], STTN[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Graph WaveNet[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], LSGCN[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-16">
      <title>4.2. Experimental Results and Analysis</title>
      <p>
        Table 1 shows the experimental performance of DSTGNN and the baselines for 15-, 30- and 60-minute-ahead prediction on the PeMSD7 and PEMS-BAY datasets. The comparison results show that DSTGNN can compete with state-of-the-art methods in both long-term and short-term prediction on both datasets, while outperforming the predefined graph-based spatial-temporal models, namely STGCN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DCRNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In terms of short-term prediction (&lt;= 30 min), DSTGNN outperforms STTN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and GMAN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] on PEMS-BAY but falls short of Graph WaveNet [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It is superior to Graph WaveNet in long-term prediction (60 min), competitive with STTN, and weaker than GMAN. On PeMSD7, DSTGNN displays similar results. This indicates that DSTGNN performs better in long-term forecasting due to its use of gated residual connections in the DGST block, which sufficiently incorporate related spatial-temporal information and alleviate the accumulation of errors over time.
      </p>
    </sec>
    <sec id="sec-17">
      <title>5. Conclusion</title>
      <p>In this paper, we propose a novel framework named dynamic gated spatial temporal graph neural network (DSTGNN) for long- and short-term traffic forecasting. In DSTGNN, we adopt the DGC module to precisely integrate both static and dynamic spatial correlations and at the same time use the temporal attention module to capture evolution cues in time series. Besides, we add the gated residual connection to the proposed DGST block to fuse the extracted spatial and temporal features. Experiments on two real traffic datasets verify the effectiveness of DSTGNN in modeling spatial-temporal correlations from time series. In the future, we will consider applying DSTGNN to more general spatial-temporal structural graph sequence forecasting tasks, such as preference prediction in recommendation systems.</p>
    </sec>
    <sec id="sec-18">
      <title>6. Acknowledgements</title>
      <p>This work was supported by Wuhan Institute of Technology under Grant No. CX2021277.</p>
    </sec>
    <sec id="sec-19">
      <title>7. References</title>
      <p>[12] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 29, 2016.</p>
      <p>[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.</p>
      <p>[14] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.</p>
      <p>[15] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, 30, 2017.</p>
      <p>[16] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bruna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaremba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Szlam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lecun</surname>
          </string-name>
          .
          <article-title>Spectral networks and locally connected networks on graphs</article-title>
          .
          <source>Computer Science</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Rongzhou</given-names>
            <surname>Huang</surname>
          </string-name>
          , Chuyin Huang, Yubao Liu, Genan Dai, and
          <string-name>
            <given-names>Weiyang</given-names>
            <surname>Kong</surname>
          </string-name>
          .
          <article-title>Lsgcn: Long short-term traffic prediction with graph convolutional networks</article-title>
          .
          <source>In IJCAI</source>
          , pages
          <fpage>2355</fpage>
          -
          <lpage>2361</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yaguang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rose</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cyrus</given-names>
            <surname>Shahabi</surname>
          </string-name>
          , and Yan Liu.
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          .
          <source>arXiv preprint arXiv:1707.01926</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Usue</given-names>
            <surname>Mori</surname>
          </string-name>
          , Alexander Mendiburu, Maite Álvarez, and Jose A. Lozano.
          <article-title>A review of travel time estimation and forecasting for advanced traveller information systems</article-title>
          .
          <source>Transportmetrica A: Transport Science</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>119</fpage>
          -
          <lpage>157</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Bing</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Haoteng</given-names>
            <surname>Yin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhanxing</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting</article-title>
          .
          <source>arXiv preprint arXiv:1709.04875</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Chuanpan</given-names>
            <surname>Zheng</surname>
          </string-name>
          , Xiaoliang Fan, Cheng Wang, and
          <string-name>
            <given-names>Jianzhong</given-names>
            <surname>Qi</surname>
          </string-name>
          .
          <article-title>Gman: A graph multi-attention network for traffic prediction</article-title>
          .
          <source>In Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>34</volume>
          , pages
          <fpage>1234</fpage>
          -
          <lpage>1241</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Spyros</given-names>
            <surname>Makridakis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michele</given-names>
            <surname>Hibon</surname>
          </string-name>
          .
          <article-title>ARMA models and the Box-Jenkins methodology</article-title>
          .
          <source>Journal of forecasting</source>
          ,
          <volume>16</volume>
          (
          <issue>3</issue>
          ):
          <fpage>147</fpage>
          -
          <lpage>163</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and Quoc V Le.
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>27</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Shengnan</given-names>
            <surname>Guo</surname>
          </string-name>
          , Youfang Lin, Ning Feng, Chao Song, and
          <string-name>
            <given-names>Huaiyu</given-names>
            <surname>Wan</surname>
          </string-name>
          .
          <article-title>Attention based spatialtemporal graph convolutional networks for traffic flow forecasting</article-title>
          .
          <source>In Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>33</volume>
          , pages
          <fpage>922</fpage>
          -
          <lpage>929</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Zonghan</given-names>
            <surname>Wu</surname>
          </string-name>
          , Shirui Pan,
          <string-name>
            <given-names>Guodong</given-names>
            <surname>Long</surname>
          </string-name>
          , Jing Jiang, and Chengqi Zhang.
          <article-title>Graph wavenet for deep spatial-temporal graph modeling</article-title>
          .
          <source>arXiv preprint arXiv:1906.00121</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Xu</surname>
          </string-name>
          , Wenrui Dai, Chunmiao Liu, Xing Gao,
          <string-name>
            <given-names>Weiyao</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guo-Jun</given-names>
            <surname>Qi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hongkai</given-names>
            <surname>Xiong</surname>
          </string-name>
          .
          <article-title>Spatial- temporal transformer networks for traffic flow forecasting</article-title>
          .
          <source>arXiv preprint arXiv:2001.02908</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>