=Paper= {{Paper |id=Vol-3842/paper20 |storemode=property |title=Traffic flow prediction model based on multi-scale pyramid space-time network |pdfUrl=https://ceur-ws.org/Vol-3842/paper20.pdf |volume=Vol-3842 |authors=Yanlie Zheng,Hu Jiwei,Xueying Li,Qingxia Shen,Chunzhi Wang |dblpUrl=https://dblp.org/rec/conf/bait/ZhengHLSW24 }} ==Traffic flow prediction model based on multi-scale pyramid space-time network== https://ceur-ws.org/Vol-3842/paper20.pdf
                                Traffic flow prediction model based on multi-scale
                                pyramid space-time network
                                Yanlie Zheng1,†, Hu Jiwei2,† , Xueying Li1,∗,† Qingxia Shen1,† and Chunzhi Wang3,†
                                1
                                  Fiberhome Telecommunication Technologies Co.,LTD, Wuhan 430068, China
                                2
                                  Wuhan Fiberhome Technical Services Co., Ltd., Wuhan 430205, China;
                                3
                                  Hubei University of Technology, Wuhan 430068, China


                                                Abstract
                                                Traffic flow prediction is an important component of intelligent transportation systems (ITS)
                                                and a challenging task. Although existing traffic flow prediction models have achieved good
                                                results, they model dynamic spatio-temporal correlations on only a single temporal or spatial
                                                scale and perform poorly on long-distance prediction. To address these problems, this paper
                                                proposes a traffic flow prediction model based on a multi-scale pyramid spatio-temporal
                                                network. First, a local spatio-temporal grid is generated by combining traffic data features
                                                with the adjacency matrix. Second, multiple convolutional layers aggregate sequences at
                                                multiple resolutions, and the temporal and spatial dimensions of the spatio-temporal grid are
                                                merged into a traffic event sequence. Next, an adaptive combination of pyramidal attention
                                                and a multi-channel spatio-temporal convolution module captures the dynamic spatio-temporal
                                                dependence of the sequence, and the global spatio-temporal features are obtained by fusion
                                                through fully connected layers. Finally, the predicted values are output based on the global
                                                spatio-temporal features. Experimental results on two publicly available datasets show that
                                                the model substantially reduces prediction error compared with the baseline methods.

                                                Keywords
                                                Grid, Spatio-temporal, Attention, Channel



                                1. Introduction
                                In recent years, sensor technology has developed rapidly and travel modes have become
                                more diverse, and intelligent transportation systems have become a development priority
                                in many countries. Given heavy urban traffic flow and high-speed areas, traffic
                                prediction has become a key research topic in intelligent transportation systems.
                                Traffic flow prediction methods learn the highly nonlinear characteristics of traffic
                                flow data in order to accurately predict the traffic flow of complex urban roads in the
                                coming period. Stable and reliable traffic flow prediction algorithms can help alleviate
                                traffic congestion and improve people's quality of life.

                                1
                                  BAIT’2024: The 1st International Workshop on “Bioinformatics and applied information technologies”, October 02-04,
                                2024, Zboriv, Ukraine
                                ∗
                                  Corresponding author.
                                †
                                  These authors contributed equally.
                                    zhengyanlie@fiberhome.com (Y. Zheng); hujiwei@fiberhome.com (J. Hu); xli@fiberhome.com (X. Li);
                                qshen@fiberhome.com (Q. Shen); chunzhiwang@hbut.edu.cn (C. Wang)
                                    0000-0003-3412-1639 (Y. Zheng); 0000-0002-9622-744X (J. Hu); 0009-0008-2733-3565 (X. Li); 0009-0000-1600-4717
                                (Q. Shen); 0000-0002-6742-3644 (C. Wang)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
    Traffic data contains important transportation system features, such as traffic flow, speed,
and time information recorded by road network sensors. Traditional statistical and machine
learning methods [1,2] analyze the complex spatio-temporal properties of these data to predict
traffic flow, but they perform poorly in capturing high-dimensional spatio-temporal features.
In recent years, transportation researchers have turned their attention to deep neural
networks. CNNs have been employed to capture the spatial features of target nodes and nearby
regional nodes [3]. Li et al. [4] modeled the diffusion process on directed graphs and combined
diffusion convolution with GRU recurrent neural networks to fuse the temporal and spatial
correlations of traffic flow. Tang et al. [5] proposed the spatio-temporal latent graph
structure learning network STLGSL, which employs a multilayer perceptron and k-nearest
neighbors to generate the graph structure and uses diffusion graph convolution, dilated causal
convolution, and a gating mechanism to mine the spatio-temporal features of the generated
graphs. Li et al. [6] proposed the Transformer-enhanced DetectorNet, which uses a multi-view
temporal attention module to capture long-range and short-range temporal correlations and
combines graph convolution with a dynamic attention module to aggregate the spatial features
of the generated dynamic graphs.
    The aforementioned studies on traffic flow prediction have achieved impressive results.
However, existing work tends to capture the pairwise impacts of spatio-temporal traffic
events and the spatio-temporal features of traffic data within a single temporal and spatial
scope. This makes it difficult to learn dependencies between locations that are distant in
time and space, and it does not comprehensively capture spatio-temporal dependencies at
different scales.
    To solve these two problems, this paper proposes a new traffic flow prediction model
based on a multi-scale pyramidal hybrid spatio-temporal network, called MSLST. First, the
original traffic data features (time and speed) and sensor distances are preprocessed, and a
local spatio-temporal grid is constructed by selecting the nodes that are spatially correlated
with the target node according to the generated adjacency matrices. Second, the temporal and
spatial dimensions of the spatio-temporal grid are merged into a sequence of traffic events,
and the sequence is processed by multiple convolutional layers with different strides to
obtain its feature information at different resolutions. Stacked pyramid attention is then
used to model the pairwise effects of traffic events at different spatio-temporal scales,
yielding coarse-scale and fine-scale spatio-temporal correlation features. Next, the
spatio-temporal features at different scales are transformed into different channels by
linear layers, and a spatio-temporal convolution block is applied to each layer separately to
capture the spatio-temporal information of other nodes near the target node in the local
spatio-temporal space. Finally, the spatio-temporal features of all layers are merged into one
channel, a fully connected layer transforms the merged features into global spatio-temporal
features, and traffic flow prediction is performed on these global features.
    The main contributions are summarized as follows:

   1.   The temporal and spatial dimensions of traffic data are merged into a single fluid,
        and pyramid attention is used to directly model the dynamic spatio-temporal
        associations between a target node and other nodes at different moments in the local
        spatio-temporal space, and to convey node information across different
        spatio-temporal ranges. This improves the model's ability to capture the highly
        nonlinear spatio-temporal features of traffic flow.
   2.   A multi-channel spatio-temporal convolutional block is proposed to perform gated
        aggregation of the spatio-temporal information at each scale of the pyramid. It
        flexibly mines compact representations of near and distant spatio-temporal features
        in the local spatio-temporal context, establishes one-to-many relationships between
        the target node and the other nodes in both time and space, and improves the
        performance of long-distance multistep prediction.
   3.   MSLST is evaluated on two real-world public datasets to validate the effectiveness
        of the model.

2. Model
This section describes the multi-scale local spatio-temporal network (MSLST) for modeling the
spatial and temporal information of traffic flow. As shown in Fig. 1, MSLST is composed of
three parts: local spatio-temporal grid construction, a multi-scale spatio-temporal attention
block, and a multi-channel spatio-temporal convolution block. The local spatio-temporal grid
construction stitches the spatio-temporal intervals between traffic events from the target
sensor and other sensors into a 3D grid. The multi-scale spatio-temporal attention block
captures the spatio-temporal correlation of traffic events in a single fluid under multiple
ranges of spatio-temporal context. The multi-channel spatio-temporal convolution block
aggregates temporal and spatial information across multiple channels to obtain global
spatio-temporal features. Finally, a fully connected layer is used to predict future traffic
flow.




Figure 1: MSLST architecture diagram.

2.1. Local spatio-temporal grid construction
Inspired by the local spatio-temporal structure[7], we utilize a Gaussian kernel to transform
the distance matrix of the sensors into a weight matrix to represent the connectivity of the
sensors, where a larger weight represents a smaller distance of the sensors, and the formula is
defined as follows:



                                                                                        (1)
where                    is the weight matrix at time step     ,           is the distance matrix,

and     denote two nodes of the weight matrix                  ,     is some time step, and        is a

hyperparameter, which is usually set to the standard deviation of all              .
   According to the weight size of               to extract a portion of nodes that are close to the
target node, and combined into a node set       , the relevant definition is as follows:



(2)

where since the traffic graph is a bidirectional graph, the values of                  and
are different unless         ;    is a weight threshold set in advance, which requires that the

degree of connectivity of other nodes connected to the target node      is greater than ; in
addition to this, fix the size of the node set to be , and for the node set with the number
of nodes less than , the remaining nodes need to be filled with the features of 0 and have no

connectivity to the target node        . And for the node number is greater than        , the    nodes

with the closest distance to the target node    are selected. Finally, the nodes of node set
are arranged according to their weights from largest to smallest.
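
The weighting and node selection steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact kernel form `exp(-d^2 / sigma^2)`, the function names, the use of `-1` as a padding marker, and the inclusion of the target node itself (its self-distance is 0, so it sorts first) are all assumptions.

```python
import numpy as np

def gaussian_kernel_weights(dist):
    """Turn a sensor distance matrix into connectivity weights in (0, 1]."""
    sigma = dist.std()                      # assumed: std of all distances
    return np.exp(-(dist ** 2) / (sigma ** 2))

def select_neighbors(weights, target, threshold, k):
    """Return k node indices for `target`, largest weight first; -1 marks padding."""
    w = weights[target]                     # row of a possibly asymmetric matrix
    order = np.argsort(-w, kind="stable")   # descending weight = ascending distance
    chosen = [int(j) for j in order if w[j] > threshold][:k]
    return chosen + [-1] * (k - len(chosen))

# Toy example with three sensors (distances are made up).
dist = np.array([[0.0, 1.0, 5.0],
                 [1.0, 0.0, 2.0],
                 [5.0, 2.0, 0.0]])
W = gaussian_kernel_weights(dist)
neighbors = select_neighbors(W, target=0, threshold=0.1, k=3)
```

In this toy case sensor 2 falls below the threshold, so the node set of sensor 0 is padded to the fixed size with one empty slot.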

   The information related to the traffic map structure at each time step                       and the
node traffic measurements are fused together to obtain the local spatio-temporal
                    of a target node as follows:



                                                                                                 (3)

where        is the traffic measurement value recorded by node                at time step       , and

           represents the stitching operation. The matrix             not only contains the spatial
structure relationship between the target node        at time step     and node        in the node set
  , but also encodes the traffic measurement information of node             . Therefore, the local
spatio-temporal        is defined as



(4)
    It is the necessary information used to train the traffic prediction model to predict the
traffic data of node .
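
One possible way to assemble such a local spatio-temporal grid is sketched below. The array layout `(time steps, neighbors, 2)`, the function name `build_local_grid`, and the use of a time-invariant weight vector are assumptions made for illustration; padded slots (index `-1`) keep zero features and zero connectivity, as described for node sets smaller than the fixed size.

```python
import numpy as np

def build_local_grid(weights_to_target, neighbor_ids, measurements):
    """Stack [weight, measurement] per time step and neighbor -> (T, K, 2)."""
    num_steps = measurements.shape[0]
    grid = np.zeros((num_steps, len(neighbor_ids), 2))
    for pos, node in enumerate(neighbor_ids):
        if node < 0:                        # padded slot: stays all zero
            continue
        grid[:, pos, 0] = weights_to_target[node]   # spatial structure
        grid[:, pos, 1] = measurements[:, node]     # traffic measurement
    return grid

# Toy input: 12 time steps, 3 sensors, made-up speeds and weights.
measurements = np.arange(36, dtype=float).reshape(12, 3)
weights_to_target = np.array([1.0, 0.5, 0.0])
grid = build_local_grid(weights_to_target, [0, 1, -1], measurements)

# Merging the temporal and spatial axes gives one sequence of traffic events.
events = grid.reshape(-1, 2)
```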
   In summary, the traffic flow prediction problem in this paper is defined as:



(5)
where         is the mapping function learned by the multi-scale spatio-temporal attention

block,          is the training function learned by the multi-channel spatio-temporal
convolution block,         is the historical time step,     is the future time step, the nodes are

                       , and    is the number of nodes in the traffic map.

2.2. Multi-scale spatio-temporal attention blocks
Inspired by Pyraformer [8], pyramid attention is introduced to describe the spatio-temporal
dependencies of individual traffic events in multiple resolutions. FPN feature pyramids[66]
utilize convolution to compute feature maps at different scales, preserving high-resolution
fine-grained semantic features while incorporating low-resolution coarse-grained semantic
features.
    In this section, multiple convolutional layers with kernel size 4 and stride 4 are used for
initialization in the dimension of time and space merging. Sequences of length             are
generated at scale . The dimensionality of each node is then reduced by a fully connected
layer, followed by connecting them to the output of the original sequence 1*1 convolution.
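
As a rough illustration of this multi-resolution initialization, the sketch below applies a convolution whose kernel size equals its stride (4) three times to an event sequence. The function name `build_pyramid`, the kernel layout, the feature width of 8, and the sequence length of 192 (16 neighbor nodes times 12 time steps) are assumptions for illustration; in the model the kernels would be learned, and the dimensionality reduction and 1*1 convolution are omitted here.

```python
import numpy as np

def build_pyramid(seq, kernels, stride=4):
    """Return [finest, ..., coarsest]; each kernel has shape (stride, D, D)."""
    scales = [seq]
    for kernel in kernels:
        x = scales[-1]
        length, dim = x.shape
        # Kernel size == stride, so windows do not overlap.
        windows = x.reshape(length // stride, stride, dim)
        # One weighted sum per window: (w, k, d) x (k, d, e) -> (w, e).
        scales.append(np.einsum("wkd,kde->we", windows, kernel))
    return scales

seq = np.ones((192, 8))                                 # 192 events, 8 features
kernels = [np.full((4, 8, 8), 1.0 / 32.0) for _ in range(3)]
pyramid = build_pyramid(seq, kernels)
lengths = [level.shape[0] for level in pyramid]         # 192, 48, 12, 3
```

With a stride of 4, each coarser scale summarizes four consecutive events of the scale below it, which is what lets coarse nodes carry long-range context.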
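    Assuming that $X$ and $Y$ are the input and output of a single attention, the input data
$X$ is first linearly transformed into three independent matrices $Q$, $K$, and $V$, where
$Q = XW_Q$, $K = XW_K$, and $V = XW_V$. The original attention mechanism can be
represented as follows:

$$ y_i = \sum_{j=1}^{L} \frac{\exp\left(q_i k_j^{\top} / \sqrt{D_K}\right)}{\sum_{l=1}^{L} \exp\left(q_i k_l^{\top} / \sqrt{D_K}\right)} \, v_j \qquad (6) $$

where $W_Q$, $W_K$, $W_V$ are the learnable weight matrices for transforming the traffic event
features into the query, key, and value space, $q_i$ denotes the $i$th row of $Q$, $k_j$
denotes the $j$th row of $K$, $v_j$ denotes the $j$th row of $V$, $L$ is the sequence length,
and $D_K$ is the dimension of the query and key vectors.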




Figure 2: Pyramid diagram.
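   The structure of the pyramid is shown in Fig. 2. Attention is restricted to a set of
neighbor nodes $\mathbb{N}_l^{(s)}$ of the current node $n_l^{(s)}$, where $n_l^{(s)}$ denotes
the $l$th node in the $s$th layer. $\mathbb{A}_l^{(s)}$ contains the neighbor nodes of the
current node in the same layer, including the node itself, $\mathbb{C}_l^{(s)}$ is the set of
$C$ child nodes of the current node in the C-fork tree, and $\mathbb{P}_l^{(s)}$ is the parent
node. They are specifically defined as:

$$ \mathbb{N}_l^{(s)} = \mathbb{A}_l^{(s)} \cup \mathbb{C}_l^{(s)} \cup \mathbb{P}_l^{(s)} \qquad (7) $$

$$ \mathbb{A}_l^{(s)} = \left\{ n_j^{(s)} : |j - l| \le \tfrac{A-1}{2},\ 1 \le j \le \tfrac{L}{C^{s-1}} \right\} \qquad (8) $$

$$ \mathbb{C}_l^{(s)} = \left\{ n_j^{(s-1)} : (l-1)C < j \le lC \right\} \ \text{if } s \ge 2, \ \text{else } \emptyset \qquad (9) $$

$$ \mathbb{P}_l^{(s)} = \left\{ n_j^{(s+1)} : j = \lceil l / C \rceil \right\} \ \text{if } s \le S-1, \ \text{else } \emptyset \qquad (10) $$

where $L$ is the length of the sequence at the finest scale.
   In this section, the number of pyramid layers $S$ is fixed at 4, the number of neighbor
nodes $A$ is 3 (including the node itself), and the number of child nodes $C$ is 4. Therefore,
the pyramid attention of the current node $n_l^{(s)}$ can be expressed as:

$$ y_l = \sum_{j \in \mathbb{N}_l^{(s)}} \frac{\exp\left(q_l k_j^{\top} / \sqrt{D_K}\right)}{\sum_{m \in \mathbb{N}_l^{(s)}} \exp\left(q_l k_m^{\top} / \sqrt{D_K}\right)} \, v_j \qquad (11) $$

where the queries, keys, and values are defined as in Eq. (6) and $D_K$ is the dimension of
the query and key vectors.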
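
The neighbor structure of pyramid attention can be made concrete with a small sketch. It assumes the Pyraformer-style definition that this section builds on (same-layer neighbors, children one layer below, one parent one layer above), 1-based indices, and a finest-scale length of 192; the function name `pyramid_neighbors` is hypothetical.

```python
import math

def pyramid_neighbors(s, l, seq_len, num_scales=4, a=3, c=4):
    """Return the (layer, index) pairs that node l of layer s attends to."""
    nodes_in_layer = seq_len // c ** (s - 1)
    half = (a - 1) // 2
    # Same-layer neighbors, including the node itself, clipped at the borders.
    same = [(s, j) for j in range(l - half, l + half + 1)
            if 1 <= j <= nodes_in_layer]
    # Children live one layer below (finer); layer 1 has none.
    children = ([(s - 1, j) for j in range((l - 1) * c + 1, l * c + 1)]
                if s >= 2 else [])
    # The parent lives one layer above (coarser); the top layer has none.
    parent = [(s + 1, math.ceil(l / c))] if s <= num_scales - 1 else []
    return same + children + parent

bottom = pyramid_neighbors(s=1, l=5, seq_len=192)
middle = pyramid_neighbors(s=2, l=1, seq_len=192)
top = pyramid_neighbors(s=4, l=3, seq_len=192)
```

Each node attends to at most A + C + 1 = 8 others, regardless of sequence length, instead of all L positions as in ordinary self-attention.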

2.3. Multi-channel spatio-temporal convolution block
The outputs of the four layers of the pyramid represent the spatio-temporal feature
representations of traffic events at different scales. In this section, the features output from
different layers of the last pyramid are first converted to different channel spaces, and then the
spatio-temporal convolution block is used to integrate regional spatio-temporal features of
traffic flows at different scales with different depths, and the regional spatio-temporal features
at various scales are spliced together, and finally the fully-connected layer converts the spliced
multiscale spatio-temporal features into predicted values of future traffic flows.
    The spatio-temporal convolution block is composed of three convolutions that capture the
effect of local spatio-temporal on the target traffic events, the temporal dependence of traffic
events at different moments of the same node within the local spatio-temporal, and the spatial
dependence between the node and its neighbors at the same moments within the local spatio-
temporal, the three convolution kernels are the spatio-temporal convolution kernel

           , the temporal convolution kernel               , and the spatial convolution kernel

          , and the output features of the sth layer of the pyramid,                   , serve
as the input of the spatio-temporal convolution block, and the formula of the spatio-temporal
convolution block is defined as follows:



(12)


(13)


(14)


(15)


(16)
where     represents the convolution operation,                    denotes the leakage corrected
linear unit function,                is the output of the spatio-temporal convolution block,
the dimension of the output is the same as the dimension of the input as the convolution

kernel            is set to        and the padding size to be 1.               is a 1*1 convolution

kernel for the aggregation of the three features of          ,       , and          and a uniform
number of channels to be      .
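
A single-channel sketch of such a block is given below. It is only an illustration under stated assumptions: the temporal and spatial kernels are taken to be 3*1 and 1*3, the activation is applied to each branch before aggregation, the 1*1 aggregation is reduced to three scalar mixing weights, and all names (`conv2d_same`, `st_conv_block`, `mix`) are hypothetical. Only the 3*3 spatio-temporal kernel with padding 1 is taken from the text.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding; keeps x's shape."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def leaky_relu(x, slope=0.01):
    """Leaky rectified linear unit."""
    return np.where(x > 0, x, slope * x)

def st_conv_block(x, k_st, k_t, k_s, mix):
    """x: (time, nodes) feature map of one pyramid layer."""
    branches = [leaky_relu(conv2d_same(x, k)) for k in (k_st, k_t, k_s)]
    # 1*1 aggregation of the three branches (scalar weights in this sketch).
    return sum(w * b for w, b in zip(mix, branches))

x = np.ones((4, 5))
k_st = np.ones((3, 3)) / 9.0     # joint spatio-temporal neighborhood
k_t = np.ones((3, 1)) / 3.0      # same node, adjacent time steps (assumed shape)
k_s = np.ones((1, 3)) / 3.0      # same time step, adjacent nodes (assumed shape)
out = st_conv_block(x, k_st, k_t, k_s, mix=(1 / 3, 1 / 3, 1 / 3))
```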

2.4. Prediction and Optimization
Next, prediction is performed by using two fully connected layers as prediction layers and
using         as input to the prediction layers. The final prediction result     is obtained.


                                         (17)
   In order to learn the parameter settings of the model accurately, the final total loss
function is expressed as follows.



                                            (18)

where      denotes the predicted value of the mth sensor at time t and            denotes the true
value of the mth sensor at time       . We optimize the model parameters using stochastic
gradient descent with backpropagation, using the Adam optimizer for the parameter updates.
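
Section 3.3 states that training uses an L1 loss. A common way to write such a loss, given here as a sketch with assumed symbols ($M$ sensors, $Q$ predicted time steps, and averaging over both), is:

```latex
\mathcal{L} = \frac{1}{M Q} \sum_{m=1}^{M} \sum_{t=1}^{Q}
              \left| \hat{y}_{t}^{m} - y_{t}^{m} \right|
```

Here $\hat{y}_{t}^{m}$ and $y_{t}^{m}$ stand for the predicted and true values of the $m$th sensor at time $t$; the normalization constant is an assumption and does not change the minimizer.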

3. Experimental comparison and analysis
3.1. Datasets

In this section, two publicly available real-world traffic prediction datasets are used to
validate the effectiveness of the model: PeMS-Bay [9] and METR-LA [9]. The PeMS-Bay dataset
contains measurements from 325 sensors in the Bay Area for the period from January 1, 2017 to
May 31, 2017. The METR-LA dataset contains measurements from 207 sensors in Los Angeles for
the period from March 1, 2012 to June 30, 2012. The speeds in both datasets are given in
miles per hour. Table 1 summarizes the statistics for both datasets.

Table 1
Statistics of the datasets
    Datasets      Time Steps    Nodes    Time Interval    Input Length    Output Length

    METR-LA       34272         207      5min             12              12

    PeMS-Bay      52116         325      5min             12              12



3.2. Evaluation Criteria
This section evaluates the performance of the model using three metrics that are more
commonly used in traffic flow prediction, including MAE, RMSE, and MAPE, with the relevant
formulas as follows:



                                                                                         (19)


                                                                                         (20)
where             ,   denotes the length of the output sequence,        denotes the length of the
input sequence,       and     are usually 12,   denotes the true value of the output sequence of a
node at a certain moment, and denotes the predicted value of the input sequence of a node
at a certain moment after the model.
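
The three metrics have standard definitions, sketched below in plain Python. The function names are illustrative, MAPE is returned in percent and assumes nonzero true values, and no masking of missing sensor readings is applied; these choices are assumptions rather than details given in this section.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; true values must be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up speeds in miles per hour.
y_true = [60.0, 50.0, 40.0]
y_pred = [57.0, 52.0, 40.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two can rank models differently.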

3.3. Algorithm Comparison and Experimental Setup
The number of neighbor nodes in the local spatio-temporal grid of the MSLST model is 16. For
the multi-scale spatio-temporal attention block, the number of output channels is 128, and the
pyramid is initialized with 4 convolutional layers: one 1*1 convolutional layer with stride 1
and three 4*4 convolutional layers with stride 4. Two pyramidal spatio-temporal attentions are
stacked, and each of them has 4 pyramid layers. The four spatio-temporal convolution layers of
the multi-channel spatio-temporal convolution block have 32, 64, 128, and 256 channels,
respectively, and the kernel size of the spatio-temporal convolution is 3.
    Each dataset is split in chronological order, with the first 70% as the training set, the
middle 10% as the validation set, and the last 20% as the test set, and 0.2 of the 70% is
randomly selected as the final training set. The lengths of the input sequence P and the
output sequence Q are both 12. For model training, the batch size is 80, the number of
training iterations is 50, the optimizer is Adam with a learning rate of 0.001, and the loss
function is the L1 loss between the true and predicted values. The dropout rate is 0.3.
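
The chronological split and the input/output windows can be sketched as follows. Splitting by time-step index before windowing, the integer rounding, and the function names are assumptions; the additional random subsampling of the training portion is not shown.

```python
def chronological_split(num_steps, train_pct=70, val_pct=10):
    """Split time-step indices in order into train, validation, and test ranges."""
    train_end = num_steps * train_pct // 100
    val_end = num_steps * (train_pct + val_pct) // 100
    return (range(0, train_end),
            range(train_end, val_end),
            range(val_end, num_steps))

def sliding_windows(series, p=12, q=12):
    """Pair each window of p past steps with the q steps that follow it."""
    return [(series[i:i + p], series[i + p:i + p + q])
            for i in range(len(series) - p - q + 1)]

# 34272 is the number of METR-LA time steps listed in Table 1.
train_idx, val_idx, test_idx = chronological_split(34272)
pairs = sliding_windows(list(range(30)))
```

With 5-minute readings, P = Q = 12 means one hour of history is used to predict the next hour.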
    In order to verify the superiority of the model MSLST for the traffic prediction task, this
section compares it with the following baseline methods:

   1.   ARIMA (Autoregressive Integrated Moving Average) [11], a well-known time series
        analysis method for predicting future values;
   2.   FC-LSTM (Fully Connected Long Short-Term Memory Network) [55], a sequence-to-
        sequence model with fully connected LSTM layers in both encoder and decoder;
   3.   STGCN (Spatio-Temporal Graph Convolutional Network) [9], a multi-scale traffic
        network that combines graph convolution and regular convolution;
   4.   DCRNN (Diffusion Convolutional Recurrent Neural Network) [4], an encoder-decoder
        structure based on diffusion convolution and GRU[?];
   5.   Graph WaveNet [46], an approach that combines dilated causal convolution and
        graph convolution to model spatio-temporal dependencies;
   6.   DetectorNet (Transformer-enhanced spatio-temporal graph neural network) [6], a
        Transformer spatio-temporal network combining a multi-view temporal attention
        module and a dynamic attention module.

3.4. Experimental results and analysis
1) Comparative Analysis of Mainstream Advanced Algorithms

In this section, the performance of MSLST is evaluated on two real-world traffic prediction
datasets and compared with state-of-the-art traffic flow prediction methods, as shown in
Tables 2 and 3, and the following conclusions can be drawn from the comparison results:

   1.   Deep learning models have better prediction ability than traditional time series
        methods on both datasets, which indicates that deep learning methods are better able
        to model the nonlinear relationships in traffic data.
   2.   The methods that combine deep learning with graph structure (STGCN, DCRNN, Graph
        WaveNet, DetectorNet, and MSLST) generally perform better than FC-LSTM, which
        demonstrates that information about the structure of the road network is crucial for
        traffic prediction.
   3.   DetectorNet and Graph WaveNet, which model dynamic spatial structure, have smaller
        prediction errors at both time steps than STGCN and DCRNN, which use a static road
        network structure. This shows that modeling the dynamic spatial correlation among
        roads better reflects changing road conditions, and that a learnable adaptive
        adjacency matrix can adapt to the uncertainty of traffic graphs; both retain
        valuable latent information.
   4.   On METR-LA, MSLST clearly improves on DetectorNet at both 30min and 60min. On
        PeMS-Bay, it improves on DetectorNet at 30min and in RMSE at 60min, while its 60min
        MAE and MAPE are slightly higher (1.82 vs. 1.80 and 4.26% vs. 4.19%). This indicates
        that fusing temporal and spatial dimensions into a single spatio-temporal fluid can
        directly model the dynamic spatio-temporal correlation among roads, which is a more
        complete characterization than aggregating independent temporal and spatial feature
        extraction modules. Compared to Graph WaveNet, MSLST reduces the MAE at 60min on
        METR-LA by 0.55, which suggests that pyramidal attention modeling local
        spatio-temporal correlations at coarser scales enhances long-distance dependence,
        while multi-channel spatio-temporal convolution aggregating the features at each
        scale of pyramidal attention provides compact representations of short-distance and
        long-distance spatio-temporal dependence.
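
The size of these gaps can be checked with a little arithmetic on the 60min METR-LA MAE values reported in Table 2. The helper name `reduction` is illustrative, and the relative figures are derived here from the table rather than reported by the authors.

```python
def reduction(baseline, model):
    """Absolute and relative (%) error reduction of `model` versus `baseline`."""
    diff = baseline - model
    return round(diff, 2), round(100 * diff / baseline, 1)

# 60min MAE on METR-LA, taken from Table 2.
vs_graph_wavenet = reduction(3.53, 2.98)
vs_detectornet = reduction(3.40, 2.98)
```

The first call reproduces the 0.55 MAE decrease mentioned in the text, which corresponds to roughly a 15.6% relative reduction.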

Table 2
Error comparison between MSLST of this paper and baseline on METR-LA dataset
         METR-LA                      30min                              60min

                          MAE       RMSE       MAPE         MAE       RMSE       MAPE

          ARIMA           5.15      10.45      12.70%       6.90      13.23      17.40%

         FC-LSTM          3.77      7.23       10.9%        4.37      8.69       13.20%

          STGCN           3.47      7.24       9.57%        4.59      9.40       12.70%

         DCRNN            3.15      6.45       8.80%        3.60      7.60       10.50%

      Graph WaveNet       3.07      6.22       8.37%        3.53      7.37       10.01%

        DetectorNet       3.06      6.08       8.12%        3.40      6.98       9.60%

          MSLST           2.50      5.19       6.78%        2.98      6.19       8.43%


Table 3
Error comparison between MSLST of this paper and baseline on PeMS-Bay dataset

         PeMS-Bay                     30min                              60min

                          MAE       RMSE       MAPE         MAE       RMSE       MAPE

          ARIMA           2.33      4.76       5.40%        3.38      6.50       8.30%

         FC-LSTM          2.20      4.55       5.20%        2.37      4.96       5.70%

          STGCN           1.81      4.27       4.17%        2.49      5.69       5.79%

         DCRNN            1.74      3.97       3.90%        2.07      4.74       4.90%

      Graph WaveNet       1.63      3.70       3.67%        1.95      4.52       4.63%

        DetectorNet       1.57      3.54       3.56%        1.80      4.26       4.19%

          MSLST           1.47      3.22       3.29%        1.82      4.19       4.26%


2) Component Analysis

    In this section, component analysis is performed on two datasets, METR-LA and PeMS-
Bay. Tables 4 and 5 show the statistical results of the three metrics for the component analysis
of the two datasets, and the components are analyzed as follows:

   1.    w/o PM: the fusion of temporal and spatial dimensions into a single fluid is
         removed; two independent self-attentions capture the temporal and spatial
         correlations of the traffic data separately, and a gating mechanism then fuses the
         two features.
   2.    w/o R-PA: pyramidal attention is replaced with ordinary self-attention.
   3.    w/o P-FPA: only the first-layer output of the pyramid is retained, and regional
         spatio-temporal features are extracted with a single-channel spatio-temporal
         convolution block.

Table 4
Component analysis of this paper's MSLST on the METR-LA dataset
                METR-LA                30min                            60min

                             MAE     RMSE     MAPE        MAE     RMSE      MAPE

                 MSLST       2.50    5.19     6.78%       2.98    6.19      8.43%

                w/o PM       3.00    6.01     8.09%       3.38    6.90      9.50%

                w/o R-PA     2.67    5.58     7.13%       3.21    6.53      9.12%

                w/o P-FPA    2.69    5.86     7.26%       3.32    6.84      9.26%

Table 5
Component analysis of MSLST in this paper on PeMS-Bay dataset
                PeMS-Bay               30min                            60min

                             MAE     RMSE     MAPE        MAE     RMSE      MAPE

                 MSLST       1.47    3.22     3.29%       1.82    4.19      4.26%

                w/o PM       1.76    4.01     3.94%       1.99    4.74      4.71%

                w/o R-PA     1.57    3.48     3.48%       1.94    4.67      4.63%

                w/o P-FPA    1.61    3.51     3.53%       1.95    4.70      4.67%

The following conclusions can be drawn from the results in Table 4 and Table 5:

1.    For both datasets, MSLST has clearly lower errors than w/o PM, indicating that fusing
      the temporal and spatial dimensions of traffic flow data into a single fluid models the
      spatio-temporal dynamics of traffic flow better, because the attention mechanism can
      then directly capture the spatio-temporal dependence of the target sensors at different
      moments.
2.    The error of w/o R-PA is larger than that of MSLST on both datasets, especially at
      60min. This shows that ordinary self-attention is weaker than pyramidal attention at
      long-distance multi-step prediction on a single spatio-temporal fluid; the
      multi-resolution structure of pyramidal attention effectively carries sensor
      information over long distances in time and space.
3.    For METR-LA and PeMS-Bay, w/o R-PA works slightly better than w/o P-FPA. This shows
      that aggregating the coarse-scale and fine-scale spatio-temporal features of pyramid
      attention with the multi-channel spatio-temporal convolutional block is an integral part
      of MSLST, and that the finest-scale features alone cannot capture the long-term
      dependence of traffic events.
3) Parametric Analysis

    To verify how much the number of stacked pyramid spatio-temporal attentions affects
MSLST, 1 to 4 pyramid spatio-temporal attentions are stacked in this section. Table 6 shows
the average MAE and RMSE over the 12 time steps.

Table 6
Parametric analysis of the number of stacked pyramid spatio-temporal attentions in MSLST
                    Stacks            METR-LA                    PeMS-Bay

                               MAE          RMSE        MAE           RMSE

                       1      2.35          4.46        1.26          2.54

                       2      2.22          4.18        1.17          2.37

                       3      2.27          4.24        1.19          2.42

                       4      2.31          4.41        1.23          2.49

   From the experimental table, it can be seen that stacking 2 pyramid spatio-temporal
attentions gives the best performance on both the METR-LA and PeMS-Bay datasets. As the
number of stacked pyramids increases beyond 2, the prediction error of the model does not
decrease but increases; a possible reason is that the model structure becomes too complex and
overfits. Two pyramid spatio-temporal attentions appear sufficient to explore the complex
spatio-temporal dependence of traffic events, providing adequate nonlinear capacity without
redundancy in the model structure.

4. Conclusion
In this paper, a traffic flow prediction model based on a multi-scale local spatio-temporal
network is proposed. First, the temporal and spatial dimensions are fused into a single fluid,
and a 3D local spatio-temporal grid is generated by combining the adjacency matrix with
traffic flow features. Multiple convolutional layers and pyramidal attention are then
introduced to learn the dynamic spatio-temporal dependence of local traffic events at
different resolutions. Finally, the multi-channel spatio-temporal convolutional block merges
and refines the spatial and temporal features among local spatio-temporal nodes to obtain the
global spatio-temporal features. Experiments on the traffic flow datasets PeMS-Bay and
METR-LA show that the proposed model outperforms the compared baseline methods on most
metrics.

References
[1] Fang Z, Long Q, Song G, et al. Spatial-temporal graph ODE networks for traffic flow
    forecasting[C]//Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &
    Data Mining. 2021: 364-373.
[2] Wang X, Ma Y, Wang Y, et al. Traffic flow prediction via spatial temporal graph neural
    network[C]//Proceedings of The Web Conference 2020. 2020: 1082-1092.
[3] Lu B, Gan X, Jin H, et al. Spatiotemporal adaptive gated graph convolution network for
    urban traffic flow forecasting[C]//Proceedings of the 29th ACM International Conference
    on Information & Knowledge Management. 2020: 1025-1034.
[4] Li Y, Yu R, Shahabi C, et al. Diffusion convolutional recurrent neural network: Data-
    driven traffic forecasting[C]//International Conference on Learning Representations.
    2018.
[5] Tang J, Qian T, Liu S, et al. Spatio-temporal latent graph structure learning for traffic
    forecasting[C]//2022 International Joint Conference on Neural Networks (IJCNN). IEEE,
    2022: 1-8.
[6] Li H, Zhang S, Li X, et al. DetectorNet: Transformer-enhanced spatial temporal graph
    neural network for traffic prediction[C]//Proceedings of the 29th International
    Conference on Advances in Geographic Information Systems. 2021: 133-136.
[7] Yang S, Liu J, Zhao K. Space meets time: Local spacetime neural network for traffic flow
    forecasting[C]//2021 IEEE International Conference on Data Mining (ICDM). IEEE, 2021:
    817-826.
[8] Liu S, Yu H, Liao C, et al. Pyraformer: Low-complexity pyramidal attention for long-
    range time series modeling and forecasting[C]//International Conference on Learning
    Representations. 2021.
[9] Yu B, Yin H, Zhu Z. Spatio-temporal graph convolutional networks: a deep learning
    framework for traffic forecasting[C]//Proceedings of the 27th International Joint
    Conference on Artificial Intelligence. 2018: 3634-3640.