Spatial-temporal Transformer Network with Self-supervised Learning for Traffic Flow Prediction

Zhangzhi Peng*, Xiaohui Huang
East China Jiaotong University

STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria
* Corresponding author: pengzhangzhics@gmail.com (Z. Peng); hxh016@gmail.com (X. Huang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract

Traffic flow prediction plays a critical role in improving the quality, security, and efficiency of Intelligent Transportation Systems (ITS). Accurate prediction requires modeling spatial and temporal characteristics simultaneously. Existing works usually extract spatial features with CNN-based modules and temporal features with RNN-based modules. However, CNN-based modules are locally biased and perform poorly on global spatial dependencies, while RNN-based modules concentrate on learning high-level temporal dynamics (e.g., periodicity) and fail to exploit the numerical closeness between future data and historical observations as strong prior knowledge for the prediction. To alleviate these limitations, we propose a Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet). ST-TSNet uses a Pre-Conv Block and a vision transformer to learn spatial dependencies in both local and global contexts. Furthermore, a skip connection from the input historical records to the output prediction is introduced to exploit similar patterns and improve the prediction results. Finally, a self-supervised strategy called stochastic augmentation is proposed to explore spatial-temporal representations from massive traffic data to benefit the prediction task. Experiments on two datasets, TaxiBJ and TaxiNYC, demonstrate the effectiveness of ST-TSNet. The code is available at https://github.com/pengzhangzhi/spatial-temporal-transformer.

1. Introduction

Traffic flow prediction is a building block of Intelligent Transportation Systems (ITS) and is essential for providing high-quality traffic services. An accurate prediction of future traffic flow depends on modeling the spatial-temporal information in the previous observations. This problem can be considered from the spatial and temporal perspectives. From the spatial perspective, learning local spatial correlations is essential, since traffic volume is most influenced by its nearest neighbors. However, in real-world scenarios, two distant regions may also be strongly correlated in their traffic distributions if they serve similar functions (e.g., transportation hubs). Most existing works [1, 2, 3] adopt convolutional layers as their backbone to extract spatial features, which may introduce a short-range bias due to their small receptive fields.
These methods perform well at extracting local context but struggle with global dependencies. Recently, the Vision Transformer (ViT) [4] has shown impressive performance in computer vision due to its innate power at extracting non-local features. We are therefore motivated to apply ViT to learn long-range spatial dependencies.

From the temporal perspective, many works have been proposed to extract complex temporal patterns, e.g., daily and weekly periodicity [1, 2]. However, we argue that a simple temporal characteristic, temporal similarity, is overlooked. Traffic flow data are generally smooth, with few abrupt changes, showing many similarities in adjacent frames. As depicted in the time series of Fig. 1, the ratio of the current traffic flow to the previous one (blue line) floats up and down around a fixed value of 1 as the traffic flow (orange line) periodically evolves. This means that adjacent traffic flow snapshots have close values and exhibit similar distributions. Thus, an intuitive idea is to use the historical observations as the base prediction for the future data. This motivation provides prior knowledge that forces the model to predict the future data partially from the original historical records instead of depending completely on the extracted temporal patterns. However, such similarity is overlooked in existing methods [5, 2], as they process the historical data for high-order temporal characteristics (e.g., periodicity), distorting the numerical similarity.

With the rapid growth of deployed traffic sensors, a massive amount of traffic flow data is collected but not fully utilized. Similarly, in the field of natural language processing (NLP), TB-level unlabeled corpora are collected, while relatively little labeled data is available for the various language tasks. In NLP, however, this gap is successfully alleviated by self-supervised learning [6], in which unlabeled data are used to learn language representations that are then transferred to facilitate downstream tasks. In the field of traffic flow prediction, by contrast, current training algorithms are supervised: the historical records are regarded as the input, and the traffic data at the next timestamp serves as the label. No effective unsupervised learning algorithms have been proposed to learn spatial-temporal representations that facilitate the traffic flow prediction task.

Driven by these analyses, we propose a novel framework called Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet). ST-TSNet consists of a Pre-Conv Block and a ViT for learning spatial correlations in both local and global contexts. In addition, we directly connect the historical data to the output to make full use of the historical data as base predictions. Lastly, a self-supervised task named stochastic augmentation is proposed to pre-train ST-TSNet to learn spatial-temporal representations, which are then fine-tuned to benefit the prediction task.

[Figure 1: The overall architecture of the Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet). The three time axes illustrate our pre-training strategy. The time series at the bottom shows the periodicity of traffic flow data; the blue line denotes the ratio, and the orange line denotes the normalized traffic flow observations. The figure reveals that as the traffic flow periodically changes, the ratio floats up and down around a fixed value of 1.]

The contributions of this work are summarized as follows.

• We propose a novel framework, the Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet), to capture spatial-temporal features.
• We employ a simple yet effective skip connection strategy, plugged into ST-TSNet, to make full use of the temporal similarities in traffic flow data.
• We introduce self-supervised learning to our framework and design a pre-training task called stochastic augmentation to explore spatial-temporal features and boost the traffic flow prediction task.
• We conduct extensive experiments on two benchmarks (TaxiBJ and TaxiNYC) to evaluate the effectiveness of our methods; the results show that ST-TSNet outperforms state-of-the-art methods.
2. Related Work

Traffic Flow Prediction. There are two types of flow data in the traffic flow prediction task, grid-like raster data and graph data, and thus two distinct paradigms are derived for the two types of data [7]. In our work, we focus on raster data. Existing mainstream traffic prediction methods for raster data fall into one of two classes: statistical methods or deep learning methods. Statistical methods include the auto-regressive integrated moving average (ARIMA) [8], Kalman filtering [9], and the historical average. These methods often require strong theoretical assumptions, which may conflict with the nonlinearity of traffic flow data and thus lead to poor performance in the real world. Recent advances have witnessed the impressive capacity of deep learning to extract nonlinear features from big data [10], inspiring many researchers to apply deep learning to the traffic flow prediction task. Existing deep learning methods are based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [11]. ST-ResNet [1] first employs CNNs with residual connections to learn the spatial dependencies and organizes the historical data into different branches according to their temporal semantics to learn temporal features. Similar ideas are adopted by subsequent works [2, 12], in which 3D convolution is used to learn spatial-temporal dependencies. Moreover, RNN-based models [13, 14] use convolutional layers to capture spatial features and a sequential hierarchy (e.g., LSTM and GRU) to extract temporal patterns. However, these methods are time-consuming, as they make predictions step by step, and may suffer from gradient vanishing or explosion when capturing long-range sequences [15]. To alleviate these problems, [15, 16] discard the recurrent chain structure and employ a cascade multiplicative unit (CMU) with autoencoders while preserving convolutional layers for learning spatial features. The methods used by existing works can thus be considered from the spatial and sequential perspectives. From the spatial perspective, convolutional layers are the mainstream, including 2D and 3D convolution. From the sequential perspective, there are many choices, including RNN, GRU, LSTM, and CMU. Most existing works are a combination of these methods. In summary, existing methods based on CNNs suffer from short-range bias, as the small receptive field limits their capacity to extract global dependencies.

Self-supervised Learning. Self-supervised learning is an effective way to extract training signals from massive amounts of unlabelled data and to learn general representations that facilitate downstream tasks for which labelled data are limited. To generate supervision information from the data itself, a general strategy is to define pre-training tasks for models [17, 18] to learn semantic representations, which are then transferred to downstream tasks to improve performance and robustness. Many works in computer vision have defined such tasks based on heuristics [19, 20]. For example, [21] learns visual representations by predicting image rotations. In natural language processing, masked language modeling, e.g., BERT [6], has been shown to be excellent for pre-training language models. These methods mask a portion of the input sequence and train models to predict the missing content from the rest; they are effective for learning semantic correlations of elements within a sequence, e.g., a sentence. Traffic flow data can also be viewed as a temporal sequence, yet the effectiveness of self-supervised learning remains unexplored in the traffic flow prediction task.
3. Methods

3.1. Problem Formulation

We partition a city into an image-like grid map according to the longitude and latitude, as shown in the traffic flow map of Fig. 1, where each grid denotes a region. The value of a grid denotes the traffic flow (inflow or outflow). The devices deployed in a region periodically record the number of people arriving at and departing from the location to collect the inflow and outflow. The traffic flow map of the entire city at time t is denoted as x_t ∈ R^(2×H×W), where 2 refers to the inflow and outflow, and H and W denote the number of rows and columns of the grid map, respectively. The purpose of traffic flow prediction is to predict x_n given the historical traffic flow records X_his = {x_t | t = 0, ..., n−1}. As shown in Fig. 1, the historical data is summarized into two categories along the time axis. The closeness sequence X_close = {x_(n−1), x_(n−2), ..., x_(n−(t_c−1)), x_(n−t_c)} ∈ R^(2×t_c×H×W) is a concatenation of recent historical data, where t_c is the length of the closeness sequence. The trend sequence X_trend = {x_(n−L_week), x_(n−2·L_week), ..., x_(n−t_t·L_week)} ∈ R^(2×t_t×H×W) is a concatenation of periodic historical data from the past few weeks, where t_t is the length of the trend sequence and L_week is the number of intervals within a week.
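To make the tensor shapes concrete, the following is a minimal sketch (ours, not the authors' released code) of how the two sequences could be sliced from the raw series. The helper name and the default lengths are hypothetical; L_week = 48 · 7 = 336 corresponds to half-hour intervals as in TaxiBJ.

```python
import torch

def build_sequences(flows, n, t_c=4, t_t=2, l_week=336):
    """Hypothetical helper: slice closeness and trend sequences for target index n.

    flows: tensor of shape (T, 2, H, W) holding the full series of traffic maps.
    Assumes n is large enough that all referenced indices are non-negative.
    """
    # Closeness: the t_c most recent snapshots x_(n-1), ..., x_(n-t_c).
    close = torch.stack([flows[n - i] for i in range(1, t_c + 1)], dim=1)           # (2, t_c, H, W)
    # Trend: snapshots exactly 1, 2, ..., t_t weeks before the target.
    trend = torch.stack([flows[n - i * l_week] for i in range(1, t_t + 1)], dim=1)  # (2, t_t, H, W)
    return close, trend, flows[n]  # the two inputs and the prediction target x_n
```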
3.2. Spatial-temporal Transformer Network

Overall, we employ a symmetric structure for handling the trend data X_trend and the closeness data X_close: a Pre-Conv Block followed by a ViT, with two shortcuts (i.e., the two blue lines shown in Fig. 1) from the inputs to the fusion layer. In the end, the fusion layer adaptively merges four components (the two residual components X̂_rc and X̂_rt, and the two outputs X̂_close and X̂_trend) to generate the prediction x̂_n.

Pre-Conv Block. The traffic flow in a region is highly relevant to its nearby regions. We design a Pre-Conv Block to capture such short-range dependencies. As illustrated in Fig. 1, Conv1 and Conv2 are the main convolutional layers that capture short-range dependencies. We employ a small kernel size (i.e., 3 × 3), which leads to a receptive field of 5. This design ensures that the Pre-Conv Block only captures local dependencies within at most 5 × 5 regions; the short-range dependencies are thus well captured by the Pre-Conv Block, leaving the long-range features to the vision transformer. Inserting CNNs before a ViT has been shown to be effective in strengthening the capacity of the ViT [22]. Conv3 is the residual shortcut, employing 64 kernels of size 1 × 1, whose output is added to the main branch as a residual component. Generally, we use many more kernels (e.g., 64) in these layers than in Conv4. By enlarging and then reducing the number of channels, the Pre-Conv Block can learn various spatial-temporal dependencies and then refine them into a compact feature map; a sketch of this block is given below.
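The following is a minimal sketch of the Pre-Conv Block under the constraints stated above. The paper fixes 3 × 3 kernels for the two main convolutions (receptive field 5) and a 1 × 1 residual shortcut with 64 kernels; the Conv4 kernel size, the output width, and the activation placement are our assumptions.

```python
import torch
import torch.nn as nn

class PreConvBlock(nn.Module):
    """Sketch of the Pre-Conv Block: Conv1/Conv2 (3x3 main branch),
    Conv3 (1x1 residual shortcut), Conv4 (channel reduction)."""

    def __init__(self, in_channels, out_channels, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels, hidden, kernel_size=1)               # residual shortcut
        self.conv4 = nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1)  # refine to a compact map
        self.act = nn.ReLU()

    def forward(self, x):
        # Main branch: two 3x3 convolutions -> local context within a 5x5 neighborhood.
        y = self.act(self.conv1(x))
        y = self.conv2(y)
        # Add the 1x1 shortcut (Conv3) to the main branch.
        y = self.act(y + self.conv3(x))
        # Conv4 reduces the channels again, producing the compact feature map.
        return self.conv4(y)
```

For a closeness branch, the input would plausibly be the sequence with time folded into channels (in_channels = 2 · t_c), though the paper does not spell this reshaping out.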
Vision transformer. We apply a vision transformer (ViT) [4] after the Pre-Conv Block to capture global dependencies, as shown on the right of Fig. 1. The ViT comprises two main components: a "patchify" stem and a transformer encoder. The patchify stem spatially splits the input feature map into non-overlapping p × p patches and linearly projects the patches into tokens; each token contains the information of a patch of regions. The tokens are then fused with a learnable positional encoding to preserve the 2D positional information and fed into the transformer encoder. The encoder applies a multi-head self-attention mechanism to model long-range dependencies, followed by layer normalization and a residual connection (Add & Norm); a feed-forward network (FFN) and another Add & Norm further process the tokens. Lastly, the tokens are averaged and linearly transformed to generate the outputs X̂_close and X̂_trend.

Skip Connection. Skip connections are employed to transfer similar patterns from the historical observations to the output as the base prediction. To preserve the original similar patterns in the historical data, we directly connect the inputs X_close and X_trend to the fusion layer, as shown by the blue lines in Fig. 1. Before connecting, we aggregate the historical input data along the time dimension to match the shape. For the two historical sequences X_close ∈ R^(2×t_c×H×W) and X_trend ∈ R^(2×t_t×H×W), we compute

    X̂_rc = f(X_close) ∈ R^(2×1×H×W),    (1)
    X̂_rt = f(X_trend) ∈ R^(2×1×H×W),    (2)

where X̂_rc and X̂_rt are the two residual components, and f(·) is an aggregation function R^(2×D×H×W) → R^(2×1×H×W), with D denoting the length of the historical data sequence. Here we use a summation function. Finally, the two residual components are fused in the fusion layer.

Fusion Layer. The degree of influence of the four components (i.e., the two outputs X̂_close and X̂_trend and the two residual components X̂_rc and X̂_rt) differs, and the influence also varies across regions. Therefore, to dynamically calibrate their contributions, we follow [23] and use a parametric-matrix-based fusion method, where the parameter matrices are learned from historical data. Formally,

    X̂_pred = W_c · X̂_close + W_t · X̂_trend + W_rc · X̂_rc + W_rt · X̂_rt,    (3)

where · denotes element-wise multiplication and each W is a learnable parameter matrix that measures the influence of the corresponding component. A sketch of the skip connections and the fusion follows.
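The following is a minimal sketch of Eqs. (1)-(3), assuming a (B, 2, T, H, W) tensor layout; the module name, the weight initialization, and the broadcasting scheme are our assumptions.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Sketch of the skip connections (Eqs. 1-2) and the
    parametric-matrix fusion (Eq. 3)."""

    def __init__(self, h, w):
        super().__init__()
        # One learnable weight map per component, broadcast over batch and flow
        # channels; randomly initialized here, learned from data in the paper.
        self.w_c, self.w_t, self.w_rc, self.w_rt = (
            nn.Parameter(torch.randn(1, 2, h, w)) for _ in range(4))

    def forward(self, x_close, x_trend, out_close, out_trend):
        # Eqs. (1)-(2): aggregate each historical sequence over time by summation.
        x_rc = x_close.sum(dim=2)   # (B, 2, H, W)
        x_rt = x_trend.sum(dim=2)
        # Eq. (3): element-wise weighted sum of the four components.
        return (self.w_c * out_close + self.w_t * out_trend
                + self.w_rc * x_rc + self.w_rt * x_rt)
```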
3.3. Self-supervised Learning with Stochastic Augmentation

Our stochastic augmentation aims to pretrain our model to learn general spatial-temporal features that facilitate the prediction task. The pretraining strategy is conceptually simple: we select a group of continuous traffic frames, randomly sample one frame as the prediction target, and use the rest to predict the target. This scheme expands into three cases: (1) if the last frame is selected as the target, the setting is similar to supervised training, where historical records are used to predict future data; (2) if the earliest frame is the target, future observations are used to predict the past frame, as shown in the green axis of Fig. 1; (3) if any intermediate frame is selected as the target, the historical data and future observations together are used to predict the present, as shown in the red axis of Fig. 1. Different from the downstream prediction task, where input historical records and future data are paired as training samples, our stochastic augmentation produces several times more samples for pretraining by randomly constructing input-target pairs. For example, given a group of five frames, supervised learning gives only one training sample, as stated in case (1), while our stochastic augmentation paradigm yields five pretraining samples (every frame in the group is selected as the target once), five times more than supervised training. With this large amount of pretraining samples, our model can explore useful spatial-temporal representations for the downstream prediction task. Specifically, for the traffic flow prediction task, we define the group as the union of the closeness data, the trend data, and the prediction target: X_group = X_close ∪ X_trend ∪ {x_n}. We then randomly sample one snapshot as the target α and take the rest, Ω = X_group − {α}, as the input, constructing the pre-training pair (Ω, α) to pre-train our model. The procedure is depicted in Alg. 1.

Algorithm 1: The pre-training procedure with stochastic augmentation.
Input: model f_θ; closeness data X_close; trend data X_trend; future data x_n.
Output: the pre-trained model f_θ.
repeat
    X_group ← X_close ∪ X_trend ∪ {x_n}
    target α ← RandomSampling(X_group)
    remaining snapshots Ω ← X_group − {α}
    construct the pre-training pair (Ω, α)
    prediction ŷ ← f_θ(Ω)
    loss ← MSELoss(ŷ, α)
    backprop(loss); update f_θ
until the stop criterion is met
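The following is a minimal sketch of one pretraining step from Alg. 1. It assumes a model wrapper that maps the remaining frames to a single snapshot; ST-TSNet itself takes separate closeness and trend branches, so a thin adapter would be needed in practice, and the function name and tensor layout are our assumptions.

```python
import random
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, x_close, x_trend, x_n):
    """One stochastic-augmentation step: pick a random snapshot from the
    group as the target alpha and regress it from the remaining frames.

    Shapes: x_close (B, 2, t_c, H, W), x_trend (B, 2, t_t, H, W), x_n (B, 2, H, W).
    """
    # Form the group X_close U X_trend U {x_n} along the time dimension.
    group = torch.cat([x_close, x_trend, x_n.unsqueeze(2)], dim=2)  # (B, 2, T, H, W)
    idx = random.randrange(group.size(2))                           # target alpha
    target = group[:, :, idx]                                       # (B, 2, H, W)
    # Omega: everything except the sampled target.
    rest = torch.cat([group[:, :, :idx], group[:, :, idx + 1:]], dim=2)
    pred = model(rest)                  # assumed: maps Omega -> one snapshot
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```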
4. Experiments

4.1. Dataset and Evaluation

Dataset. Our experiments are based on two traffic flow datasets, TaxiBJ and TaxiNYC. Additional external data, including DayOfWeek, Weekday/Weekend, holidays, and meteorological data (i.e., temperature, wind speed, and weather), are processed into a one-hot vector. There are 20,016 constructed samples in TaxiBJ and 41,856 in TaxiNYC.

• TaxiBJ [23]: a citywide crowd flow dataset collected every half hour in Beijing. Based on the geographic area of Beijing, we partition the city into 32 × 32 regions.
• TaxiNYC [16]: the taxi trip record dataset collected every hour in New York City. New York City is divided into 16 × 8 regions based on the longitude and latitude.¹

¹ The raw records are available at the NYC government website. A processed version for experiments is available at github.

Evaluation Metric. Three metrics are used to evaluate our proposed method: Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Absolute Percentage Error (APE). Following previous works, we compute the metrics only on traffic flow values larger than 10 to ensure a fair comparison. We conducted each experiment ten times and report the means and standard deviations of the results.

4.2. Implementation Details

Min-Max normalization is applied to scale the data to the range [−1, 1]; the predicted target is denormalized back to the original value. We use the last 28 days as the test set for both datasets, and the remaining data is used for training. During training, we select 90% of the training data for training the models and use the remaining 10% as the validation set to early-stop the training algorithm. Our model is implemented and trained with PyTorch. We use Adam [25] as the optimizer, with a learning rate of 0.001 for TaxiBJ and 0.005 for TaxiNYC; cosine learning rate decay adjusts the learning rate at each iteration. The batch size is 128 for both TaxiBJ and TaxiNYC. We run our model for 600 epochs on TaxiBJ and 800 epochs on TaxiNYC. Our ViT has two blocks; the patch size is set to (8, 8), the token dimension to 128, the number of attention heads to 2, and the size of the FFN to 512.

4.3. Quantitative Comparison

Table 1 shows the comparison against state-of-the-art methods. We compare our ST-TSNet with the following baselines: HA, ST-ResNet [23], MST3D [12], ST-3DNet [2], 3D-CLoST [14], STAR [24], PredCNN [15], and STREED-Net [16]. The results of the baselines are taken from [16].

Table 1: Performance comparison of different methods on TaxiBJ and TaxiNYC (mean ± std over ten runs).

TaxiBJ
  Model              RMSE           MAPE (%)       APE
  HA                 40.93          30.96          6.77E+07
  ST-ResNet [23]     17.56 ± 0.91   15.74 ± 0.94   4.81E+07 ± 3.03E+05
  MST3D [12]         21.34 ± 0.55   22.02 ± 1.40   4.81E+07 ± 3.03E+05
  ST-3DNet [2]       17.29 ± 0.42   15.64 ± 0.52   3.43E+07 ± 1.13E+06
  3D-CLoST [14]      17.10 ± 0.23   16.22 ± 0.20   3.55E+07 ± 4.39E+05
  STAR [24]          16.25 ± 0.40   15.40 ± 0.62   3.38E+07 ± 1.36E+06
  PredCNN [15]       17.42 ± 0.12   15.69 ± 0.17   3.43E+07 ± 3.76E+05
  STREED-Net [16]    15.61 ± 0.11   14.73 ± 0.21   3.22E+07 ± 4.51E+05
  ST-TSNet (ours)    16.04 ± 0.08   14.63 ± 0.05   3.20E+07 ± 1.05E+05

TaxiNYC
  Model              RMSE           MAPE (%)       APE
  HA                 164.31         27.19          7.94E+05
  ST-ResNet [23]     35.87 ± 0.60   22.52 ± 3.43   6.57E+05 ± 1.00E+05
  MST3D [12]         48.91 ± 1.98   23.98 ± 1.30   6.98E+05 ± 1.34E+04
  ST-3DNet [2]       41.62 ± 3.44   25.75 ± 6.11   7.52E+05 ± 1.78E+05
  3D-CLoST [14]      48.17 ± 3.16   22.18 ± 1.05   6.48E+05 ± 3.08E+04
  STAR [24]          36.44 ± 0.88   25.36 ± 5.24   7.41E+05 ± 1.53E+05
  PredCNN [15]       40.91 ± 0.51   25.65 ± 2.16   7.49E+05 ± 6.32E+04
  STREED-Net [16]    36.22 ± 0.72   20.29 ± 1.48   5.93E+05 ± 4.31E+04
  ST-TSNet (ours)    34.34 ± 0.32   15.68 ± 0.09   4.58E+05 ± 2.52E+03

On TaxiBJ, our method exceeds the state-of-the-art STREED-Net in terms of MAPE and APE and achieves comparable results in RMSE. On TaxiNYC, our method significantly outperforms the strongest baseline, ST-ResNet, across all metrics by a fair margin (improvements of 1.53 RMSE, 4.61 MAPE, and 1.35E+05 APE).

ST-TSNet shows a more significant performance improvement on TaxiNYC than on TaxiBJ. A possible reason is that TaxiNYC contains twice as much data as TaxiBJ (41,856 vs. 20,016 samples), which significantly facilitates the pre-training. This result demonstrates the effectiveness of the self-supervised learning module proposed in our method. STREED-Net and STAR perform impressively on TaxiBJ against the other baselines due to their simple single-branch design. However, such a simple architecture performs worse than ours on the larger TaxiNYC dataset (1.88 RMSE higher than our ST-TSNet), as it contains rich spatial-temporal information that a single-branch structure cannot extract effectively. Although STREED-Net and PredCNN both introduce a cascading hierarchical structure in their backbones, STREED-Net performs better because it additionally introduces channel and spatial attention mechanisms that dynamically refine the learned features to generate predictions. Nevertheless, the cascading hierarchical structure still suffers from short-range bias, as it only allows distant snapshots to interact at higher layers. ST-ResNet, STAR, and PredCNN introduce 2D convolutional layers, while MST3D, ST-3DNet, and 3D-CLoST employ 3D convolution. The 3D convolutional layer outperforms its 2D counterpart, as it can additionally capture temporal features, whereas 2D convolutions are restricted to spatial features. However, they all suffer from short-range bias due to the small receptive field of convolution. Moreover, they introduce neither the skip connection nor any additional pre-training strategy, resulting in inferior performance.

4.4. Qualitative Analysis

We offer four intuitive visualizations of the proposed methods to explain their behaviors in Fig. 2.

[Figure 2: Qualitative analysis of our methods. (a) Comparing the predicted results of each method at different time slots. (b) Visualizing a prediction sample for each method and (e) showing the absolute errors of these predictions. (c) Illustrating the self-attention scores of four corner patches (pentagram-marked) over the other patches, revealing that they attend to remote patches (brighter color) for long-range spatial dependencies. (d) Visualizing the inflow and outflow weights (W_rc and W_rt) of the two residual components in the fusion layer; high-flow regions usually have a higher weight.]

Fig. 2(a) compares the predictions of each method at different time intervals; the magnified subplot reveals that our method is more accurate in predicting the peaks. Fig. 2(b) spatially visualizes a prediction sample for each method, and Fig. 2(e) displays the absolute errors of these predictions, demonstrating that ST-TSNet has lower prediction errors than the baselines. Fig. 2(c) shows the self-attention map for four reference patches. The visualizations are produced from the attention scores computed via the query-key product in the ViT; a sketch of this computation is given below. We use the pentagram-marked regions as queries and show which patches (regions) they attend to. The four corner patches usually attend to remote regions (brighter color meaning higher attention scores) while caring less about their neighbors. The reason is that the short-range features are already captured and encoded into tokens by the Pre-Conv Block, letting the ViT focus on the long-range features. Fig. 2(d) visualizes the inflow and outflow weights of the two residual components. Combined with the ground truth in Fig. 2(c), we observe that although the weights vary across regions and differ between inflow and outflow, they tend to concentrate on regions with higher traffic flow. The reason is that these regions show a more regular time series and therefore have more similar patterns in the residual components.
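As a companion to this analysis, the sketch below shows one way such attention maps can be read out; the function, the reference-patch index, and the assumption of direct access to one head's query/key projections are ours, not code from the paper.

```python
import torch

def attention_map(q, k, reference=0):
    """Sketch of how the Fig. 2(c) maps could be produced: softmax-normalized,
    scaled query-key products for a chosen reference token.

    q, k: (num_tokens, dim) query/key projections from one ViT attention head.
    """
    scores = q @ k.transpose(0, 1) / (q.size(-1) ** 0.5)  # (T, T) scaled dot products
    attn = torch.softmax(scores, dim=-1)
    # Attention of the reference patch (e.g., a pentagram-marked corner) over all patches.
    return attn[reference]
```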
4.5. Ablation Study

To verify the effectiveness of the proposed methods, we design a list of variants by appending modules step by step and comparing them on TaxiBJ and TaxiNYC. The basic variant is the vision transformer (ViT). We separately append the skip connection (SC), the Pre-Conv Block (PC), and stochastic augmentation pre-training (SA) to the ViT to construct the other variants. We further consider external factors in ST-TSNet (ST-TSNet (w Ext)): following [5], an external module (a two-layer multilayer perceptron) models the external features, which are transformed and added to the main output to yield the prediction.

Table 2: Ablation study of sub-modules in ST-TSNet.

TaxiBJ
  Variant               RMSE           MAPE (%)       APE
  ViT                   20.16          34.68          7.60E+07
  ViT + SC              17.12 ± 0.35   15.56 ± 0.29   3.41E+07 ± 6.29E+05
  PC + SC               19.17 ± 0.05   29.16 ± 1.14   6.39E+07 ± 2.50E+06
  ViT + PC              16.34 ± 0.21   14.70 ± 0.13   3.22E+07 ± 2.86E+05
  ViT + PC + SC         16.14 ± 0.16   14.62 ± 0.06   3.20E+07 ± 1.38E+05
  ViT + PC + SC + SA    16.07 ± 0.06   14.68 ± 0.08   3.22E+07 ± 1.72E+05
  ST-TSNet (w Ext)      16.04 ± 0.08   14.63 ± 0.05   3.21E+07 ± 1.05E+05

TaxiNYC
  Variant               RMSE           MAPE (%)       APE
  ViT                   51.82          96.52          2.12E+08
  ViT + SC              57.45 ± 5.39   22.99 ± 2.59   6.71E+07 ± 7.57E+05
  PC + SC               37.36 ± 0.32   49.24 ± 1.94   1.08E+08 ± 4.25E+06
  ViT + PC              37.29 ± 2.88   16.83 ± 0.24   4.91E+07 ± 7.11E+04
  ViT + PC + SC         34.87 ± 0.39   16.18 ± 0.20   4.72E+07 ± 5.71E+04
  ViT + PC + SC + SA    34.47 ± 0.23   15.90 ± 0.08   4.64E+07 ± 2.43E+04
  ST-TSNet (w Ext)      34.34 ± 0.32   15.68 ± 0.09   4.58E+07 ± 2.52E+05

The results in Table 2 show that: 1) the full version of our method (i.e., ST-TSNet (w Ext)) achieves the best performance; and 2) adding each module step by step progressively improves the performance, suggesting that each module is an indispensable component of ST-TSNet.

We additionally study the strategy of the skip connection by introducing a new residual component, the Pre-Conv Block output Y_conv. We investigate two connection strategies: additionally and solely connecting Y_conv to the fusion layer. The results show that both strategies degrade performance (by 1.66 and 1.35 RMSE, respectively), suggesting that Y_conv is harmful for prediction. The degradation may be caused by the convolutional operations in the Pre-Conv Block disrupting the semantic information in the historical data (e.g., the traffic distributions), so that Y_conv and the predicted target follow different distributions. In contrast, the historical records (X_trend and X_close) and the predicted target are collected from the same distribution and are temporally correlated; thus the historical records share similar patterns with the predicted target that directly contribute to the prediction, while Y_conv confuses the model.

5. Conclusion

In this paper, we presented a novel traffic prediction framework, the Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet), for learning spatial-temporal features. ST-TSNet is equipped with a Pre-Conv Block and a ViT to capture local and global spatial dependencies. In addition, we observe the similarity in traffic flow data, which enables us to take advantage of the historical data as the base prediction for the future. Finally, we propose a pretext task named stochastic augmentation that enables models to further explore spatial-temporal representations under limited data. Experiments on two datasets demonstrate the superiority of our proposed methods.

Acknowledgments

This research was funded by the National Natural Science Foundation of China under Grant No. 62062033 and the Natural Science Foundation of Jiangxi Province under Grant No. 20212BAB202008. Zhangzhi, in particular, would like to thank his father Jianhua Peng and mother Changmei Zhang for countless love and support while this work was developed. I love you all.

References

[1] J. Zhang, Y. Zheng, D. Qi, Deep spatio-temporal residual networks for citywide crowd flows prediction, in: Proc. of AAAI, 2017.
[2] S. Guo, Y. Lin, S. Li, Z. Chen, H. Wan, Deep spatial-temporal 3d convolutional neural networks for traffic data forecasting, IEEE Transactions on Intelligent Transportation Systems (2019).
[3] S. Fang, Q. Zhang, G. Meng, S. Xiang, C. Pan, GSTNet: Global spatial-temporal network for traffic flow prediction, in: Proc. of IJCAI, 2019.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: Proc. of ICLR, 2021.
[5] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, DNN-based prediction model for spatio-temporal data, in: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2016.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of ACL, 2019.
[7] X. Yin, G. Wu, J. Wei, Y. Shen, H. Qi, B. Yin, Deep learning on traffic prediction: Methods, analysis and future directions, IEEE Transactions on Intelligent Transportation Systems (2021).
[8] B. M. Williams, L. A. Hoel, Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results, Journal of Transportation Engineering (2003).
[9] J. Guo, W. Huang, B. M. Williams, Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification, Transportation Research Part C: Emerging Technologies (2014).
[10] R. Salakhutdinov, Deep learning, in: Proc. of KDD, 2014.
[11] Y. Lv, Y. Duan, W. Kang, Z. Li, F. Wang, Traffic flow prediction with big data: A deep learning approach, IEEE Transactions on Intelligent Transportation Systems (2015).
[12] C. Chen, K. Li, S. Teo, G. Chen, X. Zou, X. Yang, R. Vijay, J. Feng, Z. Zeng, Exploiting spatio-temporal correlations with multiple 3d convolutional neural networks for citywide vehicle flow prediction, in: 2018 IEEE International Conference on Data Mining (ICDM), 2018.
[13] Z. Zhao, W. Chen, X. Wu, P. C. Y. Chen, J. Liu, LSTM network: a deep learning approach for short-term traffic forecast, IET Intelligent Transport Systems (2017).
[14] S. Fiorini, G. Pilotti, M. Ciavotta, A. Maurino, 3D-CLoST: A CNN-LSTM approach for mobility dynamics prediction in smart cities, in: 2020 IEEE International Conference on Big Data (Big Data), 2020.
[15] Z. Xu, Y. Wang, M. Long, J. Wang, PredCNN: Predictive learning with cascade convolutions, in: Proc. of IJCAI, 2018.
[16] S. Fiorini, M. Ciavotta, A. Maurino, Listening to the city, attentively: A spatio-temporal attention boosted autoencoder for the short-term flow prediction problem, arXiv preprint (2021).
[17] R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: Proc. of ECCV, 2016.
[18] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, A. A. Efros, Context encoders: Feature learning by inpainting, in: Proc. of CVPR, 2016.
[19] I. Misra, L. van der Maaten, Self-supervised learning of pretext-invariant representations, in: Proc. of CVPR, 2020.
[20] M. Noroozi, P. Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: Proc. of ECCV, 2016.
[21] S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, in: Proc. of ICLR, 2018.
[22] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, H. Shi, Escaping the big data paradigm with compact transformers, arXiv:2104.05704 (2021).
[23] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, T. Li, Predicting citywide crowd flows using deep spatio-temporal residual networks, Artificial Intelligence (2018).
[24] H. Wang, H. Su, STAR: A concise deep learning framework for citywide human mobility prediction, in: 2019 20th IEEE International Conference on Mobile Data Management (MDM), 2019.
[25] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proc. of ICLR, 2015.