Spatial-temporal Transformer Network with Self-supervised Learning for Traffic Flow Prediction

Zhangzhi Peng*, Xiaohui Huang
East China Jiaotong University

STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria
* Corresponding author: pengzhangzhics@gmail.com (Z. Peng); hxh016@gmail.com (X. Huang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract

Traffic flow prediction plays a critical role in improving the quality, security, and efficiency of Intelligent Transportation Systems (ITS). Accurate prediction requires modeling spatial and temporal characteristics simultaneously. Existing works usually extract spatial features with CNN-based modules and temporal features with RNN-based modules. However, CNN-based modules are locally biased and perform poorly on global spatial dependencies, while RNN-based modules concentrate on learning high-level temporal dynamics (e.g., periodicity) and fail to exploit the numerical closeness between future data and historical observations as strong prior knowledge for the prediction. To alleviate these limitations, we propose a Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet). ST-TSNet uses a Pre-Conv Block and a vision transformer to learn spatial dependencies in both local and global contexts. Furthermore, a skip connection from the input historical records to the output prediction is introduced to exploit similar patterns and improve the prediction results. Finally, a self-supervised strategy called stochastic augmentation is proposed to explore spatial-temporal representations from massive traffic data to benefit the prediction task. Experiments on two datasets, TaxiBJ and TaxiNYC, demonstrate the effectiveness of ST-TSNet. The code is available at https://github.com/pengzhangzhi/spatial-temporal-transformer.

1. Introduction

Traffic flow prediction is a building block of Intelligent Transportation Systems (ITS) and is essential for providing high-quality traffic services. An accurate prediction of future traffic flow depends on modeling the spatial-temporal information in the previous observations. This problem can be considered from the spatial and temporal perspectives. From the spatial perspective, learning local spatial correlations is essential, since traffic volume is most influenced by its nearest neighbors. However, in real-world scenarios, two distant regions may also be strongly correlated in their traffic distributions if they serve similar functions (e.g., transportation hubs). Most existing works [1, 2, 3] adopt convolutional layers as their backbone to extract spatial features, which may introduce a short-range bias due to their small receptive fields.
These methods perform well at extracting local context but struggle with global dependencies. Recently, the Vision Transformer (ViT) [4] has shown impressive performance in computer vision due to its innate power at extracting non-local features. We are therefore motivated to apply ViT to learn long-range spatial dependencies.

From the temporal perspective, many works have been proposed to extract complex temporal patterns, e.g., daily and weekly periodicity [1, 2]. However, we argue that a simple temporal characteristic, temporal similarity, is overlooked. Traffic flow data are generally smooth, with few abrupt changes, showing many similarities in adjacent frames. As depicted in the time series of Fig. 1, the ratio of the current traffic flow to the previous one (blue line) floats up and down around a fixed value of 1 as the traffic flow (orange line) periodically evolves. This means that adjacent traffic flow snapshots have close values and exhibit similar distributions. Thus, an intuitive idea is to use the historical observations as the base prediction for the future data. This motivation provides prior knowledge that forces the model to predict the future data partially from the original historical records instead of depending completely on the extracted temporal patterns. However, such similarity is overlooked in existing methods [5, 2], as they process the historical data for high-order temporal characteristics (e.g., periodicity), distorting the numerical similarity.

With the rapid growth of deployed traffic sensors, a massive amount of traffic flow data is collected but not fully utilized. Similarly, in the field of natural language processing (NLP), TB-level unlabeled corpora are collected, while relatively little labeled data is available for the various language tasks. In NLP, however, this gap is successfully alleviated by self-supervised learning [6], in which unlabeled data are used to learn language representations that are then transferred to facilitate downstream tasks. In the field of traffic flow prediction, by contrast, current training algorithms are supervised: the historical records are regarded as the input, and the traffic data at the next timestamp serves as the label. No effective unsupervised learning algorithms have been proposed to learn spatial-temporal representations that facilitate the traffic flow prediction task.

Driven by these analyses, we propose a novel framework called Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet). ST-TSNet consists of a Pre-Conv Block and a ViT for learning spatial correlations in both local and global contexts. In addition, we directly connect the historical data to the output to make full use of the historical data as base predictions. Lastly, a self-supervised task named stochastic augmentation is proposed to pre-train ST-TSNet to learn spatial-temporal representations, which are then fine-tuned to benefit the prediction task.

[Figure 1: The overall architecture of the Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet). The three time axes illustrate our pre-training strategy. The time series at the bottom shows the periodicity of traffic flow data; the blue line denotes the ratio, and the orange line denotes the normalized traffic flow observations. The figure reveals that as the traffic flow periodically changes, the ratio floats up and down around a fixed value of 1.]

The contributions of this work are summarized as follows.

• We propose a novel framework, the Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet), to capture spatial-temporal features.
• We employ a simple yet effective skip connection strategy, plugged into ST-TSNet, to make full use of the temporal similarities in traffic flow data.
• We introduce self-supervised learning to our framework and design a pre-training task called stochastic augmentation to explore spatial-temporal features and boost the traffic flow prediction task.
• We conduct extensive experiments on two benchmarks (TaxiBJ and TaxiNYC) to evaluate the effectiveness of our methods; the results show that ST-TSNet outperforms state-of-the-art methods.
2. Related Work

Traffic Flow Prediction. There are two types of flow data in the traffic flow prediction task, grid-like raster data and graph data, and thus two distinct paradigms are derived for the two types of data [7]. In our work, we focus on raster data. Existing mainstream traffic prediction methods for raster data fall into one of two classes: statistical methods or deep learning methods. Statistical methods include the auto-regressive integrated moving average (ARIMA) [8], Kalman filtering [9], and the historical average. These methods often require strong theoretical assumptions, which may conflict with the nonlinearity of traffic flow data and thus lead to poor performance in the real world. Recent advances have witnessed the impressive capacity of deep learning to extract nonlinear features from big data [10], inspiring many researchers to apply deep learning to the traffic flow prediction task. Existing deep learning methods are based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [11]. ST-ResNet [1] first employs CNNs with residual connections to learn the spatial dependencies and organizes the historical data into different branches according to their temporal semantics to learn temporal features. Similar ideas are adopted by subsequent works [2, 12], in which 3D convolution is used to learn spatial-temporal dependencies. Moreover, RNN-based models [13, 14] use convolutional layers to capture spatial features and a sequential hierarchy (e.g., LSTM and GRU) to extract temporal patterns. However, these methods are time-consuming, as they make predictions step by step, and may suffer from gradient vanishing or explosion when capturing long-range sequences [15]. To alleviate these problems, [15, 16] discard the recurrent chain structure and employ a cascade multiplicative unit (CMU) with autoencoders while preserving convolutional layers for learning spatial features. The methods used by existing works can thus be considered from the spatial and sequential perspectives. From the spatial perspective, convolutional layers are the mainstream, including 2D and 3D convolution. From the sequential perspective, there are many choices, including RNN, GRU, LSTM, and CMU. Most existing works are a combination of these methods. In summary, existing methods based on CNNs suffer from short-range bias, as the small receptive field limits their capacity to extract global dependencies.

Self-supervised Learning. Self-supervised learning is an effective way to extract training signals from massive amounts of unlabelled data and to learn general representations that facilitate downstream tasks for which labelled data are limited. To generate supervision information from the data itself, a general strategy is to define pre-training tasks for models [17, 18] to learn semantic representations, which are then transferred to downstream tasks to improve performance and robustness. Many works in computer vision have defined such tasks based on heuristics [19, 20]. For example, [21] learns visual representations by predicting image rotations. In natural language processing, masked language modeling, e.g., BERT [6], has been shown to be excellent for pre-training language models. These methods mask a portion of the input sequence and train models to predict the missing content from the rest; they are effective for learning semantic correlations of elements within a sequence, e.g., a sentence. Traffic flow data can also be viewed as a temporal sequence, yet the effectiveness of self-supervised learning remains unexplored in the traffic flow prediction task.
3. Methods

3.1. Problem Formulation

We partition a city into an image-like grid map according to the longitude and latitude, as shown in the traffic flow map of Fig. 1, where each grid denotes a region. The value of a grid denotes the traffic flow (inflow or outflow). The devices deployed in a region periodically record the number of people arriving at and departing from the location to collect the inflow and outflow. The traffic flow map of the entire city at time t is denoted as x_t ∈ R^(2×H×W), where 2 refers to the inflow and outflow, and H and W denote the number of rows and columns of the grid map, respectively. The purpose of traffic flow prediction is to predict x_n given the historical traffic flow records X_his = {x_t | t = 0, ..., n−1}. As shown in Fig. 1, the historical data is summarized into two categories along the time axis. The closeness sequence X_close = {x_(n−1), x_(n−2), ..., x_(n−(t_c−1)), x_(n−t_c)} ∈ R^(2×t_c×H×W) is a concatenation of recent historical data, where t_c is the length of the closeness sequence. The trend sequence X_trend = {x_(n−L_week), x_(n−2·L_week), ..., x_(n−t_t·L_week)} ∈ R^(2×t_t×H×W) is a concatenation of periodic historical data from the past few weeks, where t_t is the length of the trend sequence and L_week is the number of intervals within a week.
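To make the tensor shapes concrete, the following is a minimal sketch (ours, not the authors' released code) of how the two sequences could be sliced from the raw series. The helper name and the default lengths are hypothetical; L_week = 48 · 7 = 336 corresponds to half-hour intervals as in TaxiBJ.

```python
import torch

def build_sequences(flows, n, t_c=4, t_t=2, l_week=336):
    """Hypothetical helper: slice closeness and trend sequences for target index n.

    flows: tensor of shape (T, 2, H, W) holding the full series of traffic maps.
    Assumes n is large enough that all referenced indices are non-negative.
    """
    # Closeness: the t_c most recent snapshots x_(n-1), ..., x_(n-t_c).
    close = torch.stack([flows[n - i] for i in range(1, t_c + 1)], dim=1)           # (2, t_c, H, W)
    # Trend: snapshots exactly 1, 2, ..., t_t weeks before the target.
    trend = torch.stack([flows[n - i * l_week] for i in range(1, t_t + 1)], dim=1)  # (2, t_t, H, W)
    return close, trend, flows[n]  # the two inputs and the prediction target x_n
```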
3.2. Spatial-temporal Transformer Network

Overall, we employ a symmetric structure for handling the trend data X_trend and the closeness data X_close: a Pre-Conv Block followed by a ViT, with two shortcuts (i.e., the two blue lines shown in Fig. 1) from the inputs to the fusion layer. In the end, the fusion layer adaptively merges four components (the two residual components X̂_rc and X̂_rt, and the two outputs X̂_close and X̂_trend) to generate the prediction x̂_n.

Pre-Conv Block. The traffic flow in a region is highly relevant to its nearby regions. We design a Pre-Conv Block to capture such short-range dependencies. As illustrated in Fig. 1, Conv1 and Conv2 are the main convolutional layers that capture short-range dependencies. We employ a small kernel size (i.e., 3 × 3), which leads to a receptive field of 5. This design ensures that the Pre-Conv Block only captures local dependencies within at most 5 × 5 regions; the short-range dependencies are thus well captured by the Pre-Conv Block, leaving the long-range features to the vision transformer. Inserting CNNs before a ViT has been shown to be effective in strengthening the capacity of the ViT [22]. Conv3 is the residual shortcut, employing 64 kernels of size 1 × 1, whose output is added to the main branch as a residual component. Generally, we use many more kernels (e.g., 64) in these layers than in Conv4. By enlarging and then reducing the number of channels, the Pre-Conv Block can learn various spatial-temporal dependencies and then refine them into a compact feature map; a sketch of this block is given below.
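The following is a minimal sketch of the Pre-Conv Block under the constraints stated above. The paper fixes 3 × 3 kernels for the two main convolutions (receptive field 5) and a 1 × 1 residual shortcut with 64 kernels; the Conv4 kernel size, the output width, and the activation placement are our assumptions.

```python
import torch
import torch.nn as nn

class PreConvBlock(nn.Module):
    """Sketch of the Pre-Conv Block: Conv1/Conv2 (3x3 main branch),
    Conv3 (1x1 residual shortcut), Conv4 (channel reduction)."""

    def __init__(self, in_channels, out_channels, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels, hidden, kernel_size=1)               # residual shortcut
        self.conv4 = nn.Conv2d(hidden, out_channels, kernel_size=3, padding=1)  # refine to a compact map
        self.act = nn.ReLU()

    def forward(self, x):
        # Main branch: two 3x3 convolutions -> local context within a 5x5 neighborhood.
        y = self.act(self.conv1(x))
        y = self.conv2(y)
        # Add the 1x1 shortcut (Conv3) to the main branch.
        y = self.act(y + self.conv3(x))
        # Conv4 reduces the channels again, producing the compact feature map.
        return self.conv4(y)
```

For a closeness branch, the input would plausibly be the sequence with time folded into channels (in_channels = 2 · t_c), though the paper does not spell this reshaping out.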
Vision transformer. We apply a vision transformer (ViT) [4] after the Pre-Conv Block to capture global dependencies, as shown on the right of Fig. 1. The ViT comprises two main components: a "patchify" stem and a transformer encoder. The patchify stem spatially splits the input feature map into non-overlapping p × p patches and linearly projects the patches into tokens; each token contains the information of a patch of regions. The tokens are then fused with a learnable positional encoding to preserve the 2D positional information and fed into the transformer encoder. The encoder applies a multi-head self-attention mechanism to model long-range dependencies, followed by layer normalization and a residual connection (Add & Norm); a feed-forward network (FFN) and another Add & Norm further process the tokens. Lastly, the tokens are averaged and linearly transformed to generate the outputs X̂_close and X̂_trend.

Skip Connection. Skip connections are employed to transfer similar patterns from the historical observations to the output as the base prediction. To preserve the original similar patterns in the historical data, we directly connect the inputs X_close and X_trend to the fusion layer, as shown by the blue lines in Fig. 1. Before connecting, we aggregate the historical input data along the time dimension to match the shape. For the two historical sequences X_close ∈ R^(2×t_c×H×W) and X_trend ∈ R^(2×t_t×H×W), we compute

    X̂_rc = f(X_close) ∈ R^(2×1×H×W),    (1)
    X̂_rt = f(X_trend) ∈ R^(2×1×H×W),    (2)

where X̂_rc and X̂_rt are the two residual components, and f(·) is an aggregation function R^(2×D×H×W) → R^(2×1×H×W), with D denoting the length of the historical data sequence. Here we use a summation function. Finally, the two residual components are fused in the fusion layer.

Fusion Layer. The degree of influence of the four components (i.e., the two outputs X̂_close and X̂_trend and the two residual components X̂_rc and X̂_rt) differs, and the influence also varies across regions. Therefore, to dynamically calibrate their contributions, we follow [23] and use a parametric-matrix-based fusion method, where the parameter matrices are learned from historical data. Formally,

    X̂_pred = W_c · X̂_close + W_t · X̂_trend + W_rc · X̂_rc + W_rt · X̂_rt,    (3)

where · denotes element-wise multiplication and each W is a learnable parameter matrix that measures the influence of the corresponding component. A sketch of the skip connections and the fusion follows.
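The following is a minimal sketch of Eqs. (1)-(3), assuming a (B, 2, T, H, W) tensor layout; the module name, the weight initialization, and the broadcasting scheme are our assumptions.

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Sketch of the skip connections (Eqs. 1-2) and the
    parametric-matrix fusion (Eq. 3)."""

    def __init__(self, h, w):
        super().__init__()
        # One learnable weight map per component, broadcast over batch and flow
        # channels; randomly initialized here, learned from data in the paper.
        self.w_c, self.w_t, self.w_rc, self.w_rt = (
            nn.Parameter(torch.randn(1, 2, h, w)) for _ in range(4))

    def forward(self, x_close, x_trend, out_close, out_trend):
        # Eqs. (1)-(2): aggregate each historical sequence over time by summation.
        x_rc = x_close.sum(dim=2)   # (B, 2, H, W)
        x_rt = x_trend.sum(dim=2)
        # Eq. (3): element-wise weighted sum of the four components.
        return (self.w_c * out_close + self.w_t * out_trend
                + self.w_rc * x_rc + self.w_rt * x_rt)
```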
3.3. Self-supervised Learning with Stochastic Augmentation

Our stochastic augmentation aims to pretrain our model to learn general spatial-temporal features that facilitate the prediction task. The pretraining strategy is conceptually simple: we select a group of continuous traffic frames, randomly sample one frame as the prediction target, and use the rest to predict the target. This scheme expands into three cases: (1) if the last frame is selected as the target, the setting is similar to supervised training, where historical records are used to predict future data; (2) if the earliest frame is the target, future observations are used to predict the past frame, as shown in the green axis of Fig. 1; (3) if any intermediate frame is selected as the target, the historical data and future observations together are used to predict the present, as shown in the red axis of Fig. 1. Different from the downstream prediction task, where input historical records and future data are paired as training samples, our stochastic augmentation produces several times more samples for pretraining by randomly constructing input-target pairs. For example, given a group of five frames, supervised learning gives only one training sample, as stated in case (1), while our stochastic augmentation paradigm yields five pretraining samples (every frame in the group is selected as the target once), five times more than supervised training. With this large amount of pretraining samples, our model can explore useful spatial-temporal representations for the downstream prediction task. Specifically, for the traffic flow prediction task, we define the group as the union of the closeness data, the trend data, and the prediction target: X_group = X_close ∪ X_trend ∪ {x_n}. We then randomly sample one snapshot as the target α and take the rest, Ω = X_group − {α}, as the input, constructing the pre-training pair (Ω, α) to pre-train our model. The procedure is depicted in Alg. 1.

Algorithm 1: The pre-training procedure with stochastic augmentation.
Input: model f_θ; closeness data X_close; trend data X_trend; future data x_n.
Output: the pre-trained model f_θ.
repeat
    X_group ← X_close ∪ X_trend ∪ {x_n}
    target α ← RandomSampling(X_group)
    remaining snapshots Ω ← X_group − {α}
    construct the pre-training pair (Ω, α)
    prediction ŷ ← f_θ(Ω)
    loss ← MSELoss(ŷ, α)
    backprop(loss); update f_θ
until the stop criterion is met
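The following is a minimal sketch of one pretraining step from Alg. 1. It assumes a model wrapper that maps the remaining frames to a single snapshot; ST-TSNet itself takes separate closeness and trend branches, so a thin adapter would be needed in practice, and the function name and tensor layout are our assumptions.

```python
import random
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, x_close, x_trend, x_n):
    """One stochastic-augmentation step: pick a random snapshot from the
    group as the target alpha and regress it from the remaining frames.

    Shapes: x_close (B, 2, t_c, H, W), x_trend (B, 2, t_t, H, W), x_n (B, 2, H, W).
    """
    # Form the group X_close U X_trend U {x_n} along the time dimension.
    group = torch.cat([x_close, x_trend, x_n.unsqueeze(2)], dim=2)  # (B, 2, T, H, W)
    idx = random.randrange(group.size(2))                           # target alpha
    target = group[:, :, idx]                                       # (B, 2, H, W)
    # Omega: everything except the sampled target.
    rest = torch.cat([group[:, :, :idx], group[:, :, idx + 1:]], dim=2)
    pred = model(rest)                  # assumed: maps Omega -> one snapshot
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```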
4. Experiments

4.1. Dataset and Evaluation

Dataset. Our experiments are based on two traffic flow datasets, TaxiBJ and TaxiNYC. Additional external data, including DayOfWeek, Weekday/Weekend, holidays, and meteorological data (i.e., temperature, wind speed, and weather), are processed into a one-hot vector. There are 20,016 constructed samples in TaxiBJ and 41,856 in TaxiNYC.

• TaxiBJ [23]: a citywide crowd flow dataset collected every half hour in Beijing. Based on the geographic area of Beijing, we partition the city into 32 × 32 regions.
• TaxiNYC [16]: the taxi trip record dataset collected every hour in New York City. New York City is divided into 16 × 8 regions based on the longitude and latitude.¹

¹ The raw records are available at the NYC government website. A processed version for experiments is available at github.

Evaluation Metric. Three metrics are used to evaluate our proposed method: Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Absolute Percentage Error (APE). Following previous works, we compute the metrics only on traffic flow values larger than 10 to ensure a fair comparison. We conducted each experiment ten times and report the means and standard deviations of the results.

4.2. Implementation Details

Min-Max normalization is applied to scale the data to the range [−1, 1]; the predicted target is denormalized back to the original value. We use the last 28 days as the test set for both datasets, and the remaining data is used for training. During training, we select 90% of the training data for training the models and use the remaining 10% as the validation set to early-stop the training algorithm. Our model is implemented and trained with PyTorch. We use Adam [25] as the optimizer, with a learning rate of 0.001 for TaxiBJ and 0.005 for TaxiNYC; cosine learning rate decay adjusts the learning rate at each iteration. The batch size is 128 for both TaxiBJ and TaxiNYC. We run our model for 600 epochs on TaxiBJ and 800 epochs on TaxiNYC. Our ViT has two blocks; the patch size is set to (8, 8), the token dimension to 128, the number of attention heads to 2, and the size of the FFN to 512.

4.3. Quantitative Comparison

Table 1 shows the comparison against state-of-the-art methods. We compare our ST-TSNet with the following baselines: HA, ST-ResNet [23], MST3D [12], ST-3DNet [2], 3D-CLoST [14], STAR [24], PredCNN [15], and STREED-Net [16]. The results of the baselines are taken from [16].

Table 1: Performance comparison of different methods on TaxiBJ and TaxiNYC (mean ± std over ten runs).

TaxiBJ
  Model              RMSE           MAPE (%)       APE
  HA                 40.93          30.96          6.77E+07
  ST-ResNet [23]     17.56 ± 0.91   15.74 ± 0.94   4.81E+07 ± 3.03E+05
  MST3D [12]         21.34 ± 0.55   22.02 ± 1.40   4.81E+07 ± 3.03E+05
  ST-3DNet [2]       17.29 ± 0.42   15.64 ± 0.52   3.43E+07 ± 1.13E+06
  3D-CLoST [14]      17.10 ± 0.23   16.22 ± 0.20   3.55E+07 ± 4.39E+05
  STAR [24]          16.25 ± 0.40   15.40 ± 0.62   3.38E+07 ± 1.36E+06
  PredCNN [15]       17.42 ± 0.12   15.69 ± 0.17   3.43E+07 ± 3.76E+05
  STREED-Net [16]    15.61 ± 0.11   14.73 ± 0.21   3.22E+07 ± 4.51E+05
  ST-TSNet (ours)    16.04 ± 0.08   14.63 ± 0.05   3.20E+07 ± 1.05E+05

TaxiNYC
  Model              RMSE           MAPE (%)       APE
  HA                 164.31         27.19          7.94E+05
  ST-ResNet [23]     35.87 ± 0.60   22.52 ± 3.43   6.57E+05 ± 1.00E+05
  MST3D [12]         48.91 ± 1.98   23.98 ± 1.30   6.98E+05 ± 1.34E+04
  ST-3DNet [2]       41.62 ± 3.44   25.75 ± 6.11   7.52E+05 ± 1.78E+05
  3D-CLoST [14]      48.17 ± 3.16   22.18 ± 1.05   6.48E+05 ± 3.08E+04
  STAR [24]          36.44 ± 0.88   25.36 ± 5.24   7.41E+05 ± 1.53E+05
  PredCNN [15]       40.91 ± 0.51   25.65 ± 2.16   7.49E+05 ± 6.32E+04
  STREED-Net [16]    36.22 ± 0.72   20.29 ± 1.48   5.93E+05 ± 4.31E+04
  ST-TSNet (ours)    34.34 ± 0.32   15.68 ± 0.09   4.58E+05 ± 2.52E+03

On TaxiBJ, our method exceeds the state-of-the-art STREED-Net in terms of MAPE and APE and achieves comparable results in RMSE. On TaxiNYC, our method significantly outperforms the strongest baseline, ST-ResNet, across all metrics by a fair margin (improvements of 1.53 RMSE, 4.61 MAPE, and 1.35E+05 APE).

ST-TSNet shows a more significant performance improvement on TaxiNYC than on TaxiBJ. A possible reason is that TaxiNYC contains twice as much data as TaxiBJ (41,856 vs. 20,016 samples), which significantly facilitates the pre-training. This result demonstrates the effectiveness of the self-supervised learning module proposed in our method. STREED-Net and STAR perform impressively on TaxiBJ against the other baselines due to their simple single-branch design. However, such a simple architecture performs worse than ours on the larger TaxiNYC dataset (1.88 RMSE higher than our ST-TSNet), as it contains rich spatial-temporal information that a single-branch structure cannot extract effectively. Although STREED-Net and PredCNN both introduce a cascading hierarchical structure in their backbones, STREED-Net performs better because it additionally introduces channel and spatial attention mechanisms that dynamically refine the learned features to generate predictions. Nevertheless, the cascading hierarchical structure still suffers from short-range bias, as it only allows distant snapshots to interact at higher layers. ST-ResNet, STAR, and PredCNN introduce 2D convolutional layers, while MST3D, ST-3DNet, and 3D-CLoST employ 3D convolution. The 3D convolutional layer outperforms its 2D counterpart, as it can additionally capture temporal features, whereas 2D convolutions are restricted to spatial features. However, they all suffer from short-range bias due to the small receptive field of convolution. Moreover, they introduce neither the skip connection nor any additional pre-training strategy, resulting in inferior performance.

4.4. Qualitative Analysis

We offer four intuitive visualizations of the proposed methods to explain their behaviors in Fig. 2.

[Figure 2: Qualitative analysis of our methods. (a) Comparing the predicted results of each method at different time slots. (b) Visualizing a prediction sample for each method and (e) showing the absolute errors of these predictions. (c) Illustrating the self-attention scores of four corner patches (pentagram-marked) over the other patches, revealing that they attend to remote patches (brighter color) for long-range spatial dependencies. (d) Visualizing the inflow and outflow weights (W_rc and W_rt) of the two residual components in the fusion layer; high-flow regions usually have a higher weight.]

Fig. 2(a) compares the predictions of each method at different time intervals; the magnified subplot reveals that our method is more accurate in predicting the peaks. Fig. 2(b) spatially visualizes a prediction sample for each method, and Fig. 2(e) displays the absolute errors of these predictions, demonstrating that ST-TSNet has lower prediction errors than the baselines. Fig. 2(c) shows the self-attention map for four reference patches. The visualizations are produced from the attention scores computed via the query-key product in the ViT; a sketch of this computation is given below. We use the pentagram-marked regions as queries and show which patches (regions) they attend to. The four corner patches usually attend to remote regions (brighter color meaning higher attention scores) while caring less about their neighbors. The reason is that the short-range features are already captured and encoded into tokens by the Pre-Conv Block, letting the ViT focus on the long-range features. Fig. 2(d) visualizes the inflow and outflow weights of the two residual components. Combined with the ground truth in Fig. 2(c), we observe that although the weights vary across regions and differ between inflow and outflow, they tend to concentrate on regions with higher traffic flow. The reason is that these regions show a more regular time series and therefore have more similar patterns in the residual components.
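As a companion to this analysis, the sketch below shows one way such attention maps can be read out; the function, the reference-patch index, and the assumption of direct access to one head's query/key projections are ours, not code from the paper.

```python
import torch

def attention_map(q, k, reference=0):
    """Sketch of how the Fig. 2(c) maps could be produced: softmax-normalized,
    scaled query-key products for a chosen reference token.

    q, k: (num_tokens, dim) query/key projections from one ViT attention head.
    """
    scores = q @ k.transpose(0, 1) / (q.size(-1) ** 0.5)  # (T, T) scaled dot products
    attn = torch.softmax(scores, dim=-1)
    # Attention of the reference patch (e.g., a pentagram-marked corner) over all patches.
    return attn[reference]
```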
4.5. Ablation Study

To verify the effectiveness of the proposed methods, we design a list of variants by appending modules step by step and comparing them on TaxiBJ and TaxiNYC. The basic variant is the vision transformer (ViT). We separately append the skip connection (SC), the Pre-Conv Block (PC), and stochastic augmentation pre-training (SA) to the ViT to construct the other variants. We further consider external factors in ST-TSNet (ST-TSNet (w Ext)): following [5], an external module (a two-layer multilayer perceptron) models the external features, which are transformed and added to the main output to yield the prediction.

Table 2: Ablation study of sub-modules in ST-TSNet.

TaxiBJ
  Variant               RMSE           MAPE (%)       APE
  ViT                   20.16          34.68          7.60E+07
  ViT + SC              17.12 ± 0.35   15.56 ± 0.29   3.41E+07 ± 6.29E+05
  PC + SC               19.17 ± 0.05   29.16 ± 1.14   6.39E+07 ± 2.50E+06
  ViT + PC              16.34 ± 0.21   14.70 ± 0.13   3.22E+07 ± 2.86E+05
  ViT + PC + SC         16.14 ± 0.16   14.62 ± 0.06   3.20E+07 ± 1.38E+05
  ViT + PC + SC + SA    16.07 ± 0.06   14.68 ± 0.08   3.22E+07 ± 1.72E+05
  ST-TSNet (w Ext)      16.04 ± 0.08   14.63 ± 0.05   3.21E+07 ± 1.05E+05

TaxiNYC
  Variant               RMSE           MAPE (%)       APE
  ViT                   51.82          96.52          2.12E+08
  ViT + SC              57.45 ± 5.39   22.99 ± 2.59   6.71E+07 ± 7.57E+05
  PC + SC               37.36 ± 0.32   49.24 ± 1.94   1.08E+08 ± 4.25E+06
  ViT + PC              37.29 ± 2.88   16.83 ± 0.24   4.91E+07 ± 7.11E+04
  ViT + PC + SC         34.87 ± 0.39   16.18 ± 0.20   4.72E+07 ± 5.71E+04
  ViT + PC + SC + SA    34.47 ± 0.23   15.90 ± 0.08   4.64E+07 ± 2.43E+04
  ST-TSNet (w Ext)      34.34 ± 0.32   15.68 ± 0.09   4.58E+07 ± 2.52E+05

The results in Table 2 show that: 1) the full version of our method (i.e., ST-TSNet (w Ext)) achieves the best performance; and 2) adding each module step by step progressively improves the performance, suggesting that each module is an indispensable component of ST-TSNet.

We additionally study the strategy of the skip connection by introducing a new residual component, the Pre-Conv Block output Y_conv. We investigate two connection strategies: additionally and solely connecting Y_conv to the fusion layer. The results show that both strategies degrade performance (by 1.66 and 1.35 RMSE, respectively), suggesting that Y_conv is harmful for prediction. The degradation may be caused by the convolutional operations in the Pre-Conv Block disrupting the semantic information in the historical data (e.g., the traffic distributions), so that Y_conv and the predicted target follow different distributions. In contrast, the historical records (X_trend and X_close) and the predicted target are collected from the same distribution and are temporally correlated; thus the historical records share similar patterns with the predicted target that directly contribute to the prediction, while Y_conv confuses the model.

5. Conclusion

In this paper, we presented a novel traffic prediction framework, the Spatial-temporal Transformer Network with Self-supervised Learning (ST-TSNet), for learning spatial-temporal features. ST-TSNet is equipped with a Pre-Conv Block and a ViT to capture local and global spatial dependencies. In addition, we observe the similarity in traffic flow data, which enables us to take advantage of the historical data as the base prediction for the future. Finally, we propose a pretext task named stochastic augmentation that enables models to further explore spatial-temporal representations under limited data. Experiments on two datasets demonstrate the superiority of our proposed methods.

Acknowledgments

This research was funded by the National Natural Science Foundation of China under Grant No. 62062033 and the Natural Science Foundation of Jiangxi Province under Grant No. 20212BAB202008. Zhangzhi, in particular, would like to thank his father Jianhua Peng and mother Changmei Zhang for countless love and support while this work was developed. I love you all.

References

[1] J. Zhang, Y. Zheng, D. Qi, Deep spatio-temporal residual networks for citywide crowd flows prediction, in: Proc. of AAAI, 2017.
[2] S. Guo, Y. Lin, S. Li, Z. Chen, H. Wan, Deep spatial-temporal 3d convolutional neural networks for traffic data forecasting, IEEE Transactions on Intelligent Transportation Systems (2019).
[3] S. Fang, Q. Zhang, G. Meng, S. Xiang, C. Pan, GSTNet: Global spatial-temporal network for traffic flow prediction, in: Proc. of IJCAI, 2019.
[4] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, in: Proc. of ICLR, 2021.
[5] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, DNN-based prediction model for spatio-temporal data, in: Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2016.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of ACL, 2019.
[7] X. Yin, G. Wu, J. Wei, Y. Shen, H. Qi, B. Yin, Deep learning on traffic prediction: Methods, analysis and future directions, IEEE Transactions on Intelligent Transportation Systems (2021).
[8] B. M. Williams, L. A. Hoel, Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results, Journal of Transportation Engineering (2003).
[9] J. Guo, W. Huang, B. M. Williams, Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification, Transportation Research Part C: Emerging Technologies (2014).
[10] R. Salakhutdinov, Deep learning, in: Proc. of KDD, 2014.
[11] Y. Lv, Y. Duan, W. Kang, Z. Li, F. Wang, Traffic flow prediction with big data: A deep learning approach, IEEE Transactions on Intelligent Transportation Systems (2015).
[12] C. Chen, K. Li, S. Teo, G. Chen, X. Zou, X. Yang, R. Vijay, J. Feng, Z. Zeng, Exploiting spatio-temporal correlations with multiple 3d convolutional neural networks for citywide vehicle flow prediction, in: 2018 IEEE International Conference on Data Mining (ICDM), 2018.
[13] Z. Zhao, W. Chen, X. Wu, P. C. Y. Chen, J. Liu, LSTM network: a deep learning approach for short-term traffic forecast, IET Intelligent Transport Systems (2017).
[14] S. Fiorini, G. Pilotti, M. Ciavotta, A. Maurino, 3D-CLoST: A CNN-LSTM approach for mobility dynamics prediction in smart cities, in: 2020 IEEE International Conference on Big Data (Big Data), 2020.
[15] Z. Xu, Y. Wang, M. Long, J. Wang, PredCNN: Predictive learning with cascade convolutions, in: Proc. of IJCAI, 2018.
[16] S. Fiorini, M. Ciavotta, A. Maurino, Listening to the city, attentively: A spatio-temporal attention boosted autoencoder for the short-term flow prediction problem, arXiv preprint (2021).
[17] R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: Proc. of ECCV, 2016.
[18] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, A. A. Efros, Context encoders: Feature learning by inpainting, in: Proc. of CVPR, 2016.
[19] I. Misra, L. van der Maaten, Self-supervised learning of pretext-invariant representations, in: Proc. of CVPR, 2020.
[20] M. Noroozi, P. Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: Proc. of ECCV, 2016.
[21] S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, in: Proc. of ICLR, 2018.
[22] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, H. Shi, Escaping the big data paradigm with compact transformers, arXiv:2104.05704 (2021).
[23] J. Zhang, Y. Zheng, D. Qi, R. Li, X. Yi, T. Li, Predicting citywide crowd flows using deep spatio-temporal residual networks, Artificial Intelligence (2018).
[24] H. Wang, H. Su, STAR: A concise deep learning framework for citywide human mobility prediction, in: 2019 20th IEEE International Conference on Mobile Data Management (MDM), 2019.
[25] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proc. of ICLR, 2015.