Traffic Forecasting Using PaddlePaddle*

Elena Akimova1,2, Alexander Chernoskutov2, Rostislav Kosivets2, and Alexander Volkanin2

1 Krasovskii Institute of Mathematics and Mechanics, Yekaterinburg, Russia
2 Ural Federal University, Yekaterinburg, Russia
aen15@yandex.ru

Abstract. The traffic forecasting problem is considered, and a new traffic prediction algorithm is designed. The algorithm, based on an original deep neural network model, is implemented with the PaddlePaddle deep learning framework and uses a long short-term memory layer to improve the prediction accuracy. All experiments were performed on the Ural Federal University cluster with Nvidia Tesla K20 GPUs.

Keywords: forecasting · deep learning · LSTM · PaddlePaddle · GPU

1 Introduction

Nowadays, real-time high-fidelity spatio-temporal data on the transportation networks of the overwhelming majority of major cities have become available. Such datasets can be obtained from loop detectors, mobile phones, traffic cameras, and self-driving cars. This gold mine of data can be used to learn traffic behavior at different times and locations. Solving the traffic congestion prediction problem can potentially result in major savings of time and fuel, but it is still a challenge to construct such a prediction based on the spatio-temporal relations.

In April 2017, the Asia Supercomputer Community and the Inspur Group hosted the annual international event "ASC Student Supercomputer Challenge", whose main purpose was [1] "to challenge and inspire the next generation of HPC scientists and engineers to deliver innovative solutions". Every year, one of the tasks of the challenge is designed to address a specific applied problem in which some collaborating organization or sponsor is interested.
In 2017, this task was the Traffic Prediction problem, which was to be solved on the basis of the PaddlePaddle framework.1 It is implied that the approaches proposed by the participants will be further used, or somehow integrated into commercial or other products, by the parties involved in organizing the event. In order to do that, the organizers are motivated to provide the participants with real-life (or as close to it as possible) input data.

* This work was partly supported by the Center of Excellence "Geoinformation technologies and geophysical data complex interpretation" of the Ural Federal University Program.
1 The authors were awarded the Application Innovation Prize for the best solution of the problem.

In this paper, we describe the problem as it was stated and present our approach and results.

1.1 Existing Solutions Overview

Among the parametric methods, one of the most successful is ARIMA2, which generated a whole class of methods (subset ARIMA, seasonal ARIMA, ARIMA with exogenous factors, ARIMA with Kohonen maps, vector ARIMA). All these methods are based on the assumption that the variance and mean of the time series are stationary. The ARIMA method shows better accuracy than its predecessors in predicting short-term traffic changes on highways.

Parametric models have a number of advantages. First, such models are easy to build and understand. Second, the solution is simpler and takes a small amount of computational time. However, due to the nonlinear and stochastic nature of traffic, parametric models are not able to take into account the peculiarities of data of this kind as a whole, and they have a large prediction error in comparison with nonparametric models.

Recently, ITS3 researchers have started to utilize fully-connected deep learning architectures for predicting short-term traffic flow.
Researchers in this field have built a DNN4 to capture the spatio-temporal features of the transport stream and developed a multi-task architecture for forecasting stationary and dynamic road traffic [2, 3]. Other researchers suggested using an SAE5 model for short-term traffic flow prediction [4]. These approaches allowed the future transport flow to be predicted accurately to some extent; however, they did not use the local topology of the road network or long-term data on the transport flow, which significantly reduced their predictive capabilities. A graph-based neural network model was also developed and showed an improvement in predicting long-term dependencies while taking spatial data features into account [5]. However, such a model gave low accuracy in forecasting short-term traffic. In recent studies, a model was developed that combines the architectures of a convolutional neural network and an LSTM6, showing a slight improvement in accuracy with regard to spatial features [6]. The convolutional neural network layer processed the spatial features, and several layers of LSTM processed the short-term variations and the frequency of the transport stream.

2 Autoregressive integrated moving average
3 Intelligent transportation systems
4 Deep neural network
5 Stacked autoencoder
6 Long short-term memory recurrent neural network

2 Problem Statement

2.1 Data Samples Representation

A city can be viewed as a set of connected roads; at any given time, each road has a numerical congestion characteristic X_{u_i,t} ∈ {0, 1, 2, 3, 4}, i.e., a number which represents how "severe" the congestion on the current road is (see Table 1).

Table 1.
Numerical congestion characteristic

Congestion characteristic  Description
0                          No data
1                          Healthy traffic, vehicles can drive freely within the speed limit
2                          Minor congestion
3                          Medium congestion
4                          Drastic congestion, the traffic is almost completely stopped

It may look like the traffic characteristic has been simplified too much, but in this case we find it more suitable than some real physical quantity such as average speed, for the following reasons:

– The traffic forecasting results in this particular case are targeted at human users (the road users themselves). We find that a short-scale congestion characteristic is much more intuitive for people, because it is easy to understand and, most importantly, easy to compare the current road condition to "normal" traffic or to what it was like before.
– The congestion characteristic incorporates road parameters such as speed limits and road quality. For example, an average speed of 40 km/h can be considered good in a busy downtown or on a field road, but it is absolutely inadequate for a highway. So in the first case the congestion value can be defined as 1 and in the second as 3, even though the average speed is the same. Thus, users do not need to take any additional parameters into consideration; they can understand right away how "good" or "bad" the traffic on a particular road is.
– The collected data samples are usually not evenly distributed over time, which can introduce instability into the system. For example, if speed data is acquired through drivers' cellphones, the amount of collected data is proportional to the number of drivers who decided to drive along the particular road. By coarsening the data, we get rid of its fluctuations and make it easy to interpolate in the case of insufficient data.
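As an illustration of the second point, a hypothetical mapping from average speed to the congestion scale of Table 1 might look like the sketch below. The function name and the ratio thresholds (0.8, 0.5, 0.25) are our own assumptions for illustration and are not part of the provided data:

```python
def congestion_level(avg_speed_kmh: float, free_flow_kmh: float) -> int:
    """Map an average speed to the 0-4 congestion scale of Table 1.

    The same absolute speed yields different levels depending on the
    road's normal free-flow speed, so speed limits and road quality are
    folded into a single intuitive number. Thresholds are illustrative.
    """
    if avg_speed_kmh <= 0 or free_flow_kmh <= 0:
        return 0          # no (or invalid) data
    ratio = avg_speed_kmh / free_flow_kmh
    if ratio >= 0.8:
        return 1          # healthy traffic
    if ratio >= 0.5:
        return 2          # minor congestion
    if ratio >= 0.25:
        return 3          # medium congestion
    return 4              # drastic congestion
```

Under these hypothetical thresholds, 40 km/h on a downtown street with a free-flow speed of about 45 km/h maps to level 1, while the same 40 km/h on a highway with a free-flow speed of about 110 km/h maps to level 3, mirroring the example above.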
In addition to the collected time-dependent traffic data, we also consider road connectivity information, which is represented by the directed graph G(V, A), where V is the set of roads and A is a set of ordered pairs of vertices u_i, u_j ∈ V denoting intersections of roads.

2.2 Metric

In order to be able to compare different prediction results and reduce the task to a minimization problem, a representative metric has to be chosen. In this case, the results were evaluated by the RMSE7. The RMSE is a very common choice for minimization problems. While its main advantages are continuity and differentiability, we also find it very intuitive at representing how "good" a result is. Simply analyzing the structure of the problem, we can determine a few things about the RMSE: in the worst-case scenario, when the prediction and the target are as far away from each other as possible, RMSE = 3 (since X_{actual,i} ∈ {1, 2, 3, 4} and X_{model,i} ∈ {1, 2, 3, 4}); in the best-case scenario, RMSE = 0.

Now the forecasting problem can be reduced to the minimization problem of finding m traffic states of the node u_i ∈ V using its n previous states:

  \operatorname{argmin} \sqrt{ \frac{ \sum_{t=1}^{n} \left( X_{u_i,t} - X^{*}_{u_i,t} \right)^{2} }{ m } },    (1)

where X_{u_i,t} is the observed value of node u_i at the instant t, and X^{*}_{u_i,t} is the predicted value of node u_i at the instant t.

3 Initial Data Analysis

In the course of our work, we had only one data source for all the experiments, but its spatial resolution was sufficient to conduct a number of independent tests (by splitting it into several non-overlapping training and testing samples). The size of the whole provided dataset relates to the size of the prediction as 400 to 1.

3.1 Data Format

The data is aggregated in 5-minute intervals, from 00:00 a.m. on March 1st to 8:00 a.m. on May 25th, 2016. Every measurement is denoted by one of the four states described earlier. A traffic intensity map is shown in Fig. 1.
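As a minimal sketch, the evaluation metric of Eq. (1) for a single node can be computed as follows (pure Python; the function name is ours, and we normalize by the length of the compared series, i.e. the standard RMSE form):

```python
from math import sqrt

def rmse(observed, predicted):
    """Root-mean-square error between an observed congestion series
    X_{u_i,t} and a predicted series X*_{u_i,t} for one road node."""
    assert len(observed) == len(predicted) and observed
    m = len(observed)
    return sqrt(sum((x - p) ** 2 for x, p in zip(observed, predicted)) / m)
```

With congestion values restricted to {1, 2, 3, 4}, this returns 0 for a perfect prediction and 3 in the worst case, matching the bounds discussed above.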
Our task was to predict the traffic for the following 2 hours, from 8:05 a.m. to 10:00 a.m. on May 25th.

3.2 Data Analysis

The initial data contain several anomalous regions: periodic absences of data (white regions) from 5:00 a.m. on Saturday to 5:00 a.m. on Monday (Fig. 1), stochastic anomalies, and nonuniformity of values (Fig. 2). Small anomalies were approximated by neighboring values, while large regions were simply removed from the training dataset.

7 Root-mean-square error

Fig. 1. Initial data (legend: no data, fluent traffic, slow traffic, congestion, extreme congestion)

Fig. 2. Anomaly example

Figure 1 shows the initial data for a road graph consisting of 349 nodes. White gaps mean the absence of observations at the given time. The colored spots are the values of the traffic congestion (from 1 to 4); the darker the area, the worse the congestion. The figure shows that there are entire blocks of missing data, and these blocks are periodic. Most of these blocks represent missing traffic information for two days, from 05:00 Saturday to 05:00 Monday. One of these blocks is three days long, from 05:00 on 02/04/2016 to 05:00 on 05/04/2016. In addition to the missing blocks, information about the congestion on a road could also be lacking at any time, at arbitrary nodes of the network, and for different lengths of time. Thus, analysis of the initial data shows that they are highly sporadic and inconsistent.

In order to be able to preprocess the initial data and choose an algorithm for solving the problem, we analyzed the average daily traffic on different days of the week. Figure 3 shows the average traffic load graphs for the entire system of roads.

Fig. 3.
Traffic intensity by the days of the week on separate roads

3.3 Data Preprocessing

The initial data contain both data useful for training the neural network (traffic congestion values) and filler values (zeros) denoting the instants when no data is available. If such data are fed to the neural network input during training without preprocessing, a good result cannot be expected, since the blocks of missing data will disrupt the learning process.

In order to improve the quality of traffic forecasting, all the data gaps should be eliminated. We can split this task into two stages: the elimination of the large periodic groups of gaps and the elimination of relatively isolated gaps in random places. In the case of the periodic blocks, we simply cut these blocks out of the original data and concatenate the remaining parts in such a way that there are no gaps in the timestamps of the day. The random data gaps are somewhat more difficult to handle, because they can arise at arbitrary places and have an arbitrary length in time. Their processing consists in interpolating such intervals with averaged values from several of the closest surrounding points of known data. At the top of Fig. 4, a part of the initial data is shown; at the bottom, the same data are shown after preprocessing, with the interpolated values marked in red. After the preprocessing was applied, the initial data shrank from approximately 85 days to 61 (due to 24 days of missing data).

Fig. 4. Interpolation of the missing traffic data

4 Implementation

4.1 Design of the Algorithm

The proposed algorithm is based on a deep recurrent neural network with a long short-term memory layer. As shown in Fig. 5, the model consists of 5 layers: the input data layer (n neurons), a fully-connected layer (k neurons), an LSTM layer (k neurons), a fully-connected layer (4 neurons), and the output data layer (4 neurons).
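The LSTM layer at the core of this model can be illustrated with a framework-agnostic sketch of a single cell step (NumPy; the weight layout, gate order, and names below are our own conventions and need not match PaddlePaddle's internal implementation):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM cell step.

    x      : input vector, shape (n,)
    h_prev : previous hidden state, shape (k,)
    c_prev : previous cell state, shape (k,)
    W, U   : input/recurrent weights, shapes (4k, n) and (4k, k)
    b      : bias, shape (4k,)
    Gate order in the stacked weights: input, forget, cell, output.
    """
    k = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    i = sigmoid(z[0*k:1*k])      # input gate
    f = sigmoid(z[1*k:2*k])      # forget gate
    g = np.tanh(z[2*k:3*k])      # candidate cell state
    o = sigmoid(z[3*k:4*k])      # output gate
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

Iterating such a step over the n input time steps, between the two fully-connected layers described above, yields the hidden sequence from which the output layer produces the predicted values.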
The main idea of the algorithm is that the intensity values of the neighboring nodes affect the current node; therefore, one should consider those values to predict the traffic intensity of the current node. At each time instant, the neural network input is fed with the traffic intensity values from the neighboring roads, or from the entire graph (if the computing capabilities are sufficient), at the previous point in time. Training (and prediction) is conducted for the current road at m time points after the time point from which the data are fed to the input. All m points in time are predicted in parallel, as can be seen in Fig. 5. The final layer of the neural network outputs a set of m values corresponding to each predicted instant. The PaddlePaddle framework [7] was used to implement the neural network.

Fig. 5. Neural network architecture

4.2 Model Training

Table 2 shows the different model configurations. There are three models denoted LSTM_λ, where λ is the mean radius of neighboring nodes. The model LSTM_v uses all the graph nodes for training and for predicting the following values; n is the number of input values (for each node to be predicted), k is the number of hidden neurons (for each node to be predicted), epoch is the number of epochs, and learning rate is the learning rate. During training, we used the sliding window method to predict the next m values.

Even though our approach is designed and tested using PaddlePaddle, the reader should keep in mind that it is just one of the many implementations of ANN8 algorithms, and all the described methods can be adapted to any other ANN implementation without any effect on the output result whatsoever.

8 Artificial neural network

Table 2.
Model configurations

model_λ  n    k    epoch  learning rate
LSTM_0   1    2    10     10^-3
LSTM_1   5    6    5      10^-3
LSTM_1   5    6    10     10^-3
LSTM_2   12   15   5      10^-3
LSTM_2   12   15   10     10^-3
LSTM_v   349  400  10     10^-4

5 Experiments

Figure 6 presents the results of the tests. Increasing the number of neighboring nodes and epochs beyond some point does not show any significant reduction of the RMSE; the best model is (as expected) LSTM_v, which uses all the graph nodes.

Fig. 6. Training results

One of the greatest advantages of the PaddlePaddle framework is that it can utilize the power of modern GPUs right out of the box. All experiments were performed on the Ural Federal University cluster with Nvidia Tesla K20 GPUs. A comparison of the training times using a CPU and a GPU is shown in Table 3. It is apparent that small neural networks (with only dozens of neurons) cannot fully take advantage of the GPU computational power, but as the size of the ANN grows, the GPU learning time becomes up to 10 times smaller than that of a CPU. Since there is an isolated instance of the ANN for each node of the road graph, we can easily spread the learning process over a cluster of computational nodes without the need for internode communications, thus getting a linear computational speedup.

Table 3. Training times: CPU vs. GPU

model_λ  CPU time, min.  GPU time, min.
LSTM_2   78              460
LSTM_v   4688            832

6 Conclusion

A new neural network architecture and a new preprocessing algorithm for short-term traffic forecasting were proposed. Experiments with different types of neural network layers showed that simple fully-connected layers with one LSTM layer yield the best result for the task. The constructed implementation allows the task to be easily scaled in the number of road graph nodes by limiting the radius of neighboring nodes. The PaddlePaddle framework made it possible to utilize the power of modern high-performance GPU solutions in the implementation without modifying the source code.

References

1. ASC Student Supercomputer Challenge.
http://www.asc-events.org/
2. Huang, W., Song, G., Hong, H., Xie, K.: Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 2191–2201 (2014)
3. Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief nets. Neural Computation, vol. 18, no. 7, pp. 1527–1554 (2006)
4. Lv, Y., Duan, Y., Kang, W., Li, Z., Wang, F.-Y.: Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873 (2015)
5. Shahsavari, B.: Short-term traffic forecasting: modeling and learning spatio-temporal relations in transportation networks using graph neural networks. Master's thesis, University of California, Berkeley (2015)
6. Wu, Y., Tan, H.: Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. https://arxiv.org/pdf/1612.01022
7. PaddlePaddle: parallel distributed deep learning platform. http://doc.paddlepaddle.org/release doc/0.9.0/doc/