<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Traffic Forecasting Using PaddlePaddle</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>PaddlePaddle GPU</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Krasovskii Institute of Mathematics and Mechanics</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ural Federal University</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>102</fpage>
      <lpage>111</lpage>
      <abstract>
        <p>The traffic forecasting problem is considered. A new traffic prediction algorithm is designed. The algorithm, based on an original deep neural network model, is implemented with the PaddlePaddle deep learning framework and uses a long short-term memory layer to improve prediction accuracy. All experiments have been performed on the Ural Federal University cluster with Nvidia Tesla K20 GPUs.</p>
      </abstract>
      <kwd-group>
        <kwd>forecasting</kwd>
        <kwd>deep learning</kwd>
        <kwd>LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper, we describe the problem as it was stated and present our
approach and results.</p>
      <sec id="sec-1-1">
        <title>Existing Solutions Overview</title>
        <p>Among the parametric methods, one of the most successful is ARIMA
(autoregressive integrated moving average), which generated a whole class of
methods (subset ARIMA, seasonal ARIMA, ARIMA with exogenous factors, ARIMA with
Kohonen maps, vector ARIMA). All these methods are based on the assumption that
the variance and mean of the time series are stationary. The ARIMA method shows
better accuracy than its predecessors in predicting short-term traffic changes
on highways.</p>
        <p>Parametric models have a number of advantages. First, such models are easy
to build and understand. Second, the solution is simpler and requires little
computation time. However, due to the nonlinearity and stochastic nature of
traffic, parametric models cannot fully capture the peculiarities of such data
and have a large prediction error in comparison with nonparametric models.</p>
        <p>
          Recently, intelligent transportation systems (ITS) have started to utilize
fully connected architectures of deep learning models for predicting short-term
traffic flow. Researchers in this field have built a deep neural network (DNN)
to capture the spatio-temporal features of the transport stream and developed a
multi-task architecture for forecasting stationary and dynamic road traffic [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ]. Other researchers suggested using a stacked autoencoder (SAE) model for
predicting short-term traffic flow [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. These approaches allowed one to predict the future transport flow
fairly accurately; however, they did not use the local topology of the road
network or long-term data on the transport flow, which significantly reduced
their predictive capabilities.
        </p>
        <p>
          A graph-based neural network model was also developed and showed an
improvement in predicting long-term dependencies while taking into account
spatial data features [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. However, such a model gave low accuracy in forecasting short-term
traffic.
        </p>
        <p>
          In recent studies, a model was developed that combines the architectures of
a convolutional neural network and an LSTM (long short-term memory) recurrent
neural network, showing a slight improvement in accuracy with regard to spatial
features [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The convolutional layer processed spatial features, and several LSTM
layers processed short-term variations and the frequency of the transport
stream.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Data Samples Representation</title>
        <p>A city can be viewed as a set of connected roads; each road at any given time has
a numerical congestion characteristic <italic>X</italic><sub>u_i,t</sub> ∈ {0, 1, 2, 3, 4}, i.e. a number which
represents how "severe" the congestion on the current road is (see Table 1).</p>
        <p>It may look like the traffic characteristic has been simplified too much, but in
this case we find it more suitable than a real physical quantity like average
speed, for the following reasons:</p>
        <list list-type="bullet">
          <list-item>
            <p>The traffic forecasting results in this particular case are targeted at human
use (road users themselves). We find a short-scale congestion characteristic
much more intuitive for people, because it is easy to understand and, most
importantly, easy to compare the current road condition to "normal" traffic or
to what it was like before.</p>
          </list-item>
          <list-item>
            <p>The congestion characteristic incorporates road parameters such as speed
limits and road quality. For example, an average speed of 40 km/h can be
considered good in a busy downtown or on a field road, but it is absolutely
inadequate for a highway. So in the first case the congestion value can be
defined as 1 and in the second as 3, even though the average speed is the same.
Users therefore do not need to take any additional parameters into
consideration; they can tell right away how "good" or "bad" the traffic on a
particular road is.</p>
          </list-item>
          <list-item>
            <p>The collected data samples are usually not evenly distributed over time,
which can introduce instability into the system. For example, if speed data is
acquired through drivers' cellphones, the amount of collected data is
proportional to the number of drivers who decided to drive through a particular
road. By coarsening the data, we get rid of its fluctuations and make it easy
to interpolate in the case of insufficient data.</p>
          </list-item>
        </list>
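        <p>As an illustration of how such a coarse scale can absorb road-specific parameters, here is a hypothetical Python sketch. The thresholds and the normalization by a per-road free-flow speed are invented for this example and are not taken from the paper:</p>

```python
# Hypothetical mapping from average speed to a congestion level in {1, 2, 3, 4}.
# Normalizing by the road's own free-flow speed lets the same physical speed
# map to different congestion levels on different roads, as described above.
def congestion_level(avg_speed_kmh, free_flow_kmh):
    """Return a congestion level: 1 = fluent ... 4 = extremely congested."""
    if not free_flow_kmh > 0:
        raise ValueError("free-flow speed must be positive")
    ratio = avg_speed_kmh / free_flow_kmh
    if ratio >= 0.8:
        return 1  # fluent
    if ratio >= 0.5:
        return 2  # slow
    if ratio >= 0.25:
        return 3  # congested
    return 4      # extremely congested

# The same 40 km/h reads differently depending on the road:
print(congestion_level(40, 50))   # downtown street -> 1
print(congestion_level(40, 110))  # highway -> 3
```

This reproduces the example in the text: 40 km/h yields congestion value 1 downtown but 3 on a highway.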
        <p>In addition to the collected time-dependent traffic data, we also consider
road connectivity information, which is represented by the oriented graph G(V, A),
where V is the road set and A is a set of ordered pairs of vertices
<italic>u</italic><sub>i</sub>, <italic>u</italic><sub>j</sub> ∈ V denoting intersections of roads.</p>
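        <p>A minimal Python sketch of this connectivity structure; the road names are invented for illustration:</p>

```python
# G(V, A): V is the set of roads, A the set of ordered pairs (u_i, u_j)
# meaning traffic can pass from road u_i to road u_j at an intersection.
V = {"elm_st", "oak_ave", "main_hwy"}
A = {("elm_st", "oak_ave"), ("oak_ave", "main_hwy"), ("main_hwy", "elm_st")}

def successors(u, arcs):
    """Roads directly reachable from road u, i.e. its out-neighbours in G."""
    return sorted(v for (w, v) in arcs if w == u)

print(successors("elm_st", A))  # ['oak_ave']
```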
      </sec>
      <sec id="sec-1-3">
        <title>Metric</title>
        <p>In order to be able to compare different prediction results and reduce the task
to a minimization problem, a representative metric must be chosen. In this case,
the results were evaluated by the RMSE (root-mean-square error). The RMSE is a
very common choice for many minimization problems. While its main advantages are
continuity and differentiability, we also find it very intuitive at representing
how "good" the result is. Simply analyzing the structure of the problem, we can
determine a few things about the RMSE: in the worst-case scenario, when the
prediction and the target are as far away from each other as possible,
RMSE = 3 (since <italic>X</italic><sub>actual,i</sub> ∈ {1, 2, 3, 4}
and <italic>X</italic><sub>model,i</sub> ∈ {1, 2, 3, 4}); in the best-case scenario, RMSE = 0.
Now the forecasting problem can be reduced to the minimization problem of
finding the m traffic states of node <italic>u</italic><sub>i</sub> in V using the n previous states:
RMSE = √( (1/m) Σ<sub>t</sub> (X̂<sub>u_i,t</sub> − X<sub>u_i,t</sub>)² ),
where <italic>X</italic><sub>u_i,t</sub> is the observed value of node <italic>u</italic><sub>i</sub> at instant t,
while X̂<sub>u_i,t</sub> is the predicted value of node <italic>u</italic><sub>i</sub> at instant t.</p>
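        <p>The worst-case and best-case bounds above are easy to verify with a direct computation (a small Python sketch, not the authors' evaluation code):</p>

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted and observed congestion levels."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

# Worst case: predictions and targets as far apart as possible in {1, 2, 3, 4}.
print(rmse([1, 1, 1], [4, 4, 4]))  # 3.0
# Best case: a perfect prediction.
print(rmse([2, 3, 4], [2, 3, 4]))  # 0.0
```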
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Initial Data Analysis</title>
      <p>In the course of our work, we had only one data source for all the experiments,
but its spatial resolution was sufficient to conduct a number of independent tests
(by splitting it into several non-overlapping training and testing samples). The size
of the whole provided dataset relates to the size of the prediction as 400 to 1.</p>
      <sec id="sec-2-1">
        <title>Data Format</title>
        <p>The data is aggregated into 5-minute intervals, from 00:00 a.m. on March 1st to
8:00 a.m. on May 25th, 2016. Every measurement is denoted by one of four states, as
described earlier. A traffic intensity map is shown in Fig. 1. Our task was to
predict the traffic in the following 2 hours, from 8:05 a.m. to 10:00 a.m. on May 25th.</p>
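        <p>As a quick sanity check on these figures (a Python sketch; the arithmetic follows directly from the dates and the 5-minute interval stated above):</p>

```python
from datetime import datetime

INTERVAL_MIN = 5

# The prediction horizon is 2 hours (8:05-10:00 a.m.), i.e. m output steps:
m = 2 * 60 // INTERVAL_MIN
print(m)  # 24

# Length of the observed series, from 00:00 March 1 to 08:00 May 25, 2016:
span = datetime(2016, 5, 25, 8) - datetime(2016, 3, 1, 0)
n_samples = int(span.total_seconds() // 60 // INTERVAL_MIN)
print(n_samples)  # 24576 five-minute measurements (about 85 days)
```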
      </sec>
      <sec id="sec-2-2">
        <title>Data Analysis</title>
        <p>The initial data contain several anomalous regions: periodic absences of
data (white regions) from 5:00 a.m. on Saturday to 5:00 a.m. on Monday (Fig. 1),
stochastic anomalies, and nonuniformity of values (Fig. 2). Small anomalies
were approximated by neighboring values, but large regions were simply removed from
the training dataset.</p>
        <sec id="sec-2-2-1">
          <title>Data Preprocessing</title>
          <p>[Figs. 1, 2: traffic intensity maps over March and April 2016; legend:
no data, fluent, slow, congested, extremely congested.]</p>
          <p>The initial data contain both useful data for training the neural network (traffic
congestion values) and filler values (zeros) denoting the instants when no data
is available. If such data is fed to the neural network input during
training without preprocessing, a good result is not to be expected, since
blocks of missing data will disrupt the learning process.</p>
          <p>In order to improve the quality of traffic forecasting, all the data gaps should
be eliminated. We can split this task into two stages: the elimination of large
periodic groups of gaps, and the elimination of relatively isolated gaps in random
places. In the case of periodic blocks, we simply cut these blocks out of the
original data and concatenate the remaining parts in such a way that there are
no gaps in the timestamps of the day. The random data gaps are somewhat more
difficult to handle because they can arise at arbitrary places and have an
arbitrary length in time. The processing consists in interpolating such intervals with
averaged values from several of the closest surrounding points of known data. At the
top of Fig. 4, a part of the initial data is shown; at the bottom, the same data
after preprocessing, with interpolated values marked in red.</p>
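          <p>The second stage (filling isolated gaps with averaged values from surrounding points) can be sketched as follows; the window size and the toy series are invented for illustration, with 0 marking a missing sample:</p>

```python
# Fill small isolated gaps (zeros) with the rounded average of the nearest
# known neighbours. Large periodic blocks would already have been cut out
# and the remaining parts concatenated before this step.
def fill_small_gaps(series, window=2):
    filled = list(series)
    for i, x in enumerate(series):
        if x == 0:  # missing sample
            lo = max(0, i - window)
            hi = min(len(series), i + window + 1)
            neighbours = [series[j] for j in range(lo, hi) if series[j] != 0]
            if neighbours:
                filled[i] = round(sum(neighbours) / len(neighbours))
    return filled

print(fill_small_gaps([2, 2, 0, 4, 4]))  # [2, 2, 3, 4, 4]
```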
          <p>After the preprocessing was applied, the initial data shrank from
approximately 85 days to 61 (due to 24 days of missing data).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation</title>
      <sec id="sec-3-1">
        <title>Design of the Algorithm</title>
        <p>The proposed algorithm is based on a deep recurrent neural network with a long
short-term memory layer. As shown in Fig. 5, the model consists of 5 layers: the input
data layer (n neurons), a fully connected layer (k neurons), an LSTM layer (k neurons),
a fully connected layer (4 neurons), and the output data layer (4 neurons). The main
idea of the algorithm is that the intensity values of the neighboring nodes affect the
current node and, therefore, one should consider those values to predict the traffic
intensity of the current node.</p>
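        <p>For concreteness, below is an illustrative NumPy forward pass with the same layer shapes as Fig. 5 (input n → FC k → LSTM k → FC 4 → output 4). This is not the authors' PaddlePaddle implementation: the sizes n and k, the random weights, and the tanh activations are placeholder assumptions for this sketch only:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 32, 64          # example sizes; the paper leaves n and k configurable

def fc(x, w, b):
    """Fully connected layer with tanh activation (activation assumed)."""
    return np.tanh(w @ x + b)

def lstm_step(x, h, c, W):
    """One LSTM step over a k-dimensional input (standard gate equations)."""
    z = W @ np.concatenate([x, h])            # shape (4k,)
    i, f, g, o = np.split(z, 4)
    i, f, o = map(lambda v: 1 / (1 + np.exp(-v)), (i, f, o))  # sigmoid gates
    c_new = f * c + i * np.tanh(g)
    return o * np.tanh(c_new), c_new

x = rng.standard_normal(n)                    # congestion values of neighbours
h = c = np.zeros(k)
W1 = rng.standard_normal((k, n)); b1 = np.zeros(k)   # input FC: n -> k
Wl = rng.standard_normal((4 * k, 2 * k))             # LSTM weights: k -> k
W2 = rng.standard_normal((4, k)); b2 = np.zeros(4)   # output FC: k -> 4

h, c = lstm_step(fc(x, W1, b1), h, c, Wl)
y = fc(h, W2, b2)          # one 4-dimensional output per predicted instant
print(y.shape)             # (4,)
```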
        <p>
          At each time instant, the neural network input is fed with the traffic
intensity values from the neighboring roads, or from the entire graph (if
computing capabilities are sufficient), at the previous point of time. Training
(and prediction) is conducted for the current road at m time points after the
time point from which the data are fed to the input. All m points of time are
predicted in parallel, as can be seen in Fig. 5. The final layer of the neural
network outputs a set of m values corresponding to each predicted instant. For
the implementation of the neural network, the PaddlePaddle [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] framework has been used.
Table 2 shows different model configurations. There are 3 models denoted LSTM
with a subscript giving the mean radius of neighboring nodes; the model
LSTM<sub>v</sub> uses all the graph nodes for training and prediction. Here n is
the number of input values (for each node to be predicted), k is the number of
hidden neurons (for each node to be predicted), epoch is the number of training
epochs, and learning rate is the optimizer's learning rate.
During training, we used the sliding-window method to predict the next m values.
        </p>
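        <p>The sliding-window construction can be sketched as follows (a hypothetical helper, with n and m as defined above): each training sample pairs n consecutive past values with the m values that follow them.</p>

```python
# Build (past, future) training samples by sliding a window over the series.
def sliding_windows(series, n, m):
    samples = []
    for start in range(len(series) - n - m + 1):
        past = series[start:start + n]          # n observed values (input)
        future = series[start + n:start + n + m]  # next m values (target)
        samples.append((past, future))
    return samples

data = [1, 2, 3, 4, 3, 2, 1, 2]
for past, future in sliding_windows(data, n=4, m=2):
    print(past, "->", future)
# first sample: [1, 2, 3, 4] -> [3, 2]
```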
        <p>Even though our approach is designed and tested using PaddlePaddle, the
reader should keep in mind that it is just one of the many implementations of
ANN (artificial neural network) algorithms, and all the described methods can be
adapted to any other ANN implementation without any effect on the output result
whatsoever.</p>
        <sec id="sec-3-1-1">
          <title>Conclusion</title>
          <p>A new neural network architecture and a new preprocessing algorithm for
short-term traffic forecasting were proposed. Experiments with different types of
neural network layers showed that simple fully connected layers with one LSTM layer
yield the best result for the task. The constructed implementation allows the
task to be easily scaled in the number of road-graph nodes by limiting the radius
of neighboring nodes. The PaddlePaddle framework allowed us to utilize in the
implementation the power of modern high-performance GPU solutions without
modifying the source code.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. ASC Student Supercomputer Challenge. http://www.asc-events.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Huang</surname> <given-names>W.</given-names></string-name>,
          <string-name><surname>Song</surname> <given-names>G.</given-names></string-name>,
          <string-name><surname>Hong</surname> <given-names>H.</given-names></string-name>, and
          <string-name><surname>Xie</surname> <given-names>K.</given-names></string-name>:
          <article-title>Deep architecture for traffic flow prediction: deep belief networks with multitask learning</article-title>,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>.
          Vol. <volume>15</volume>, no. <issue>5</issue>,
          p. <fpage>2191</fpage>-<lpage>2201</lpage>
          (<year>2014</year>).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Hinton</surname> <given-names>G. E.</given-names></string-name>,
          <string-name><surname>Osindero</surname> <given-names>S.</given-names></string-name>, and
          <string-name><surname>Teh</surname> <given-names>Y.-W.</given-names></string-name>:
          <article-title>A fast learning algorithm for deep belief nets</article-title>,
          <source>Neural Computation</source>.
          Vol. <volume>18</volume>, no. <issue>7</issue>,
          p. <fpage>1527</fpage>-<lpage>1554</lpage>
          (<year>2006</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Lv</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Duan</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Kang</surname> <given-names>W.</given-names></string-name>,
          <string-name><surname>Li</surname> <given-names>Z.</given-names></string-name>, and
          <string-name><surname>Wang</surname> <given-names>F.-Y.</given-names></string-name>:
          <article-title>Traffic flow prediction with big data: a deep learning approach</article-title>,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>.
          Vol. <volume>16</volume>, no. <issue>2</issue>,
          p. <fpage>865</fpage>-<lpage>873</lpage>
          (<year>2015</year>).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Shahsavari</surname> <given-names>B.</given-names></string-name>:
          <article-title>Short-term traffic forecasting: modeling and learning spatio-temporal relations in transportation networks using graph neural networks</article-title>.
          University of California, Berkeley (<year>2015</year>).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Wu</surname> <given-names>Y.</given-names></string-name>,
          <string-name><surname>Tan</surname> <given-names>H.</given-names></string-name>:
          <article-title>Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework</article-title>.
          https://arxiv.org/pdf/1612.01022
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <article-title>PaddlePaddle: parallel distributed deep learning platform</article-title>.
          http://doc.paddlepaddle.org/release_doc/0.9.0/doc/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>