Graph-based Sparse Neural Networks for Traffic Signal Optimization

Lukasz Skowronek2, Pawel Gora1,2, Marcin Mozejko2 and Arkadiusz Klemenko2
1 Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
2 TensorCell

Abstract

We investigate the performance of sparsely connected neural networks, with connectivity determined by road network graphs, for solving the Traffic Signal Setting optimization problem. We conducted experiments on three realistic road network topologies and found these graph neural networks superior to fully connected ones, both in terms of generalization on fixed test sets and - more importantly - near the target function minima obtained in the gradient descent optimization process. We additionally confirm the soundness of our method by showing that random perturbations of the actual graph lead to consistent deterioration of model performance.

Keywords

traffic optimization, graph neural networks, Traffic Signal Setting problem, surrogate modelling

1. Introduction

Traffic optimization problems have a natural underlying graph structure, determined by the topology of the corresponding road network. In this paper, we introduce a neural network architecture based on a road network graph adjacency matrix to solve the so-called Traffic Signal Setting (TSS) problem, in which the goal is to find the optimal traffic signal settings for given traffic conditions (as defined in [1]). Some variants of this problem have been proven NP-hard even for very simple traffic models ([2]), and therefore heuristics and approximations have been used to solve it ([1]), but the existing approaches still have some drawbacks. For example, evaluating the quality of traffic signal settings using accurate traffic simulations (which is a standard evaluation method) can be too time-consuming, especially for large-scale road networks and/or online evaluation ([3, 4]).
Also, the size of the space of possible solutions is so large that it is infeasible, in any reasonable time, to find global minima (or even relatively good signal settings) of the simulator output by checking all possible solutions or by random search ([1]), as most points in the input space are far from the optimal solutions.

29th International Workshop on Concurrency, Specification and Programming (CS&P'21)
p.gora@mimuw.edu.pl (P. Gora)
https://www.mimuw.edu.pl/~pawelg (P. Gora)
ORCID: 0000-0002-8037-5704 (P. Gora)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

A strategy for overcoming these two difficulties was presented in [1] and consists of generating a reasonably sized training set using a traffic simulator and then fitting a machine learning model that approximates the outcomes of traffic simulations quickly and accurately. The output of such models can then be minimized using an optimization algorithm, such as gradient descent, in the hope of obtaining close to optimal traffic signal settings. This strategy turned out to be quite successful ([1, 5, 4]), yet the models' accuracy degraded close to the points considered as minima by the optimization algorithm and the model, making further optimization far more difficult ([3, 4]). In this paper, we show that the graph-based neural networks (GNN) that we use, built on road network graphs, can outperform feed-forward fully connected neural networks (FCNN) on this task. Compared to standard FCNNs, the introduced GNNs have most of the connections removed, keeping only the crucial connections between neurons (so that the information flow corresponds to the traffic flow in the road network), which makes the architecture relatively sparse, easier to train and better at generalizing.
As a consequence, GNNs have better accuracy on the test set, as well as close to the local optima found using gradient descent optimization applied to the TSS problem. We also show that the GNN architectures work better than analogous ones built from perturbed adjacency matrices.

The rest of the paper is organized as follows. Section 2 puts our work in the context of research on building surrogate models for complex processes, solving the TSS problem, and using graph neural networks. Section 3 presents the two types of graph neural network architectures we used. In Section 4, we describe the setup of our main experiments, including a description of the used datasets and their generation. Section 5 summarizes our main experiment results, showing that the neural network architectures introduced in this paper outperform other models used for this task. In Section 6, we summarize the results of several 'sanity checks' we performed in order to confirm that our results were not obtained by chance. The results in Section 6 can be especially interesting for researchers in graph neural networks; e.g., we show that the out-of-sample performance of graph-based sparse neural nets decreases (almost) monotonically as a function of the distance between the adjacency matrix we use for constructing the network and the true adjacency matrix. We summarize the presented research in Section 7, outlining some possible future research directions.

2. Related works

Complex processes, such as road traffic in cities, are difficult to study due to the large number of interacting components (e.g., vehicles), nondeterminism, or sensitive dependence on initial conditions. Very often, the only reasonable method to accurately predict the behaviour of such systems is to apply computer simulations, which can be time-consuming and usually cannot be simplified due to computational irreducibility.
However, in many tasks related to complex processes it is not necessary to obtain very accurate predictions; it can be sufficient to get approximate outcomes as fast as possible (due to stochasticity or sensitive dependence on initial conditions, it may be impossible to predict the exact value anyway). Therefore, in such cases it is natural to build so-called surrogate models (metamodels), which approximate outcomes of simulations very fast and with good accuracy ([6]). Such applications are especially common in optimization tasks, in which it is often necessary to run multiple simulations in order to evaluate many different input settings ([7, 8, 9]). This is the case for traffic optimization problems [10], such as the TSS problem. Many such surrogate models are based on machine learning methods, such as neural networks [7, 8], and some previous works on solving TSS [1, 5, 3, 4] also use various machine learning techniques (e.g., based on neural networks or gradient boosted decision trees) to build metamodels of traffic simulations, which were used to evaluate the quality of traffic signal settings. Such metamodels were able to approximate the outputs of simulations (the total times of waiting on red signals) with very good accuracy (e.g., values of the MAPE metric were at the level of 1-2%) and a few orders of magnitude faster than by running microscopic simulations [1, 5]. Thanks to that, it was possible to use optimization algorithms, such as genetic algorithms or gradient descent, to find heuristically optimal signal settings without performing extensive parameter space searches that would take weeks to complete [1, 5, 3, 4]. However, information about the road network structure has never been used in those experiments, even though it should naturally be relevant when optimizing traffic. Introducing a direct connection between the network architecture and the graph structure can help to leverage the additional information represented by the graph.
Similarly to our work, [11] introduces a graph NN layer in which each vertex has specific parameters assigned to combine information from its neighbors. However, unlike our method, this layer uses only the original graph matrix and skips the dual graph structure when performing computations. A notable usage of a dual structure can be found in [12], where it is compressed to a PPMI matrix using aggregated statistics from a random graph walk procedure. This aggregation is used to introduce a vertex neighborhood context, similarly to the popular t-SNE method [13]. [14] provides an extensive overview of different graph neural network architectures and applications.

Due to their capability to capture a road network structure, GNNs have been used in multiple traffic applications. In [15, 16, 17], the authors used spatio-temporal GNNs for traffic situation prediction, whereas [18] used the same technique to predict taxi demand. However, our application of graph neural networks in the traffic optimization domain and to the Traffic Signal Setting problem seems to be the first such approach.

3. Network architecture

The key idea in defining our sparse graph-based neural network architecture is an intuitively compelling rule that information/signal should propagate locally between the net layers. By locality, we mean the presence of only those neuron connections that have a corresponding non-zero entry in the adjacency matrix of the corresponding graph. In the case of the road network, in order to implement such a rule, the neurons in the successive layers of the neural network should be linked to the neurons corresponding to vertices and/or edges of the corresponding graph. Thus, we propose the following general ways to build a graph neural network (see Section 1 of the Supplementary materials ([19]) for mathematical formulas):

1. Neurons in the even-numbered layers, starting with the input layer as layer 0, correspond to graph vertices (in our case - road crossings).
Neurons in the odd-numbered layers correspond to graph edges (in our case - road segments). An exception should be the final layer, with just one neuron. Connections from a vertex-localized layer to an edge-localized layer should only be present if a given vertex is an end of a given edge in the corresponding road network graph. There are exactly two such connections for every edge neuron. Connections from an edge-localized layer to a vertex-localized layer should only be present if the edge has the vertex as its end in the corresponding road network graph. The number of such connections is equal to the number of the particular vertex's neighbors.

or

2. Neurons in all layers, with the exception of the output layer, correspond to road network graph vertices. Connections from a neuron in one layer to a neuron in the next one should only be present if the corresponding vertices are neighbors in the road network graph. The number of connections for a vertex neuron is equal to the number of the vertex's neighbors.

Although architecture 2 might seem more basic, architecture 1 appears to naturally model the traffic flow through the road network (see the Supplementary materials ([19]), Section 2, for a detailed explanation). In the rest of this paper, we focus solely on GNNs of architecture type 1. It should also be pointed out that GNNs can have multiple channels at each edge/vertex. The number of channels in each layer is a hyperparameter of the network. In the following, we always assume the number of channels to be constant across the hidden layers of the network. One may also notice a similarity between our GNN architecture 2 and the graph neural networks proposed by Kipf and Welling [20]. However, we do not share any weights in our model, as we aim to focus on local patterns connected to roads / crossroads.
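As an illustration, the connectivity rule of architecture type 1 can be expressed as a pair of sparse masks built directly from the adjacency matrix. The sketch below is our own minimal example, not the experiment code, and `build_type1_masks` is a hypothetical helper name; a dense weight matrix multiplied elementwise by such a mask keeps only the graph-local connections:

```python
import numpy as np

def build_type1_masks(adj):
    """Connectivity masks for one vertex-layer -> edge-layer -> vertex-layer
    block of an architecture-1 GNN (hypothetical helper, not the paper's code).

    adj: symmetric 0/1 adjacency matrix of the road network graph.
    Returns (v2e, e2v):
      v2e has shape (n_edges, n_vertices); row k has ones exactly at the two
      endpoints of edge k, so every edge neuron gets two incoming connections.
      e2v = v2e.T; row v has one entry per neighbor of vertex v.
    """
    n = adj.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i, j]]
    v2e = np.zeros((len(edges), n))
    for k, (i, j) in enumerate(edges):
        v2e[k, i] = v2e[k, j] = 1.0
    return v2e, v2e.T.copy()

# Example: a triangle graph (3 crossings, 3 road segments).
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]])
v2e, e2v = build_type1_masks(adj)
# During training, a dense weight matrix W of the same shape would be used
# as W * v2e, so gradients only ever flow through graph-local connections.
```

Under this construction, every edge neuron has exactly two incoming connections and every vertex neuron has as many as the vertex has neighbors, matching the counts stated above.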
Theoretically, we could introduce some weight sharing in the 'edge' layers of GNNs of type 1, but our first experiments with this approach led to highly disappointing results. In typical ML literature terminology, our GNNs should likely be called 'NNs with a fixed sparse connectivity mask'. In the case of multi-channel networks, sparsity is applied in the 'spatial', but not in the 'channel' dimension.

4. Experiment setup

In order to train the surrogate models, it was necessary to generate datasets first. For this task, we simulated vehicular traffic on 3 realistic road networks, corresponding to selected districts in Warsaw: Centrum, Ochota and Mokotów, including 11, 21 and 42 intersections with traffic signals, respectively. The simulations were run using a microscopic traffic simulator, the Traffic Simulation Framework [21], for which a road network description for Warsaw was obtained from the OpenStreetMap service [22]. The inputs to the simulator were vectors of lengths 11, 21 and 42, respectively. Each position in a vector represented the offset of a traffic signal at the corresponding intersection. The offsets are shifts with respect to a global two-minute traffic signal cycle start - times from the beginning of the simulation to the first switch from the green signal state to the red signal state (it was assumed for simplicity that the duration of a green signal phase is always equal to 58 seconds, while the duration of a red signal phase is equal to 62 seconds, constituting a 120-second cycle). The offsets were provided as integers, measured in seconds, hence they ranged from 0 to 119 (note the periodicity of these variables). The simulator output in each case was the total waiting time on red signals, summed over all the cars participating in the simulation in the considered area (finding the inputs minimizing this output value was the optimization goal of the considered TSS problem instance).
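For concreteness, one consistent reading of the offset semantics above can be expressed in code. This is an illustrative sketch under our interpretation of the description, not the TSF implementation:

```python
GREEN_PHASE = 58  # seconds of green per 120-second cycle (as assumed above)
RED_PHASE = 62    # seconds of red per cycle

def signal_state(t, offset, cycle=120):
    """Signal state at time t (seconds from simulation start) for a given
    offset: the signal starts green, first switches green -> red at
    t == offset, stays red for 62 s, then green for 58 s, and so on."""
    phase = (t - offset) % cycle
    return "red" if phase < RED_PHASE else "green"

# With offset 10: green until t = 9, red for t = 10..71, green again at t = 72.
```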
Each simulation lasted 10 minutes and involved 42000 cars on the whole road network of Warsaw. The datasets for Ochota, Mokotów and Centrum were generated using approximately 100000 randomly selected inputs for the TSF simulator (the input offset values from the set {0, 1, . . . , 119} were sampled from the uniform distribution independently). These datasets are publicly available to enable further research ([23]).

After preparing the datasets, we trained GNN and FCNN networks as metamodels to solve TSS using gradient descent. Before training, we scaled the inputs to [−1, 1] using the mappings x ↦ sin(2πx/120) and x ↦ cos(2πx/120), thus doubling the input size (actually, increasing the number of input channels). This is motivated by the periodicity of the problem: the neural network may learn that the offsets are periodic and that the values 0 and 120 correspond to the same setting, which can improve training [3, 4]. For the output, we used a standard scaler that shifts the mean to zero and divides the data by its standard deviation.

For each of the 3 considered road networks, we tested 9 different GNN architectures and 9 FCNN architectures. The 9 selected GNN architectures corresponded to all combinations of values from the following parameter sets:

• number of hidden GNN layers: 2, 3, 4 (not counting input and output layers);
• number of channels per layer: 3, 4, 5.

The activation function we used was tanh, indicated as superior to ReLU in preliminary experiments and in previous works [4]. For comparison, we also tested 9 FCNN architectures with the tanh activation function, covering all combinations of parameter values from the following sets:

• number of hidden layers: 2, 3, 4;
• number of neurons per layer: 20, 40, 100.

For each of the 3 datasets, we used the same 90/10 train/test split for each of the considered 18 hyperparameter settings (9 GNN architectures and 9 FCNN architectures). For each of the architectures, we ran the following procedure: 1.
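The periodic input encoding can be sketched as follows (a minimal illustration; `encode_offsets` is our own naming, not the authors'):

```python
import numpy as np

def encode_offsets(offsets, cycle=120):
    """Map integer offsets in {0, ..., cycle - 1} onto the unit circle via
    sin/cos channels, so that offsets 0 and 120 become literally the same
    input, and nearby offsets stay nearby across the cycle boundary."""
    x = 2.0 * np.pi * np.asarray(offsets, dtype=float) / cycle
    return np.stack([np.sin(x), np.cos(x)], axis=-1)

# A vector of n offsets becomes an (n, 2) array - i.e., two input channels.
enc = encode_offsets([0, 30, 60])
```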
Train a model on the training set for about 1100 epochs (concretely, minimize on 10^5 random mini-batches of size 997; 997 is the closest prime to 1000, and a prime was chosen to ensure better randomization, although it was not expected to have any real effect), using the Adam optimizer ([24]) and a learning rate of 0.0035.
2. Evaluate the trained model on the test set using the mean relative error with respect to the original outputs as the core metric.
3. Generate 100 gradient descent trajectories of the trained model output with respect to its inputs (in the original input space, backpropagating through the sin/cos transformation). Gradients were evaluated at inputs rounded in the original parameter space (our traffic simulator (TSF) accepts only integer inputs). The Nesterov updater ([25]) with a learning rate of 0.01 and momentum of 0.9 was used. Each trajectory had 3000 steps. This is similar to the approach used in [4].
4. Every 30 steps, transform the current trajectory point to the original parameter space, round it and send it to the TSF simulator. Save the inputs and the simulator outputs to a new 'simulation' test set.
5. Evaluate the trained model on the 'simulation' test set using various metrics (cf. the discussion in Section 5).

All the experiments were run on virtual machines in the Azure cloud (NC6 with NVIDIA Tesla K80 ([26])). The code used in the experiments can be found at ([27]). All of the models trained for the main experiment and all the out-of-sample simulation datasets can be found at [28]. The core dataset can be obtained at [23].

5. Experiment results

Table 1
Core results for the three best GNN and the three best FCNN architectures according to the accuracy (MAPE) on the test set (i.e., gradient descent results did not affect the selection of these models).

Measure                        Model   Ochota    Mokotów    Centrum
Min. MAPE on the test set      GNN     1.33%     0.76%      0.80%
                               FCNN    1.71%     0.94%      0.87%
Min. simulation output         GNN     32,205    265,129    63,606
                               FCNN    32,587    266,237    63,553
Avg. MAPE on the lowest        GNN     1.26%     0.53%      0.76%
5% sim. outputs                FCNN    5.35%     3.04%      2.49%
Avg. MAPE on the lowest        GNN     1.75%     0.84%      1.22%
10% sim. outputs               FCNN    4.53%     2.74%      2.25%
Avg. MAPE on the lowest        GNN     1.51%     1.00%      1.11%
15% sim. outputs               FCNN    4.65%     2.56%      2.04%

The key results of our experiments with GNNs are shown in Table 1, as well as in Figure 1, complemented by the tables and figures in Section 3 of the Supplementary Materials [19]. Table 1 shows a summary of the core performance measures, calculated for the three top GNNs and the three top FCNNs, ranked based on the average accuracy on the test set (MAPE). The core measures presented are:

• Minimum MAPE (mean absolute percentage error) on the test set. This number can be obtained before doing gradient descent. The minimum is taken among the 3 top-ranked GNNs or FCNNs (according to the row description). Because of the model selection criterion we use for Table 1, this minimum is global within the respective 9-element model universe (GNN or FCNN).
• Minimum simulation output obtained when doing gradient descent (note that while being interesting from a traffic optimization perspective, this measure lacks robustness, as it can be distorted by a single data point).
• Average MAPE on the x% (for x = 5, 10, 15) gradient descent trajectory ends, selected according to the corresponding simulator output value (sorted lowest first). The average is taken over the three selected models, GNN or FCNN, according to the row description.

First, let us note that the results in Table 1 show better performance of GNNs compared to FCNNs, particularly in terms of the minimum MAPE on the test set and the average MAPE on the lowest points from the gradient descent trajectories according to the simulation.
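For reference, the MAPE measure used for ranking models in Table 1 can be computed as follows (a minimal sketch, not the authors' evaluation code):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent:
    100 * mean(|y - y_hat| / |y|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Example: predictions 99 and 202 for true outputs 100 and 200 give a
# MAPE of (1% + 1%) / 2 = 1.0%.
```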
The improvement is visible for all 3 road maps (Ochota, Mokotów, Centrum) and all 5 core measures (with the exception of the minimum simulation output value obtained for Centrum, where one FCNN turned out to yield a slightly - less than 0.1% - lower result than all the GNNs). To summarize, the core improvement areas are:

• A much lower error on the test set.
• A lower minimum simulator output value obtained when doing the gradient descent (except for Centrum, which we can count as a draw).
• A much lower approximation error obtained on the trajectory ends corresponding to the 5%, 10% and 15% lowest simulator output values.

Figure 1, as well as similar figures for Ochota and Mokotów (see Section 3 of the Supplementary Materials [19]), shows the density of gradient descent trajectory points as heatmap plots. The horizontal axis corresponds to the gradient descent trajectory point number (recorded every 30 steps), and the vertical axis corresponds to the simulator output. Each trajectory had 3000 steps, but we recorded points every 30 steps. The plots show a heatmap of these points on the (point number, simulator output) plane. Thus, the more points in some area, the brighter the color. Also, if one architecture reaches a lower minimum than another, the resulting heatmap is taller. Besides confirming some of the quantitative conclusions from Table 1, the heatmaps also show that in many cases the gradient descent is less 'noisy' for GNNs, suggesting a smoother function surface, less prone to overfitting noise (this is best visible in the plots in Section 3 of [19]).

6. Consistency checks

The findings of the previous section call for some careful consistency checks before reaching final conclusions. In particular, it is not immediately clear that the actual adjacency matrix adds any value. Perhaps any similar graph, even one not related to the problem at hand, would do equally well.
To address that question, we fixed the number of layers to 3 and the number of channels per layer to 4 (for a GNN of type 1), and built our nets using random graphs with various degrees of resemblance to the true problem graph (we repeated this for all three road networks we considered). As a measure of graph similarity, we used the size of the symmetric difference between the sets of graph edges. The random graphs were generated in two ways. The first method (referred to later as 'Edge/Non-edge switching') used random edge insertions and deletions, with the desired value of the symmetric difference kept fixed. The second method (referred to later as 'Vertex label permutation') used random permutations of the vertex labels while keeping the connection graph structure exactly the same. Graphs generated by this method were isomorphic, but not identical, to the original one. It is worth mentioning that although the first method generates truly random graphs similar to the original one, the new graph might not represent a plausible road network. The second method, on the contrary, always keeps the same realistic road network graph structure, but it provides spurious insights to the training algorithm, as crossroads are switched.

Figure 1: Gradient descent trajectory density plots for Warsaw Ochota for the 3 best GNN and FCNN models. Horizontal axis corresponds to trajectory point number (recorded every 30 steps), vertical axis to simulator output value.

Results obtained by the two methods for Ochota, Mokotów and Centrum are shown in Figure 2. The plots show the mean relative error achieved on the test sets by neural nets based on random graphs, after roughly 330 epochs of training, as a function of the distance of the graph used for constructing the net from the true graph. The distance was measured using the symmetric difference between the respective edge sets.
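The graph distance and the two perturbation methods can be sketched as follows (our own minimal illustration with hypothetical helper names; edges are represented as a set of two-element frozensets):

```python
import random

def sym_diff_distance(edges_a, edges_b):
    """Graph distance used in the consistency checks: the number of edges
    in the symmetric difference of the two edge sets."""
    return len(edges_a ^ edges_b)

def permute_vertices(edges, perm):
    """'Vertex label permutation': relabel vertices, producing a graph
    isomorphic (but generally not identical) to the original."""
    return {frozenset(perm[v] for v in e) for e in edges}

def switch_edges(edges, n, k, rng):
    """'Edge/Non-edge switching' (sketch): delete k random edges and insert
    k random non-edges, yielding symmetric-difference distance 2*k."""
    all_pairs = {frozenset((i, j)) for i in range(n) for j in range(i + 1, n)}
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    non_edge_list = sorted(tuple(sorted(e)) for e in (all_pairs - edges))
    removed = {frozenset(e) for e in rng.sample(edge_list, k)}
    inserted = {frozenset(e) for e in rng.sample(non_edge_list, k)}
    return (edges - removed) | inserted

# Example: a 3-vertex path graph 0-1-2 inside a 4-vertex universe.
path = {frozenset({0, 1}), frozenset({1, 2})}
```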
As we can see, the median of the mean relative error, denoted with a red dot, grows almost monotonically as a function of the distance of the graph we use from the actual problem graph. This is visible for both graph sampling methods. The minimum average relative error attained for a particular value of the distance also grows, perhaps with a bit more noise.

Figure 2: Mean relative error achieved by a GNN on the test set after roughly 330 epochs of training, shown as a function of the distance of a random graph from the true one. In subfigure 2(a), edge/non-edge switching (described in the text) was used for generating random graphs; in subfigure 2(b), vertex label permutation was used. Red dots denote the median result. Error bars correspond to 5% quantiles.

7. Conclusions

We demonstrated the usefulness of sparsely connected neural networks, with sparsity based on an adjacency graph, in a problem from the traffic optimization domain. GNNs consistently outperformed FCNNs on fixed test sets for the three realistic road networks we considered (the Ochota, Mokotów and Centrum districts of Warsaw). More importantly, GNNs achieved approximation quality far superior to FCNNs near unseen simulator output value minima. By using randomly perturbed graphs, we also showed that the choice of the proper graph when constructing a GNN is important for achieving good results on a test set. The kind of NN sparsity considered in this paper, where only some of the connections are allowed, may be regarded as a kind of regularizer based on the problem graph. It is similar to L1 regularization of a fully connected neural network in that it keeps only some weights non-zero in the trained model. The resulting networks have far fewer parameters than analogous fully connected networks and turn out to generalize significantly better than any architecture we have considered so far for solving the TSS problem.
Acknowledgments

The presented research was supported by Microsoft's "AI for Earth" computational grant.

References

[1] P. Gora, K. Kurach, Approximating traffic simulation using neural networks and its application in traffic optimization, in: NIPS 2016 Workshop on Nonconvex Optimization for Machine Learning: Theory and Practice, 2016.
[2] C. Yang, Y. Yeh, The model and properties of the traffic light problem, in: Proc. of International Conference on Algorithms, 1996, pp. 19–26.
[3] P. Gora, M. Brzeski, M. Możejko, A. Klemenko, A. Kochański, Investigating performance of neural networks and gradient boosting models approximating microscopic traffic simulations in traffic optimization tasks, in: NeurIPS 2018 Workshop "Machine Learning for Intelligent Transportation Systems", 2018.
[4] M. Możejko, M. Brzeski, L. Madry, L. Skowronek, P. Gora, Traffic signal settings optimization using gradient descent, Schedae Informaticae 27 (2018).
[5] P. Gora, M. Bardoński, Training neural networks to approximate traffic simulation outcomes, in: 2017 5th IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), IEEE, 2017, pp. 889–894.
[6] J. Zhang, S. Chowdhury, J. Zhang, A. Messac, L. Castillo, Adaptive hybrid surrogate modeling for complex systems, AIAA J (2013) 643–656.
[7] D. Rijnen, J. Rhuggenaath, P. R. d. O. d. Costa, Y. Zhang, Machine learning based simulation optimisation for trailer management, in: 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), 2019, pp. 3687–3692. doi:10.1109/SMC.2019.8914329.
[8] R. D. Hurrion, A sequential method for the development of visual interactive meta-simulation models using neural networks, The Journal of the Operational Research Society 51 (2000) 712–719.
[9] R. R. Barton, M. Meckesheimer, Chapter 18: Metamodel-based simulation optimization, in: S. G. Henderson, B. L.
Nelson (Eds.), Simulation, volume 13 of Handbooks in Operations Research and Management Science, Elsevier, 2006, pp. 535–574.
[10] C. Osorio, M. Bierlaire, A surrogate model for traffic optimization of congested networks: an analytic queueing network approach, in: EPFL-REPORT-152480, 2009.
[11] A. Micheli, Neural network for graphs: A contextual constructive approach, IEEE Transactions on Neural Networks 20 (2009) 498–511.
[12] C. Zhuang, Q. Ma, Dual graph convolutional networks for graph-based semi-supervised classification, in: Proceedings of the 2018 World Wide Web Conference, WWW ’18, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2018, pp. 499–508. URL: https://doi.org/10.1145/3178876.3186116. doi:10.1145/3178876.3186116.
[13] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605. URL: http://www.jmlr.org/papers/v9/vandermaaten08a.html.
[14] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural networks, CoRR abs/1901.00596 (2019). URL: http://arxiv.org/abs/1901.00596. arXiv:1901.00596.
[15] Y. Li, R. Yu, C. Shahabi, Y. Liu, Graph convolutional recurrent neural network: Data-driven traffic forecasting, CoRR abs/1707.01926 (2017). URL: http://arxiv.org/abs/1707.01926. arXiv:1707.01926.
[16] B. Yu, H. Yin, Z. Zhu, Spatio-temporal graph convolutional neural network: A deep learning framework for traffic forecasting, CoRR abs/1709.04875 (2017). URL: http://arxiv.org/abs/1709.04875. arXiv:1709.04875.
[17] S. Guo, Y. Lin, N. Feng, C. Song, H. Wan, Attention based spatial-temporal graph convolutional networks for traffic flow forecasting, in: AAAI, 2019.
[18] H. Yao, F. Wu, J. Ke, X. Tang, Y. Jia, S. Lu, P. Gong, J. Ye, Z. Li, Deep multi-view spatial-temporal network for taxi demand prediction, CoRR abs/1802.08714 (2018). URL: http://arxiv.org/abs/1802.08714. arXiv:1802.08714.
[19] Supplementary materials, 2021. URL: https://drive.google.com/file/d/1sba_cunGhao4z4-loIfQYV7u4cXCdnWk/view?usp=sharing.
[20] T. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, 2017.
[21] P. Gora, Traffic Simulation Framework - a cellular automaton based tool for simulating and investigating real city traffic, in: Recent Advances in Intelligent Information Systems, 2009, pp. 641–653.
[22] OpenStreetMap, 2021. URL: https://www.openstreetmap.org.
[23] Dataset used for experiments, 2021. URL: https://drive.google.com/file/d/1aLUL3QPxGxeUVmqds6HWeGnVnQ5O4Mxr/view?usp=sharing.
[24] D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[25] Y. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k²), Doklady AN USSR 269 (1983) 543–547.
[26] Description of virtual machines used in experiments, 2021. URL: https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-gpu.
[27] Zipped repository of the code used in our experiments, 2021. URL: https://drive.google.com/file/d/1FF6q8GTJljkYjKSNMYsL5neoXbOqPIcm/view?usp=sharing.
[28] Models trained for the main experiment and all the out-of-sample simulation datasets, 2021. URL: https://drive.google.com/file/d/1mPnFt1Y1wGLGE-ha2_JiYsuJEebnfBYx/view?usp=sharing.