Generative adversarial networks to model air
pollution under uncertainty
                 Jamal Toutouh1 , Sergio Nesmachnow2 , and Diego Gabriel Rossit3
                 1
                   Massachusetts Institute of Technology, USA
                 2
                   Universidad de la Repblica, Uruguay
                 3
                   INMABB, DI, Universidad Nacional del Sur (UNS)-CONICET, Argentina
                 E-mail: toutouh@mit.edu, sergion@fing.edu.uy, diego.rossit@uns.edu.ar

                 Abstract. Urbanization trends worldwide show a clear preference for motorized road mobility,
                 which has led to a degradation of air quality in recent years. Modelling and forecasting ambient
                 air pollution is a relevant problem because it helps decision-makers and urban city planners
                 understand this phenomenon, which is a significant threat to citizens’ health. Generally, data-
                 driven models suffer from a lack of data. This article addresses the issue of having limited
                 access to road traffic density and pollution concentration data by applying deep generative
                 models, specifically, Conditional Generative Adversarial Networks (CGAN). The main idea is
                 to train CGANs to generate synthetic nitrogen dioxide concentration values given the road
                 traffic density. The experimental data analysis from Montevideo (Uruguay) shows that the
                 proposed method generates realistic (accurate and diverse) pollution data while using reduced
                 computational resources.


1. Introduction
The growth of the cities that prioritized motorized mobility (use of the individual or collective
vehicles) is having an undesired negative effect on the dwellers’ safety and quality of life.
A significant concern is the high generation of emissions (air pollutants) due to the rapid
development of car-oriented cities. Air pollution is a major concern because it has a negative
impact on the citizens’ health, e.g., it provokes several respiratory diseases and it reduces life
expectancy [1, 2].
   One of the most important sources of air pollutants in urban areas is road mobility [1, 2].
Thus, proposing mobility policies that reduce road traffic (i.e., the use of private vehicles) could
be an effective strategy to mitigate the generation of emissions and improve urban livability
and inhabitants’ health. However, it is not easy to understand the various phenomena that may
have implications for the production or dissipation of pollutants, e.g., weather or time of the day.
Even, the policy-makers may see these kinds of measure as a way to degrade the effectiveness
and efficiency of road transportation. For this reason, there have been different approaches to
evaluate the real impact of mobility policies on the air quality [3]. Thus, modelling, predicting,
and forecasting ambient air pollution allow policy-makers and urban city planners to provide
solutions to this issue.
Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
    Artificial Neural Networks (ANN) and Deep Learning (DL) are successfully applied to deal
with air outdoor pollution modelling, prediction, and forecasting, as data-driven methods [4].
On the one hand, the main advantage of this approach is that the use of ANNs does not require
an in-depth understanding of the physics and dynamics between air pollution concentration
levels and other explanatory variables. On the other hand, these kinds of approaches have a
set of open questions: the selection of the appropriate ANN model, the interpretation of the
results of that kind of black-box methods, and the results are problem-specific. Besides, it is a
significant matter that this kind of methods requires a vast amount of data to be trained [4].
    This article focuses on training generative models, as a data augmentation approach, to
produce new information units (levels of NO2 ) to feed data-driven ANN methods for modelling,
forecasting, and predicting outdoor pollution. Generative Adversarial Networks (GANs) are
successfully used to train generative models [5] to learn to represent an estimate of a data
distribution given by the training dataset. Thus, we propose the use of a specific type of GANs,
Conditional GANs (CGANs), to train generators able to create synthesized pollution data given
the road traffic volume. The real dataset used to train the CGANs is built by collecting the
levels of NO2 and road traffic volume gathered by sensors located in Montevideo (Uruguay).
Thus, the CGAN will produce new information units (levels of NO2 ) that approximate the
original training set. Notice that we are not trying to create a pollution forecasting method,
but a modelling one from training the generative models. A previous study applied a similar
approach to model pollution [6]. However, the authors did not consider the road traffic, which
is one of the primary sources of NO2 in urban environments.

2. Conditional GANs for pollution modeling
GANs consist of two artificial neural networks, a generator and a discriminator, that apply
adversarial learning. The generator is trained to deceive the discriminator by generating
“fake” or “artificial” data samples transforming its inputs from a random latent space. The
discriminator learns how to distinguish between the “real” and “artificial” data samples. GAN
training is formulated as a minimax optimization problem by the definitions of generator and
discriminator loss [5]. CGANs are an extension of GANs to deal with labeled training datasets
(structured in classes, i.e, each sample has a y label). The idea is to train generative models
able to create samples of a given class according to y.
   The general training of a CGAN is graphically described in Figure 1.


                          Figure 1. General CGANs training setups.
    The proposed approach applies a CGAN to train a generative model to get spatial pollution
data. The CGAN learns the probability distributions of the generated pollutant given the road
traffic density (given by a class from 1 to 5, from lowest to highest). Thus, the generative model
has the road traffic density class as an input and returns the predicted NO2 concentration, which
is a random value drawn from the probability distribution learned by the generator.
   As the CGAN training is defined as minimax optimization between the generator and the
discriminator, the training process may oscillate without converging to an equilibrium. Thus,
our method tracks the accuracy of the generator after each training epoch in terms of distance
between the real and the generated distributions, and keeps a copy of the best generator found.

3. Experimental setup
This section summarizes the main details of the methodology for training CGANs to create
synthetic pollution data.

3.1. Training dataset
The training dataset studied here is provided by the open data portal offered by the National
Government of Uruguay [7]. Specifically, the training dataset consists of NO2 concentration
in micrograms per cubic meter (µg/m3 ) and road traffic density in vehicles per hour gathered
in three different locations in Montevideo (i.e., Avda. Colon, Calle Tres Cruces, and Curva de
Maronas) during January–February 2020 (in which there are several periods without data). The
data is hourly averaged.
   The road traffic density is classified in five classes: A, B, C, D, and E, which represent
densities from very low to very high. Table 1 and Figure 2 summarizes the distributions of data
used as training dataset. As can be seen, the NO2 concentration increases with the road traffic
volume. The classes are highly unbalanced (see Table 1), i.e., A class has 1280 (the maximum)
and C class has 539 samples (the minimum). We randomly sampled over the classes for our
experiments to select 539 samples of each class to balance the dataset to avoid training biases.
Thus, the final training dataset size is 2695 (539×5).

                  traffic density class    mediean     iqr   number of samples
                  A - very low traffic       18.50   23.00               1280
                  B - low traffic            28.00   33.00                743
                  C - fluid flow traffic     32.00   41.75                539
                  D - high traffic           40.50   48.00                626
                  E - very high traffic      49.00   50.00                650

Table 1. Summary of the NO2 pollution distributions (in terms of median and interquartile
range) and the number of samples for each traffic density class.


     Figure 2. Training dataset, i.e., pollution data distribution given road traffic class.

3.2. CGAN design details
In our research, both ANNs, the generator and the discriminator, are implemented as multilayer
perceptrons (MLP) [8]. Our experiments explore the use of three different MLP architectures
for the generator and the discriminator in the proposed CGAN: Three different neural networks
architectures for the generator and the discriminator in the proposed CGAN: CGAN-1, three-
layer perceptron (256 neurons per hidden-layer); CGAN-2, four-layer perceptron (128 neurons
per hidden-layer); and CGAN-3, four-layer perceptron (256 neurons per hidden-layer). The
main idea is to analyze the impact on the efficacy and the efficiency of the CGAN training is
affected by the complexity of the ANNs (number of parameters).
   For all the CGANs evaluated, the generators’ input layer has size 65 (64 for reading the
random latent space and one for the road traffic density label). The output layer has size one
(for the predicted NO2 concentration). The input layer of the discriminator has size two (one
for the road traffic density label and one for the NO2 concentration), and the output layer has
size one (for the predicted label, i.e., fake or real ).

3.3. Metrics evaluated
The main goal of the experimental evaluation is to analyze the impact on the training, the
quality of the results, and the computational cost of the different CGAN architectures.
   The performance is evaluated according to the loss values computed for the generator and
the discriminator during the training process. In this research, the function applied to compute
the loss is the binary cross-entropy (BCE) [5]. Equations 1 and 2 present the discriminator and
generator loss, respectively.
           1                                   1
       Ld = Ex,y∼Pdata (x,y) [log(Dd (x, y))] − Ez∼Pz (z),y∼Py (y) [log(1 − Dd (Gg (z, y), y))] ,   (1)
           2                                   2
                          Lg = Ez∼Pz (z),y∼Py (y) [log(1 − Dd (Gg (z, y), y))]                      (2)
   In order to assess the quality of the generated samples, we evaluate the distance between the
real and syntezised data distributions (i.e., NO2 concentration values) for each road traffic
density class. We propose the use of the distance between distributions according to the
KolmogorovSmirnov statistical test. Thus, lower values indicate better sample quality.
   Finally, we also consider the computational time and the training epoch the generator created
the most accurate pollution distribution (iteration best found ) to evaluate the computational
cost of the proposed generative methods.

4. Experimental analysis
In order to perform the numerical analysis of our approach, the three CGAN approaches have
been configured with a learning rate of 0.0002, batch size of 25 samples (108 batches per training
epoch), and 500 training epochs. This section presents the results of performing 30 independent
runs for each CGAN training in National Supercomputing Center (Cluster-UY), Uruguay [9].
    Figure 3 illustrates the evolution of the loss of the generator and the discriminator during
the training process. The three approaches show similar behaviour. During the first 100 epochs,
both losses oscillate and increase. This behaviour is explained by the fact that the discriminator
is not trained enough to give the proper feedback required by the generator to learn. After that,
the discriminator starts becoming stronger and reduces the loss values. Thus, it is harder for
the generator to deceive the discriminator. For this reason, the generator begins increasing the
computed loss values, allowing the generator to learn how to create more accurate samples. At
the end of the training process, the generator and the discriminator have similar loss values,
which could indicate that they have reached an equilibrium.
    Table 2 summarizes the fake data distributions by showing the median and the interquartile
range (iqr) and Figure 4 illustrates the distributions. The median values of the NO2 generated
distributions are very close to the real ones. However, the diversity (in terms of iqr) of the
fake distributions is substantially lower than in real data. Figure 4 confirms that the fake
distributions are less dispersed than the real ones in Figure 2. This indicates that our methods
are able to capture the general behaviour of the NO2 concentration given the traffic class, but
it has some limitations to capture the whole real distribution. Different authors have proposed
methods to improve diversity in GAN training [10, 11, 12].
            (a) CGAN-1                         (b) CGAN-2                       (c) CGAN-3
       Figure 3. Discriminator and generator loss values through the training process.
                              real data          CGAN-1             CGAN-2            CGAN-3
  traffic density class
                            mediean     iqr     mediean iqr        mediean iqr       mediean iqr
  A - very low traffic        18.50   23.00        17.30    2.03     17.41   2.16      17.95    2.69
  B - low traffic             28.00   33.00       28.33     3.75     29.55   3.67       28.76   4.26
  C - fluid flow traffic      32.00   41.75        36.68    5.12     34.83   4.25      31.83    4.95
  D - high traffic            40.50   48.00        43.90    5.98    39.76    4.97       41.26   5.58
  E - very high traffic       49.00   50.00        51.26    6.48     52.72   6.46      49.70    7.84

Table 2. Summary of the NO2 pollution distributions synthesized by the computed generative
models in terms of median and interquartile range.


            (a) CGAN-1                         (b) CGAN-2                       (c) CGAN-3
Figure 4. Generated pollution distributions given road traffic class by the proposed CGANs.

   Table 3 reports the minimum (min), mean, and standard deviation (std) of the distance
between the synthesized data and the real data distributions (min distance), the computation
time of each independent run (computational time), and the training epoch the generator created
the most accurate pollution distribution (iteration best found ).

                      min distance             computational time             iteration best found
 CGAN type
                      min mean std              min    mean       std          min     mean        std
 CGAN-1              0.41   0.43 0.01         318.41 677.11 179.07            21.00 234.61 140.40
 CGAN-2              0.42   0.43 0.01          318.99 827.67 173.34           51.00 165.13 85.04
 CGAN-3              0.41   0.43 0.01          341.83 859.05 106.43          20.00 146.10 96.90
   Table 3. Experimental results in terms of best generator found and computational cost.
    In terms of the distance between the generated distributions and the real data (see Table 3),
the three proposed CGANs create accurate distributions with the same quality (non-significant
differences between them). This is in line with the results showed in Table 2 and Figure 4.
   Focusing on the computational cost, on the one hand, CGAN-1, the smallest model, shows the
shortest run times. On the other hand, CGAN-3 finds the best generators faster than the others.
These results are expected because when training ANNs with the same number of iterations, the
computational time increases with the ANN’s complexity (i.e., with the number of parameters
to train). In turn, bigger networks are able to capture more complex features requiring a lower
number of training epochs.

5. Conclusions and future work
We have proposed using CGANs to train generative models to create synthesized pollution data,
in this case, NO2 concentration, according to a given road data traffic density. The idea is to deal
with the lack of data suffered by data-driven methods for modelling, predicting, and forecasting
ambient air pollution.
    We have proposed three different CGANs according to the complexity of the ANNs
architectures used for the generator and the discriminator. The main results indicate that the
three proposed models generate accurate NO2 pollution while requiring a reduced computational
time. All the CGANs have shown robustness on the training because all the experiments
converged to accurate generators. However, these generative models are limited in terms of
producing diverse data samples.
    The main lines for future work are related to extend the proposed model to generate the
pollution of the whole city of Montevideo by taking into account information from more sensors,
defining the generative modelling problem taking into account other variables such as the weather
or the time, and applying the generated data to feed data-driven models to prove that they are
able to improve their accuracy after including fake samples.
Acknowledgements
This research was partially funded by European Unions Horizon 2020 research and innovation program
under the Marie Skodowska-Curie grant agreement No 799078, by the European Union H2020-ICT-2019-3
and UMA18-FEDERJA-003, and the Systems that Learn Initiative at MIT CSAIL.

References
 [1] Lebrusán I and Toutouh J 2020 Smart Cities 3 456–478
 [2] Lebrusán I and Toutouh J 2020 Air Quality, Atmosphere & Health 14(3) 333–342
 [3] Toutouh J, Lebrusán I and Nesmachnow S 2020 Computational Intelligence for Evaluating the Air Quality
       in the Center of Madrid, Spain International Conference on Optimization and Learning (Springer) pp
       115–127
 [4] Cabaneros S M, Calautit J K and Hughes B R 2019 Environmental Modelling & Software 119 285–304
 [5] Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A and Bengio Y 2014
       Generative adversarial nets Advances in neural information processing systems pp 2672–2680
 [6] Toutouh J 2021 Conditional generative adversarial networks to model urban outdoor air pollution Smart
       Cities ed Nesmachnow S and Hernández Callejo L (Cham: Springer International Publishing) pp 90–105
       ISBN 978-3-030-69136-3
 [7] Gobierno Abierto de AGESIC 2020 Catlogo Nacional de Datos Abiertos https://catalogodatos.gub.uy/
       Accessed: 2020-10-30
 [8] Hastie T, Tibshirani R and Friedman J 2009 The Elements of Statistical Learning (Springer New York)
 [9] Nesmachnow S and Iturriaga S 2019 Cluster-UY: Collaborative scientific high performance computing in
       uruguay Supercomputing pp 188–202
[10] Toutouh J, Hemberg E and O’Reilly U M 2019 Spatial evolutionary generative adversarial networks
       Proceedings of the Genetic and Evolutionary Computation Conference GECCO ’19 (New York, NY, USA:
       ACM) pp 472–480 ISBN 978-1-4503-6111-8
[11] Toutouh J, Hemberg E and O’Reily U M 2020 Re-purposing heterogeneous generative ensembles with
       evolutionary computation Proceedings of the 2020 Genetic and Evolutionary Computation Conference
       GECCO ’20 (New York, NY, USA: Association for Computing Machinery) p 425434 ISBN 9781450371285
[12] Toutouh J, Hemberg E and O’Reilly U M 2020 Data Dieting in GAN Training (Singapore: Springer
       Singapore) pp 379–400 ISBN 978-981-15-3685-4