=Paper= {{Paper |id=Vol-1885/159 |storemode=property |title=Evolution Strategies for Deep Neural Network Models Design |pdfUrl=https://ceur-ws.org/Vol-1885/159.pdf |volume=Vol-1885 |authors=Petra Vidnerová,Roman Neruda |dblpUrl=https://dblp.org/rec/conf/itat/VidnerovaN17 }} ==Evolution Strategies for Deep Neural Network Models Design== https://ceur-ws.org/Vol-1885/159.pdf
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 159–166
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 P. Vidnerová, R. Neruda



                   Evolution Strategies for Deep Neural Network Models Design

                                                   Petra Vidnerová, Roman Neruda

                                      Institute of Computer Science, The Czech Academy of Sciences
                                                            petra@cs.cas.cz

Abstract: Deep neural networks have recently become the state-of-the-art methods in many fields of machine learning. Still, there is no easy way to choose a network architecture, even though this choice can significantly influence the network performance.
   This work is a step towards automatic architecture design. We propose an algorithm for the optimization of a network architecture based on evolution strategies. The algorithm is inspired by and designed directly for the Keras library [3], which is one of the most common implementations of deep neural networks.
   The proposed algorithm is tested on the MNIST data set and on the prediction of air pollution based on sensor measurements, and it is compared to several fixed architectures and to support vector regression.

1 Introduction

Deep neural networks (DNNs) have become the state-of-the-art methods in many fields of machine learning in recent years. They have been applied to various problems, including image recognition, speech recognition, and natural language processing [8, 10].
   Deep neural networks are feed-forward neural networks with multiple hidden layers between the input and output layers. The layers typically have different units depending on the task at hand. Among the units, there are traditional perceptrons, where each unit (neuron) realizes a nonlinear function, such as the sigmoid function or the rectified linear unit (ReLU).
   While the learning of the weights of a deep neural network is done by algorithms based on stochastic gradient descent, the choice of architecture, including the number and sizes of layers and the type of activation function, is done manually by the user. However, the choice of architecture has an important impact on the performance of the DNN. Some expertise is needed, and usually a trial and error method is used in practice.
   In this work we explore a fully automatic design of deep neural networks. We investigate the use of evolution strategies for the evolution of a DNN architecture. There are not many studies on the evolution of DNNs, since such an approach has very high computational requirements. To keep the search space as small as possible, we simplify our model, focusing on the implementation of DNNs in the Keras library [3], which is a widely used tool for practical applications of DNNs.
   The proposed algorithm is evaluated both on benchmark and real-life data sets. As the benchmark data we use the MNIST data set, i.e. the classification of handwritten digits. The real data set is from the area of sensor networks for air pollution monitoring. The data come from De Vito et al. [21, 5] and are described in detail in Section 5.1.
   The paper is organized as follows. Section 2 brings an overview of related work. Section 3 briefly describes the main ideas of our approach. In Section 4 our algorithm based on evolution strategies is described. Section 5 summarizes the results of our experiments. Finally, Section 6 brings the conclusion.

2 Related Work

Neuroevolution techniques have been applied successfully to various machine learning problems [6]. In classical neuroevolution, no gradient descent is involved; both the architecture and the weights undergo the evolutionary process. However, because of the large computational requirements, the applications are limited to small networks.
   There were quite many attempts at architecture optimization via an evolutionary process (e.g. [19, 1]) in previous decades. Successful evolutionary techniques evolving the structure of feed-forward and recurrent neural networks include the NEAT [18], HyperNEAT [17] and CoSyNE [7] algorithms.
   On the other hand, studies dealing with the evolution of deep neural networks and convolutional networks started to emerge only very recently. The training of one DNN usually requires hours or days of computing time, quite often utilizing GPU processors for speedup. Naturally, evolutionary techniques requiring thousands of training trials were not considered a feasible choice. Nevertheless, there are several approaches to reduce the overall complexity of neuroevolution for DNNs. Still, due to limited computational resources, the studies usually focus only on parts of the network design.
   For example, in [12] CMA-ES is used to optimize hyperparameters of DNNs. In [9] unsupervised convolutional networks for vision-based reinforcement learning are studied; the structure of the CNN is held fixed and only a small recurrent controller is evolved. However, the recent paper [16] presents a simple distributed evolutionary strategy that is used to train relatively large recurrent networks with competitive results on reinforcement learning tasks.
   In [14] an automated method for optimizing deep learning architectures through evolution is proposed, extending existing neuroevolution methods. The authors of [4] sketch a genetic approach for evolving a deep autoencoder network, enhancing the sparsity of the synapses by means of special operators. Finally, the paper [13] presents two versions of an evolutionary and co-evolutionary algorithm for the design of DNNs with various transfer functions.

3 Our Approach

In our approach we use evolution strategies to search for an optimal architecture of a DNN, while the weights are learned by a gradient based technique.
   The main idea of our approach is to keep the search space as small as possible; therefore the architecture specification is simplified. It directly follows the implementation of DNNs in the Keras library, where networks are defined layer by layer, each layer fully connected with the next one. A layer is specified by the number of neurons, the type of activation function (all neurons in one layer have the same type of activation function), and the type of regularization (such as dropout).
   In this paper we work only with fully connected feed-forward neural networks, but the approach can be further modified to also include convolutional layers. Then the architecture specification would also contain the type of layer (dense or convolutional) and, in the case of a convolutional layer, the size of the filter.

4 Evolution Strategies for DNN Design

Evolution strategies (ES) were proposed for work with real-valued vectors representing parameters of complex optimization problems [2]. In the illustrative algorithm below we can see a simple ES working with n individuals in a population and generating m offspring by means of Gaussian mutation. The environmental selection has two traditional forms for evolution strategies. The so-called (n + m)-ES generates the new generation by deterministically choosing the n best individuals from the set of (n + m) parents and offspring. The so-called (n, m)-ES generates the new generation by selecting only from the m new offspring (typically, m > n). The latter approach is considered more robust against premature convergence to local optima.
   Currently used evolution strategies may carry more meta-parameters of the problem in the individual than just a vector of mutation variances. A successful version of evolution strategies, the so-called covariance matrix adaptation ES (CMA-ES) [12], uses a clever strategy to approximate the full N × N covariance matrix, thus representing a general N-dimensional normal distribution. A crossover operator is usually used within evolution strategies.
   In our implementation the (n, m)-ES (see Alg. 1) is used. Offspring are generated using both mutation and crossover operators. Since our individuals describe a network topology, they are not vectors of real numbers, so our operators slightly differ from those of classical ES. A more detailed description follows.

Algorithm 1 (n, m)-Evolution strategy optimizing a real-valued vector and utilizing an adaptive variance for each parameter

  procedure (n, m)-ES
      t ← 0
      Initialize the population P_t with n randomly generated vectors ~x^t = (x_1^t, ..., x_N^t, σ_1^t, ..., σ_N^t)
      Evaluate the individuals in P_t
      while not terminating criterion do
          for i ← 1, ..., m do
              choose randomly a parent ~x_i^t and generate an offspring ~y_i^t by Gaussian mutation:
              for j ← 1, ..., N do
                  σ'_j ← σ_j · (1 + α · N(0, 1))
                  x'_j ← x_j + σ'_j · N(0, 1)
              end for
              insert ~y_i^t into the offspring candidate population P'_t
          end for
          Deterministically choose P_{t+1} as the n best individuals from P'_t
          Discard P_t and P'_t
          t ← t + 1
      end while
  end procedure

4.1 Individuals

Individuals encode feed-forward neural networks implemented as the Keras model Sequential. A Sequential model is built layer by layer; similarly, an individual consists of blocks representing the individual layers:

      I = ( [size_1, drop_1, act_1, σ_1^size, σ_1^drop], ..., [size_H, drop_H, act_H, σ_H^size, σ_H^drop] ),

where H is the number of hidden layers, size_i is the number of neurons in the corresponding layer, which is a dense (fully connected) layer, drop_i is the dropout rate (a zero value represents no dropout), act_i ∈ {relu, tanh, sigmoid, hardsigmoid, linear} stands for the activation function, and σ_i^size and σ_i^drop are the strategy coefficients corresponding to the size and dropout.
   So far we work only with dense layers, but the individual can be further generalized to work with convolutional layers as well. Other types of regularization can also be considered; we are limited to dropout for the first experiments.

4.2 Crossover

The crossover operator combines two parent individuals and produces two offspring individuals. It is implemented
as one-point crossover, where the cross-point is on the border of a block.
   Let two parents be

      I_p1 = (B_1^p1, B_2^p1, ..., B_k^p1)
      I_p2 = (B_1^p2, B_2^p2, ..., B_l^p2),

then the crossover produces the offspring

      I_o1 = (B_1^p1, ..., B_c1^p1, B_{c2+1}^p2, ..., B_l^p2)
      I_o2 = (B_1^p2, ..., B_c2^p2, B_{c1+1}^p1, ..., B_k^p1),

where the cross-points satisfy c1 ∈ {1, ..., k − 1} and c2 ∈ {1, ..., l − 1}.

4.3 Mutation

The mutation operator brings random changes to an individual. Each time an individual is mutated, one of the following mutation operators is randomly chosen:

   • mutateLayer – introduces random changes to one randomly selected layer. One of the following operators is randomly chosen:
        – changeLayerSize – the number of neurons is changed. Gaussian mutation is used, adapting the strategy parameters σ^size; the final number is rounded (since the size has to be an integer).
        – changeDropOut – the dropout rate is changed using Gaussian mutation, adapting the strategy parameters σ^drop.
        – changeActivation – the activation function is changed, randomly chosen from the list of available activations.
   • addLayer – one randomly generated block is inserted at a random position.
   • delLayer – one randomly selected block is deleted.

   Note that the ES-like mutation comes into play only when the size of a layer or the dropout parameter is changed. Otherwise the strategy parameters are ignored.

4.4 Fitness

The fitness function should reflect the quality of the network represented by an individual. To assess the generalization ability of the network represented by the individual, we use the crossvalidation error. The lower the crossvalidation error, the higher the fitness of the individual.
   Classical k-fold crossvalidation is used, i.e. the training set is split into k folds, and each time one fold is used for testing and the rest for training. The mean error on the testing folds over the k runs is evaluated.
   The mean squared error is used as the error function:

      E = 100 · (1/N) Σ_{t=1}^{N} (f(x_t) − y_t)²,

where T = ((x_1, y_1), ..., (x_N, y_N)) is the actual testing set and f is the function represented by the learned network.

4.5 Selection

Tournament selection is used, i.e. in each turn of the tournament k individuals are selected at random and the one with the highest fitness, in our case the one with the lowest crossvalidation error, is selected.
   Our implementation of the proposed algorithm is available at [20].

5 Experiments

5.1 Data Set

For the first experiment we used real-world data from the application area of sensor networks for air pollution monitoring [21, 5]; for the second experiment, the well known MNIST data set [11].
   The sensor data contain tens of thousands of measurements of gas multi-sensor MOX array devices recording concentrations of several gas pollutants, collocated with a conventional air pollution monitoring station that provides labels for the data. The data are recorded in 1 hour intervals, and there is quite a large number of gaps due to sensor malfunctions. For our experiments we have chosen data from the interval of March 10, 2004 to April 4, 2005, taking into account each hour, where records with missing values were omitted. There are altogether 5 sensors as inputs and 5 target output values representing concentrations of CO, NO2, NOx, C6H6, and NMHC.
   The whole time period is divided into five intervals. Then, only one interval is used for training and the rest is utilized for testing. We considered five different choices of the training part. This task may be quite difficult, since the prediction is performed also in different parts of the year than the learning; e.g. a model trained on data obtained during winter may perform worse during summer (as was suggested by experts in the application area).
   Table 1 brings an overview of the data set sizes. All tasks have 8 input values (five sensors, temperature, absolute and relative humidity) and 1 output (the predicted value). All values are normalized to the interval ⟨0, 1⟩.

              Table 1: Overview of data set sizes.

                  Task       train set   test set
                  CO           1469       5875
                  NO2          1479       5914
                  NOx          1480       5916
                  C6H6         1799       7192
                  NMHC          178        709

   The MNIST data set contains 70 000 images of handwritten digits, 28 × 28 pixels each (see Fig. 1). 60 000 are used for training, 10 000 for testing.
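The variation and selection operators of Sections 4.2–4.5 can be sketched in Python as follows. This is a simplified, self-contained illustration with invented names and parameters (e.g. the self-adaptation rate ALPHA and the block tuple layout are our choices); the authors' actual implementation is available at [20].

```python
import random

# Hypothetical encoding of one layer block, following Section 4.1:
# (size, dropout, activation, sigma_size, sigma_drop).
ACTIVATIONS = ["relu", "tanh", "sigmoid", "hard_sigmoid", "linear"]
ALPHA = 0.1  # self-adaptation rate for the strategy parameters (our choice)

def crossover(parent1, parent2, rng):
    """One-point crossover with cross-points on block borders (Section 4.2)."""
    c1 = rng.randint(1, len(parent1) - 1)
    c2 = rng.randint(1, len(parent2) - 1)
    return parent1[:c1] + parent2[c2:], parent2[:c2] + parent1[c1:]

def change_layer_size(block, rng):
    """changeLayerSize mutation: adapt sigma_size, then apply a Gaussian
    mutation to the layer size and round it to an integer (Section 4.3)."""
    size, drop, act, s_size, s_drop = block
    s_size = abs(s_size * (1.0 + ALPHA * rng.gauss(0.0, 1.0)))
    size = max(1, round(size + s_size * rng.gauss(0.0, 1.0)))
    return (size, drop, act, s_size, s_drop)

def tournament_select(population, cv_errors, k, rng):
    """Tournament selection: sample k individuals and return the one with
    the lowest crossvalidation error, i.e. the highest fitness (Section 4.5)."""
    contestants = rng.sample(range(len(population)), k)
    return population[min(contestants, key=lambda i: cv_errors[i])]
```

Note that, as in the paper, the offspring of the crossover may have a different number of layers than either parent, and the strategy parameters travel with their blocks.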
        Figure 1: Example of MNIST data set samples.

5.2 Setup

For the sensor data the proposed algorithm was run for 100 generations for each data set, with n = 10 and m = 30. During the fitness function evaluation the network weights are trained by RMSprop (one of the standard algorithms) for 500 epochs. Besides the ES, a classical GA was implemented and run on the sensor data with the same fitness function.
   For the MNIST data set, the algorithm was run for 30 generations, with n = 5 and m = 10; for fitness evaluation the RMSprop was run for 20 epochs.
   When the best individual is obtained, the corresponding network is built, trained on the whole training set, and evaluated on the test set.

5.3 Results

The resulting testing errors obtained by the GA and ES in the first experiment are listed in Table 3. The average, standard deviation, minimum and maximum errors over 10 computations are reported. The performance of the ES is slightly better than that of the GA; the ES achieved lower errors in 15 cases, the GA in 11 cases.
   Table 4 compares the ES testing errors to results obtained by support vector regression (SVR) with linear, RBF, polynomial, and sigmoid kernel functions. SVR was trained using the Scikit-learn library [15]; the hyperparameters were found using grid search and crossvalidation.
   The ES outperforms the SVR; it found the best results in 17 cases.
   Finally, Table 5 compares the testing error of the evolved network to the errors of three fixed architectures (for example, 30-10-1 stands for 2 hidden layers of 30 and 10 neurons and one neuron in the output layer; ReLU activation is used and dropout 0.2). The evolved network achieved the most (10) best results.
   Since this task does not have many training samples, the evolved networks are also quite small. The typical evolved network had one hidden layer of about 70 neurons, dropout rate 0.3, and the ReLU activation function.
   The second experiment was the classification of MNIST digits. As the baseline architecture we took the one from the Keras examples, i.e. a network with two hidden layers of 512 ReLU units each, both with dropout 0.2. This network has a fairly good performance. It was trained 10 times and the results are listed in Table 2, together with the results obtained by the evolved network.

           Table 2: Test accuracies on the MNIST data set.

                  model               avg     std    min      max
                  baseline          98.34    0.13   98.18    98.55
                  evolved by ES     98.64    0.05   98.55    98.73

   The evolved network also had two hidden layers, the first with 736 ReLU units and dropout parameter 0.09, the second with 471 hard sigmoid units and dropout 0.2. The ES found a competitive result; the evolved network achieved a better accuracy than the baseline model.

6 Conclusion

We have proposed an algorithm for the automatic design of DNNs based on evolution strategies. The algorithm was tested in experiments on the real-life sensor data set and the MNIST data set of handwritten digits. On the sensor data set, the solutions found by our algorithm outperform SVR and the selected fixed architectures. The activation function dominating in the solutions is the ReLU function. For the MNIST data set, a network with ReLU and hard sigmoid units was found, outperforming the baseline solution. We have shown that our algorithm is able to find competitive solutions.
   The main limitation of the algorithm is its time complexity. One direction of our future work is to try to lower the number of fitness evaluations using surrogate modeling, or to use asynchronous evolution.
   We also plan to extend the algorithm to work with convolutional networks and to include more parameters, such as other types of regularization, the type of optimization algorithm, etc.
   The gradient based optimization algorithm depends significantly on the random initialization of the weights. One way to overcome this is to combine the evolution of weights with gradient based local search, which is another possibility for future work.

Acknowledgment

This work was partially supported by the Czech Grant Agency grant 15-18108S and institutional support of the Institute of Computer Science RVO 67985807.
   Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
Evolution Strategies for Deep Neural Network Models Design                                                                163




Table 3: Errors on the test set for networks found by GA and ES. The average, standard deviation, minimum, and maximum over 10 runs of the learning algorithm are listed.

                                                          GA                             ES
                                             avg       std   min    max      avg      std   min    max
                       CO part1            0.209    0.014 0.188    0.236   0.229   0.026 0.195    0.267
                       CO part2            0.801    0.135 0.600    1.048   0.657   0.024 0.631    0.694
                       CO part3            0.266    0.029 0.222    0.309   0.256   0.045 0.199    0.349
                       CO part4            0.404    0.226 0.186    0.865   0.526   0.108 0.308    0.701
                       CO part5            0.246    0.024 0.207    0.286   0.235   0.025 0.199    0.277
                       NOx part1           2.201    0.131 1.994    2.506   2.132   0.086 2.021    2.284
                       NOx part2           1.705    0.284 1.239    2.282   1.599   0.077 1.444    1.685
                       NOx part3           1.238    0.163 0.982    1.533   1.339   0.242 1.106    1.955
                       NOx part4           1.490    0.173 1.174    1.835   1.610   0.164 1.435    2.041
                       NOx part5           0.551    0.052 0.456    0.642   0.622   0.075 0.521    0.726
                       NO2 part1           1.697    0.266 1.202    2.210   1.506   0.217 1.132    1.823
                       NO2 part2           2.009    0.415 1.326    2.944   1.371   0.048 1.242    1.415
                       NO2 part3           0.593    0.082 0.532    0.815   0.660   0.078 0.599    0.863
                       NO2 part4           0.737    0.023 0.706    0.776   0.782   0.043 0.711    0.856
                       NO2 part5           1.265    0.158 1.054    1.580   0.730   0.111 0.520    0.905
                       C6H6 part1          0.013    0.005 0.006    0.024   0.013   0.004 0.007    0.018
                       C6H6 part2          0.039    0.015 0.025    0.079   0.034   0.010 0.020    0.050
                       C6H6 part3          0.019    0.011 0.009    0.041   0.048   0.015 0.016    0.075
                       C6H6 part4          0.030    0.015 0.014    0.061   0.020   0.010 0.010    0.042
                       C6H6 part5          0.017    0.015 0.004    0.051   0.027   0.011 0.014    0.051
                       NMHC part1          1.719    0.168 1.412    2.000   1.685   0.256 1.448    2.378
                       NMHC part2          0.623    0.164 0.446    1.047   0.713   0.097 0.566    0.865
                       NMHC part3          1.144    0.181 0.912    1.472   1.097   0.270 0.775    1.560
                       NMHC part4          1.220    0.206 0.994    1.563   1.099   0.166 0.898    1.443
                       NMHC part5          1.222    0.126 1.055    1.447   1.023   0.050 0.963    1.116
                     best result:      GA 11 of 25 (44%)      ES 15 of 25 (60%)
164                                                                                                            P. Vidnerová, R. Neruda




Table 4: Test errors for the evolved network and for SVR with different kernel functions. For the evolved network, the average, standard deviation, minimum, and maximum over 10 runs of the learning algorithm are listed.

                     Task                    Evolved network                             SVR
                                       avg       std    min  max           linear    RBF Poly.      Sigmoid
                     CO_part1        0.229    0.026 0.195 0.267            0.340    0.280 0.285        1.533
                     CO_part2        0.657    0.024 0.631 0.694            0.614    0.412 0.621        1.753
                     CO_part3        0.256    0.045 0.199 0.349            0.314    0.408 0.377        1.427
                     CO_part4        0.526    0.108 0.308 0.701            1.127    0.692 0.535        1.375
                     CO_part5        0.235    0.025 0.199 0.277            0.348    0.207 0.198        1.568
                     NOx_part1       2.132    0.086 2.021 2.284            1.062    1.447 1.202        2.537
                     NOx_part2       1.599    0.077 1.444 1.685            2.162    1.838 1.387        2.428
                     NOx_part3       1.339    0.242 1.106 1.955            0.594    0.674 0.665        2.705
                     NOx_part4       1.610    0.164 1.435 2.041            0.864    0.903 0.778        2.462
                     NOx_part5       0.622    0.075 0.521 0.726            1.632    0.730 1.446        2.761
                     NO2_part1       1.506    0.217 1.132 1.823            2.464    2.404 2.401        2.636
                     NO2_part2       1.371    0.048 1.242 1.415            2.118    2.250 2.409        2.648
                     NO2_part3       0.660    0.078 0.599 0.863            1.308    1.195 1.213        1.984
                     NO2_part4       0.782    0.043 0.711 0.856            1.978    2.565 1.912        2.531
                     NO2_part5       0.730    0.111 0.520 0.905            1.077    1.047 0.967        2.129
                     C6H6_part1      0.013    0.004 0.007 0.018            0.300    0.511 0.219        1.398
                     C6H6_part2      0.034    0.010 0.020 0.050            0.378    0.489 0.369        1.478
                     C6H6_part3      0.048    0.015 0.016 0.075            0.520    0.663 0.538        1.317
                     C6H6_part4      0.020    0.010 0.010 0.042            0.217    0.459 0.123        1.279
                     C6H6_part5      0.027    0.011 0.014 0.051            0.215    0.297 0.188        1.526
                     NMHC_part1      1.685    0.256 1.448 2.378            1.718    1.666 1.621        3.861
                     NMHC_part2      0.713    0.097 0.566 0.865            0.934    0.978 0.839        3.651
                     NMHC_part3      1.097    0.270 0.775 1.560            1.580    1.280 1.438        2.830
                     NMHC_part4      1.099    0.166 0.898 1.443            1.720    1.565 1.917        2.715
                     NMHC_part5      1.023    0.050 0.963 1.116            1.238    0.944 1.407        2.960
           best result:  evolved 17 of 25 (68%)    linear 2 (8%)   RBF 2 (8%)   Poly. 4 (16%)   Sigmoid 0 (0%)
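The win counts and percentages in the last rows of Tables 3–5 are obtained by awarding each task to the method with the lowest average test error. A minimal sketch reproducing the tally on three sample rows of Table 4 (values copied from the table; the `rows` dictionary and method names are illustrative):

```python
# Average test errors for three Table 4 tasks; column order:
# evolved network, SVR linear, SVR RBF, SVR polynomial, SVR sigmoid.
rows = {
    "CO_part1":  [0.229, 0.340, 0.280, 0.285, 1.533],
    "CO_part2":  [0.657, 0.614, 0.412, 0.621, 1.753],
    "NOx_part1": [2.132, 1.062, 1.447, 1.202, 2.537],
}
methods = ["evolved", "linear", "rbf", "poly", "sigmoid"]

# Award each task to the method with the lowest average error.
wins = {m: 0 for m in methods}
for task, errors in rows.items():
    best = min(range(len(errors)), key=errors.__getitem__)
    wins[methods[best]] += 1

for m in methods:
    print(m, wins[m], f"{100 * wins[m] / len(rows):.0f}%")
```

Applied to all 25 rows of each table, this yields the counts shown above (ties, such as C6H6 part1 in Table 3, are counted for both methods).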




                          Table 5: Test errors for the evolved network and three selected fixed architectures.

                   Task              Evolved network             50-1              30-10-1            30-10-30-1
                                       avg        std          avg    std         avg      std         avg     std
                   CO_part1          0.229     0.026         0.230 0.032        0.250 0.023          0.377 0.103
                   CO_part2          0.657     0.024         0.861 0.136        0.744 0.142          0.858 0.173
                   CO_part3          0.256     0.045         0.261 0.040        0.305 0.043          0.302 0.046
                   CO_part4          0.526     0.108         0.621 0.279        0.638 0.213          0.454 0.158
                   CO_part5          0.235     0.025         0.283 0.072        0.270 0.032          0.309 0.032
                   NOx_part1         2.132     0.086         2.158 0.203        2.095 0.131          2.307 0.196
                   NOx_part2         1.599     0.077         1.799 0.313        1.891 0.199          2.083 0.172
                   NOx_part3         1.339     0.242         1.077 0.125        1.092 0.178          0.806 0.185
                   NOx_part4         1.610     0.164         1.303 0.208        1.797 0.461          1.600 0.643
                   NOx_part5         0.622     0.075         0.644 0.075        0.677 0.055          0.778 0.054
                   NO2_part1         1.506     0.217         1.659 0.250        1.368 0.135          1.677 0.233
                   NO2_part2         1.371     0.048         1.762 0.237        1.687 0.202          1.827 0.264
                   NO2_part3         0.660     0.078         0.682 0.148        0.576 0.044          0.603 0.069
                   NO2_part4         0.782     0.043         1.109 0.923        0.757 0.059          0.802 0.076
                   NO2_part5         0.730     0.111         0.646 0.064        0.734 0.107          0.748 0.123
                   C6H6_part1        0.013     0.004         0.012 0.006        0.081 0.030          0.190 0.060
                   C6H6_part2        0.034     0.010         0.039 0.012        0.101 0.015          0.211 0.071
                   C6H6_part3        0.048     0.015         0.024 0.007        0.091 0.047          0.115 0.031
                   C6H6_part4        0.020     0.010         0.026 0.010        0.051 0.026          0.096 0.020
                   C6H6_part5        0.027     0.011         0.025 0.008        0.113 0.025          0.176 0.058
                   NMHC_part1        1.685     0.256         1.738 0.144        1.889 0.119          2.378 0.208
                   NMHC_part2        0.713     0.097         0.553 0.045        0.650 0.078          0.799 0.096
                   NMHC_part3        1.097     0.270         1.128 0.089        0.901 0.124          0.789 0.184
                   NMHC_part4        1.099     0.166         1.116 0.119        0.918 0.119          0.751 0.096
                   NMHC_part5        1.023     0.050         0.970 0.094        0.889 0.085          0.856 0.074
          best result:  evolved 10 of 25 (40%)    50-1: 6 (24%)    30-10-1: 4 (16%)    30-10-30-1: 5 (20%)
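The fixed-architecture labels in Table 5 (50-1, 30-10-1, 30-10-30-1) can be expanded mechanically: every number before the last is a hidden dense layer followed by dropout, and the last number is the output layer. A minimal sketch in plain Python, with a hypothetical `parse_arch` helper (the paper itself builds such models as Keras networks):

```python
def parse_arch(spec, activation="relu", dropout=0.2):
    """Expand a spec like '30-10-1' into a list of layer configurations.

    Hypothetical helper for illustration: every size except the last
    becomes a hidden dense layer followed by a dropout layer; the last
    size is the linear output layer, as in the regression tasks.
    """
    sizes = [int(s) for s in spec.split("-")]
    layers = []
    for units in sizes[:-1]:
        layers.append({"type": "dense", "units": units, "activation": activation})
        layers.append({"type": "dropout", "rate": dropout})
    layers.append({"type": "dense", "units": sizes[-1], "activation": "linear"})
    return layers


# The 30-10-1 architecture from Table 5: two hidden layers (30 and 10
# ReLU units, each with dropout 0.2) and one linear output neuron.
print(parse_arch("30-10-1"))
```

Under the same reading, the typical evolved network for this task (one hidden layer of about 70 neurons, dropout 0.3) would correspond roughly to `parse_arch("70-1", dropout=0.3)`.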

References

 [1] Jasmina Arifovic and Ramazan Gençay. Using genetic algorithms to select architecture of a feedforward artificial neural network. Physica A: Statistical Mechanics and its Applications, 289(3–4):574–594, 2001.
 [2] H.-G. Beyer and H.-P. Schwefel. Evolution strategies: A comprehensive introduction. Natural Computing, pages 3–52, 2002.
 [3] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
 [4] Omid E. David and Iddo Greental. Genetic algorithms for evolving deep neural networks. In Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO Comp ’14, pages 1451–1452, New York, NY, USA, 2014. ACM.
 [5] S. De Vito, G. Fattoruso, M. Pardo, F. Tortorella, and G. Di Francia. Semi-supervised learning techniques in artificial olfaction: A novel approach to classification problems and drift counteraction. IEEE Sensors Journal, 12(11):3215–3224, November 2012.
 [6] Dario Floreano, Peter Dürr, and Claudio Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62, 2008.
 [7] Faustino Gomez, Juergen Schmidhuber, and Risto Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, pages 937–965, 2008.
 [8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 [9] Jan Koutník, Juergen Schmidhuber, and Faustino Gomez. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 541–548, New York, NY, USA, 2014. ACM.
[10] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[11] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 2012.
[12] Ilya Loshchilov and Frank Hutter. CMA-ES for hyperparameter optimization of deep neural networks. CoRR, abs/1604.07269, 2016.
[13] Tomas H. Maul, Andrzej Bargiela, Siang-Yew Chong, and Abdullahi S. Adamu. Towards evolutionary deep neural networks. In Flaminio Squazzoni, Fabio Baronio, Claudia Archetti, and Marco Castellani, editors, ECMS 2014 Proceedings. European Council for Modeling and Simulation, 2014.
[14] Risto Miikkulainen, Jason Zhi Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. CoRR, abs/1703.00548, 2017.
[15] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[16] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. ArXiv e-prints, March 2017.
[17] Kenneth O. Stanley, David B. D’Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, April 2009.
[18] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
[19] B. U. Islam, Z. Baharudin, M. Q. Raza, and P. Nallagownden. Optimization of neural network architecture using genetic algorithm for load forecasting. In 2014 5th International Conference on Intelligent and Advanced Systems (ICIAS), pages 1–6, June 2014.
[20] Petra Vidnerová. GAKeras. github.com/PetraVidnerova/GAKeras, 2017.
[21] S. De Vito, E. Massera, M. Piga, L. Martinotto, and G. Di Francia. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129(2):750–757, 2008.