=Paper=
{{Paper
|id=Vol-1885/159
|storemode=property
|title=Evolution Strategies for Deep Neural Network Models Design
|pdfUrl=https://ceur-ws.org/Vol-1885/159.pdf
|volume=Vol-1885
|authors=Petra Vidnerová,Roman Neruda
|dblpUrl=https://dblp.org/rec/conf/itat/VidnerovaN17
}}
==Evolution Strategies for Deep Neural Network Models Design==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 159–166
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, c 2017 P. Vidnerová, R. Neruda
Evolution Strategies for Deep Neural Network Models Design
Petra Vidnerová, Roman Neruda
Institute of Computer Science, The Czech Academy of Sciences
petra@cs.cas.cz
Abstract: Deep neural networks have become the state-of-the-art methods in many fields of machine learning in recent years. Still, there is no easy way to choose a network architecture, which can significantly influence the network performance.

This work is a step towards automatic architecture design. We propose an algorithm for the optimization of a network architecture based on evolution strategies. The algorithm is inspired by and designed directly for the Keras library [3], which is one of the most common implementations of deep neural networks.

The proposed algorithm is tested on the MNIST data set and on the prediction of air pollution based on sensor measurements, and it is compared to several fixed architectures and to support vector regression.

1 Introduction

Deep neural networks (DNN) have become the state-of-the-art methods in many fields of machine learning in recent years. They have been applied to various problems, including image recognition, speech recognition, and natural language processing [8, 10].

Deep neural networks are feed-forward neural networks with multiple hidden layers between the input and output layer. The layers typically have different units depending on the task at hand. Among the units, there are traditional perceptrons, where each unit (neuron) realizes a nonlinear function, such as the sigmoid function, or the rectified linear unit (ReLU).

While the learning of the weights of a deep neural network is done by algorithms based on stochastic gradient descent, the choice of architecture, including the number and sizes of layers and the type of activation function, is done manually by the user. However, the choice of architecture has an important impact on the performance of the DNN. Some kind of expertise is needed, and a trial-and-error method is usually used in practice.

In this work we pursue a fully automatic design of deep neural networks. We investigate the use of evolution strategies for the evolution of a DNN architecture. There are not many studies on the evolution of DNNs, since such an approach has very high computational requirements. To keep the search space as small as possible, we simplify our model, focusing on the implementation of DNNs in the Keras library [3], which is a widely used tool for practical applications of DNNs.

The proposed algorithm is evaluated both on benchmark and real-life data sets. As the benchmark data we use the MNIST data set, i.e. the classification of handwritten digits. The real data set is from the area of sensor networks for air pollution monitoring. The data come from De Vito et al. [21, 5] and are described in detail in Section 5.1.

The paper is organized as follows. Section 2 brings an overview of related work. Section 3 briefly describes the main ideas of our approach. In Section 4 our algorithm based on evolution strategies is described. Section 5 summarizes the results of our experiments. Finally, Section 6 brings the conclusion.

2 Related Work

Neuroevolution techniques have been applied successfully to various machine learning problems [6]. In classical neuroevolution, no gradient descent is involved; both the architecture and the weights undergo the evolutionary process. However, because of large computational requirements, the applications are limited to small networks.

There were quite many attempts at architecture optimization via an evolutionary process (e.g. [19, 1]) in previous decades. Successful evolutionary techniques evolving the structure of feed-forward and recurrent neural networks include the NEAT [18], HyperNEAT [17] and CoSyNE [7] algorithms.

On the other hand, studies dealing with the evolution of deep neural networks and convolutional networks started to emerge only very recently. The training of one DNN usually requires hours or days of computing time, quite often utilizing GPU processors for speedup. Naturally, evolutionary techniques requiring thousands of training trials were not considered a feasible choice. Nevertheless, there are several approaches that reduce the overall complexity of neuroevolution for DNNs. Still, due to limited computational resources, the studies usually focus only on parts of the network design.

For example, in [12] CMA-ES is used to optimize hyperparameters of DNNs. In [9] unsupervised convolutional networks for vision-based reinforcement learning are studied; the structure of the CNN is held fixed and only a small recurrent controller is evolved. However, the recent paper [16] presents a simple distributed evolutionary strategy that is used to train relatively large recurrent networks with competitive results on reinforcement learning tasks. In [14] an automated method for optimizing deep learning architectures through evolution is proposed, extending existing neuroevolution methods. The authors of [4] sketch a genetic approach for evolving a deep autoencoder network, enhancing the sparsity of the synapses by means of special operators. Finally, the paper [13] presents two versions of an evolutionary and a co-evolutionary algorithm for the design of DNNs with various transfer functions.

3 Our Approach

In our approach we use evolution strategies to search for the optimal architecture of a DNN, while the weights are learned by a gradient-based technique.

The main idea of our approach is to keep the search space as small as possible; therefore, the architecture specification is simplified. It directly follows the implementation of DNNs in the Keras library, where networks are defined layer by layer, each layer fully connected with the next layer. A layer is specified by the number of neurons, the type of activation function (all neurons in one layer have the same type of activation function), and the type of regularization (such as dropout).

In this paper we work only with fully connected feed-forward neural networks, but the approach can be further modified to also include convolutional layers. The architecture specification would then also contain the type of layer (dense or convolutional) and, in the case of a convolutional layer, the size of the filter.

4 Evolution Strategies for DNN Design

Evolution strategies (ES) were proposed for working with real-valued vectors representing parameters of complex optimization problems [2]. In the illustrative algorithm below we can see a simple ES working with n individuals in a population and generating m offspring by means of Gaussian mutation. The environmental selection has two traditional forms for evolution strategies. The so-called (n + m)-ES generates the new generation by deterministically choosing the n best individuals from the set of (n + m) parents and offspring. The so-called (n, m)-ES generates the new generation by selecting from the m new offspring (typically, m > n). The latter approach is considered more robust against premature convergence to local optima.

Currently used evolution strategies may carry more meta-parameters of the problem in the individual than just a vector of mutation variances. A successful version of evolution strategies, the so-called covariance matrix adaptation ES (CMA-ES) [12], uses a clever strategy to approximate the full N × N covariance matrix, thus representing a general N-dimensional normal distribution. A crossover operator is also usually used within evolution strategies.

In our implementation the (n, m)-ES (see Alg. 1) is used. Offspring are generated using both mutation and crossover operators. Since our individuals describe a network topology, they are not vectors of real numbers, so our operators slightly differ from those of classical ES. A more detailed description follows.

Algorithm 1: (n, m)-Evolution strategy optimizing a real-valued vector and utilizing an adaptive variance for each parameter

  procedure (n, m)-ES
    t ← 0
    Initialize population P_t with n randomly generated vectors x_t = (x_1^t, ..., x_N^t, σ_1^t, ..., σ_N^t)
    Evaluate individuals in P_t
    while not terminating criterion do
      for i ← 1, ..., m do
        choose randomly a parent x_i^t, generate an offspring y_i^t by Gaussian mutation:
        for j ← 1, ..., N do
          σ'_j ← σ_j · (1 + α · N(0, 1))
          x'_j ← x_j + σ'_j · N(0, 1)
        end for
        insert y_i^t into the offspring candidate population P'_t
      end for
      Deterministically choose P_{t+1} as the n best individuals from P'_t
      Discard P_t and P'_t
      t ← t + 1
    end while
  end procedure

4.1 Individuals

Individuals encode feed-forward neural networks implemented as the Keras model Sequential. A Sequential model is built layer by layer; similarly, an individual consists of blocks representing the individual layers:

  I = ([size_1, drop_1, act_1, σ_1^size, σ_1^drop], ..., [size_H, drop_H, act_H, σ_H^size, σ_H^drop]),

where H is the number of hidden layers, size_i is the number of neurons in the corresponding layer (a dense, i.e. fully connected, layer), drop_i is the dropout rate (a zero value represents no dropout), act_i ∈ {relu, tanh, sigmoid, hard sigmoid, linear} stands for the activation function, and σ_i^size and σ_i^drop are strategy coefficients corresponding to size and dropout.

So far we work only with dense layers, but the individual can be further generalized to work with convolutional layers as well. Other types of regularization could also be considered; we are limited to dropout for the first experiments.

4.2 Crossover

The crossover operator combines two parent individuals and produces two offspring individuals. It is implemented as one-point crossover, where the cross-point is on the border of a block.
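As a minimal sketch (illustrative names only, not the code of our implementation [20]), the one-point crossover at a block border can be written as:

```python
import random

def crossover(parent1, parent2, rng=random):
    """One-point crossover on lists of layer blocks: cut each parent at a
    random block border and swap the tails. Each parent needs >= 2 blocks."""
    c1 = rng.randrange(1, len(parent1))  # cut point c_p1 in {1, ..., k-1}
    c2 = rng.randrange(1, len(parent2))  # cut point c_p2 in {1, ..., l-1}
    offspring1 = parent1[:c1] + parent2[c2:]
    offspring2 = parent2[:c2] + parent1[c1:]
    return offspring1, offspring2
```

Here a block would be the [size, drop, act, σ^size, σ^drop] quintuple of Section 4.1, so the offspring are always valid layer lists, although their depths may differ from those of both parents.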
Let two parents be

  I_p1 = (B_1^p1, B_2^p1, ..., B_k^p1),
  I_p2 = (B_1^p2, B_2^p2, ..., B_l^p2);

then the crossover produces the offspring

  I_o1 = (B_1^p1, ..., B_{c_p1}^p1, B_{c_p2 + 1}^p2, ..., B_l^p2),
  I_o2 = (B_1^p2, ..., B_{c_p2}^p2, B_{c_p1 + 1}^p1, ..., B_k^p1),

where c_p1 ∈ {1, ..., k − 1} and c_p2 ∈ {1, ..., l − 1} are the cross-points.

4.3 Mutation

The mutation operator brings random changes to an individual. Each time an individual is mutated, one of the following mutation operators is randomly chosen:

• mutateLayer – introduces random changes to one randomly selected layer. One of the following operators is randomly chosen:
  – changeLayerSize – the number of neurons is changed. Gaussian mutation is used, adapting the strategy parameter σ^size; the resulting number is rounded (since the size has to be an integer).
  – changeDropOut – the dropout rate is changed using Gaussian mutation, adapting the strategy parameter σ^drop.
  – changeActivation – the activation function is changed, randomly chosen from the list of available activations.
• addLayer – one randomly generated block is inserted at a random position.
• delLayer – one randomly selected block is deleted.

Note that the ES-like mutation comes into play only when the layer size or the dropout parameter is changed. Otherwise the strategy parameters are ignored.

4.4 Fitness

The fitness function should reflect the quality of the network represented by an individual. To assess the generalization ability of this network, we use a crossvalidation error. The lower the crossvalidation error, the higher the fitness of the individual.

Classical k-fold crossvalidation is used, i.e. the training set is split into k folds, and each time one fold is used for testing and the rest for training. The mean error on the testing set over the k runs is evaluated.

The mean squared error is used as the error function:

  E = (100 / N) · Σ_{t=1}^{N} (f(x_t) − y_t)²,

where T = ((x_1, y_1), ..., (x_N, y_N)) is the actual testing set and f is the function represented by the learned network.

4.5 Selection

Tournament selection is used, i.e. in each turn of the tournament k individuals are selected at random, and the one with the highest fitness, in our case the one with the lowest crossvalidation error, is selected.

Our implementation of the proposed algorithm is available at [20].

5 Experiments

5.1 Data Set

For the first experiment we used real-world data from the application area of sensor networks for air pollution monitoring [21, 5]; for the second experiment, the well-known MNIST data set [11].

The sensor data contain tens of thousands of measurements from a gas multi-sensor MOX array device recording concentrations of several gas pollutants, collocated with a conventional air pollution monitoring station that provides labels for the data. The data are recorded in 1-hour intervals, and there is quite a large number of gaps due to sensor malfunctions. For our experiments we have chosen data from the interval of March 10, 2004 to April 4, 2005, taking into account each hour; records with missing values were omitted. There are altogether 5 sensors as inputs and 5 target output values representing concentrations of CO, NO2, NOx, C6H6, and NMHC.

The whole time period is divided into five intervals. Then, only one interval is used for training, and the rest is utilized for testing. We considered five different choices of the training part. This task may be quite difficult, since the prediction is also performed in different parts of the year than the learning; e.g. a model trained on data obtained during winter may perform worse during summer (as was suggested by experts in the application area).

Table 1 gives an overview of the data set sizes. All tasks have 8 input values (five sensors, temperature, absolute and relative humidity) and 1 output (the predicted value). All values are normalized to ⟨0, 1⟩.

Table 1: Overview of data set sizes.

  Task    train set   test set
  CO      1469        5875
  NO2     1479        5914
  NOx     1480        5916
  C6H6    1799        7192
  NMHC     178         709

The MNIST data set contains 70 000 images of handwritten digits, 28 × 28 pixels each (see Fig. 1). 60 000 are used for training, 10 000 for testing.
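The fitness evaluation of Section 4.4 can be sketched in a few lines of plain Python. Here `train` is a stand-in for fitting the network (in the paper, Keras training with RMSprop); all names are illustrative assumptions, not the code of our implementation [20]:

```python
def mse(model, data):
    """Error E = 100/N * sum_t (f(x_t) - y_t)^2 over a testing fold."""
    return 100.0 / len(data) * sum((model(x) - y) ** 2 for x, y in data)

def crossvalidation_error(train, data, k=5):
    """Classical k-fold crossvalidation: each fold is held out once for
    testing while a model is trained on the rest; the k fold errors are
    averaged. `train` maps a list of (x, y) pairs to a fitted model, i.e.
    a callable x -> prediction."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        training = [pair for j, fold in enumerate(folds) if j != i
                    for pair in fold]
        errors.append(mse(train(training), folds[i]))
    return sum(errors) / k
```

An individual's fitness then simply inverts this quantity: the lower the crossvalidation error, the higher the fitness.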
[Figure 1: Example of MNIST data set samples. The digit images themselves did not survive text extraction.]

5.2 Setup

For the sensor data the proposed algorithm was run for 100 generations for each data set, with n = 10 and m = 30. During the fitness function evaluation, the network weights are trained by RMSprop (one of the standard algorithms) for 500 epochs. Besides the ES, a classical GA was implemented and run on the sensor data with the same fitness function.

For the MNIST data set, the algorithm was run for 30 generations, with n = 5 and m = 10; for fitness evaluation, RMSprop was run for 20 epochs.

When the best individual is obtained, the corresponding network is built, trained on the whole training set, and evaluated on the test set.

5.3 Results

The resulting testing errors obtained by the GA and the ES in the first experiment are listed in Table 3, giving the average, standard deviation, minimum and maximum errors over 10 computations. The performance of the ES is slightly better than that of the GA; the ES achieved lower errors in 15 cases, the GA in 11 cases.

Table 4 compares the ES testing errors to results obtained by support vector regression (SVR) with linear, RBF, polynomial, and sigmoid kernel functions. The SVR was trained using the Scikit-learn library [15]; its hyperparameters were found using grid search and crossvalidation. The ES outperforms the SVR; it found the best result in 17 cases.

Finally, Table 5 compares the testing error of the evolved network to the errors of three selected fixed architectures (for example, 30-10-1 stands for 2 hidden layers of 30 and 10 neurons and one neuron in the output layer; ReLU activation and dropout 0.2 are used). The evolved network achieved the most (10) best results.

Since this task does not have many training samples, the evolved networks are also quite small. The typical evolved network had one hidden layer of about 70 neurons, a dropout rate of 0.3 and the ReLU activation function.

The second experiment was the classification of MNIST digits. As the baseline architecture we took the one from the Keras examples, i.e. a network with two hidden layers of 512 ReLU units each, both with dropout 0.2. This network has a fairly good performance. It was trained 10 times, and the results are listed in Table 2, together with the results obtained by the evolved network.

Table 2: Test accuracies on the MNIST data set.

  model           avg     std    min     max
  baseline        98.34   0.13   98.18   98.55
  evolved by ES   98.64   0.05   98.55   98.73

The evolved network also had two hidden layers, the first with 736 ReLU units and dropout parameter 0.09, the second with 471 hard sigmoid units and dropout 0.2. The ES found a competitive result; the evolved network achieved better accuracy than the baseline model.

6 Conclusion

We have proposed an algorithm for the automatic design of DNNs based on evolution strategies. The algorithm was tested in experiments on a real-life sensor data set and on the MNIST data set of handwritten digits. On the sensor data set, the solutions found by our algorithm outperform SVR and the selected fixed architectures. The activation function dominating in the solutions is the ReLU function. For the MNIST data set, a network with ReLU and hard sigmoid units was found, outperforming the baseline solution. We have shown that our algorithm is able to find competitive solutions.

The main limitation of the algorithm is its time complexity. One direction of our future work is to try to lower the number of fitness evaluations by using surrogate modeling, or to use asynchronous evolution.

We also plan to extend the algorithm to work with convolutional networks and to include more parameters, such as other types of regularization, the type of optimization algorithm, etc.

The gradient-based optimization algorithm depends significantly on the random initialization of weights. One way to overcome this is to combine the evolution of weights with gradient-based local search, which is another possibility for future work.

Acknowledgment

This work was partially supported by the Czech Grant Agency grant 15-18108S and institutional support of the Institute of Computer Science RVO 67985807. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.
Table 3: Errors on test set for networks found by GA and ES. The average, standard deviation, minimum and maximum
of 10 evaluations of the learning algorithm are listed.
                     GA                              ES
  Task        avg    std    min    max      avg    std    min    max
CO part1 0.209 0.014 0.188 0.236 0.229 0.026 0.195 0.267
CO part2 0.801 0.135 0.600 1.048 0.657 0.024 0.631 0.694
CO part3 0.266 0.029 0.222 0.309 0.256 0.045 0.199 0.349
CO part4 0.404 0.226 0.186 0.865 0.526 0.108 0.308 0.701
CO part5 0.246 0.024 0.207 0.286 0.235 0.025 0.199 0.277
NOx part1 2.201 0.131 1.994 2.506 2.132 0.086 2.021 2.284
NOx part2 1.705 0.284 1.239 2.282 1.599 0.077 1.444 1.685
NOx part3 1.238 0.163 0.982 1.533 1.339 0.242 1.106 1.955
NOx part4 1.490 0.173 1.174 1.835 1.610 0.164 1.435 2.041
NOx part5 0.551 0.052 0.456 0.642 0.622 0.075 0.521 0.726
NO2 part1 1.697 0.266 1.202 2.210 1.506 0.217 1.132 1.823
NO2 part2 2.009 0.415 1.326 2.944 1.371 0.048 1.242 1.415
NO2 part3 0.593 0.082 0.532 0.815 0.660 0.078 0.599 0.863
NO2 part4 0.737 0.023 0.706 0.776 0.782 0.043 0.711 0.856
NO2 part5 1.265 0.158 1.054 1.580 0.730 0.111 0.520 0.905
C6H6 part1 0.013 0.005 0.006 0.024 0.013 0.004 0.007 0.018
C6H6 part2 0.039 0.015 0.025 0.079 0.034 0.010 0.020 0.050
C6H6 part3 0.019 0.011 0.009 0.041 0.048 0.015 0.016 0.075
C6H6 part4 0.030 0.015 0.014 0.061 0.020 0.010 0.010 0.042
C6H6 part5 0.017 0.015 0.004 0.051 0.027 0.011 0.014 0.051
NMHC part1 1.719 0.168 1.412 2.000 1.685 0.256 1.448 2.378
NMHC part2 0.623 0.164 0.446 1.047 0.713 0.097 0.566 0.865
NMHC part3 1.144 0.181 0.912 1.472 1.097 0.270 0.775 1.560
NMHC part4 1.220 0.206 0.994 1.563 1.099 0.166 0.898 1.443
NMHC part5 1.222 0.126 1.055 1.447 1.023 0.050 0.963 1.116
  Number of best results: GA 11 (44%), ES 15 (60%).
Table 4: Test errors for evolved network and SVR with different kernel functions. For the evolved network the average,
standard deviation, minimum and maximum of 10 evaluations of learning algorithm are listed.
              Evolved network                  SVR
  Task        avg    std    min    max      linear   RBF    Poly.   Sigmoid
CO_part1 0.229 0.026 0.195 0.267 0.340 0.280 0.285 1.533
CO_part2 0.657 0.024 0.631 0.694 0.614 0.412 0.621 1.753
CO_part3 0.256 0.045 0.199 0.349 0.314 0.408 0.377 1.427
CO_part4 0.526 0.108 0.308 0.701 1.127 0.692 0.535 1.375
CO_part5 0.235 0.025 0.199 0.277 0.348 0.207 0.198 1.568
NOx_part1 2.132 0.086 2.021 2.284 1.062 1.447 1.202 2.537
NOx_part2 1.599 0.077 1.444 1.685 2.162 1.838 1.387 2.428
NOx_part3 1.339 0.242 1.106 1.955 0.594 0.674 0.665 2.705
NOx_part4 1.610 0.164 1.435 2.041 0.864 0.903 0.778 2.462
NOx_part5 0.622 0.075 0.521 0.726 1.632 0.730 1.446 2.761
NO2_part1 1.506 0.217 1.132 1.823 2.464 2.404 2.401 2.636
NO2_part2 1.371 0.048 1.242 1.415 2.118 2.250 2.409 2.648
NO2_part3 0.660 0.078 0.599 0.863 1.308 1.195 1.213 1.984
NO2_part4 0.782 0.043 0.711 0.856 1.978 2.565 1.912 2.531
NO2_part5 0.730 0.111 0.520 0.905 1.0773 1.047 0.967 2.129
C6H6_part1 0.013 0.004 0.007 0.018 0.300 0.511 0.219 1.398
C6H6_part2 0.034 0.010 0.020 0.050 0.378 0.489 0.369 1.478
C6H6_part3 0.048 0.015 0.016 0.075 0.520 0.663 0.538 1.317
C6H6_part4 0.020 0.010 0.010 0.042 0.217 0.459 0.123 1.279
C6H6_part5 0.027 0.011 0.014 0.051 0.215 0.297 0.188 1.526
NMHC_part1 1.685 0.256 1.448 2.378 1.718 1.666 1.621 3.861
NMHC_part2 0.713 0.097 0.566 0.865 0.934 0.978 0.839 3.651
NMHC_part3 1.097 0.270 0.775 1.560 1.580 1.280 1.438 2.830
NMHC_part4 1.099 0.166 0.898 1.443 1.720 1.565 1.917 2.715
NMHC_part5 1.023 0.050 0.963 1.116 1.238 0.944 1.407 2.960
  Number of best results: evolved network 17 (68%), SVR linear 2 (8%), SVR RBF 2 (8%), SVR polynomial 4 (16%).
Table 5: Test errors for evolved network and three selected fixed architectures.
              Evolved network    50-1            30-10-1         30-10-30-1
  Task        avg    std        avg    std      avg    std      avg    std
CO_part1 0.229 0.026 0.230 0.032 0.250 0.023 0.377 0.103
CO_part2 0.657 0.024 0.861 0.136 0.744 0.142 0.858 0.173
CO_part3 0.256 0.045 0.261 0.040 0.305 0.043 0.302 0.046
CO_part4 0.526 0.108 0.621 0.279 0.638 0.213 0.454 0.158
CO_part5 0.235 0.025 0.283 0.072 0.270 0.032 0.309 0.032
NOx_part1 2.132 0.086 2.158 0.203 2.095 0.131 2.307 0.196
NOx_part2 1.599 0.077 1.799 0.313 1.891 0.199 2.083 0.172
NOx_part3 1.339 0.242 1.077 0.125 1.092 0.178 0.806 0.185
NOx_part4 1.610 0.164 1.303 0.208 1.797 0.461 1.600 0.643
NOx_part5 0.622 0.075 0.644 0.075 0.677 0.055 0.778 0.054
NO2_part1 1.506 0.217 1.659 0.250 1.368 0.135 1.677 0.233
NO2_part2 1.371 0.048 1.762 0.237 1.687 0.202 1.827 0.264
NO2_part3 0.660 0.078 0.682 0.148 0.576 0.044 0.603 0.069
NO2_part4 0.782 0.043 1.109 0.923 0.757 0.059 0.802 0.076
NO2_part5 0.730 0.111 0.646 0.064 0.734 0.107 0.748 0.123
C6H6_part1 0.013 0.004 0.012 0.006 0.081 0.030 0.190 0.060
C6H6_part2 0.034 0.010 0.039 0.012 0.101 0.015 0.211 0.071
C6H6_part3 0.048 0.015 0.024 0.007 0.091 0.047 0.115 0.031
C6H6_part4 0.020 0.010 0.026 0.010 0.051 0.026 0.096 0.020
C6H6_part5 0.027 0.011 0.025 0.008 0.113 0.025 0.176 0.058
NMHC_part1 1.685 0.256 1.738 0.144 1.889 0.119 2.378 0.208
NMHC_part2 0.713 0.097 0.553 0.045 0.650 0.078 0.799 0.096
NMHC_part3 1.097 0.270 1.128 0.089 0.901 0.124 0.789 0.184
NMHC_part4 1.099 0.166 1.116 0.119 0.918 0.119 0.751 0.096
NMHC_part5 1.023 0.050 0.970 0.094 0.889 0.085 0.856 0.074
  Number of best results: evolved network 10 (40%), 50-1 6 (24%), 30-10-1 4 (16%), 30-10-30-1 5 (20%).
References

[1] Jasmina Arifovic and Ramazan Gençay. Using genetic algorithms to select architecture of a feedforward artificial neural network. Physica A: Statistical Mechanics and its Applications, 289(3–4):574–594, 2001.
[2] H.-G. Beyer and H. P. Schwefel. Evolutionary strategies: A comprehensive introduction. Natural Computing, pages 3–52, 2002.
[3] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
[4] Omid E. David and Iddo Greental. Genetic algorithms for evolving deep neural networks. In Proceedings of the Companion Publication of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO Comp '14, pages 1451–1452, New York, NY, USA, 2014. ACM.
[5] S. De Vito, G. Fattoruso, M. Pardo, F. Tortorella, and G. Di Francia. Semi-supervised learning techniques in artificial olfaction: A novel approach to classification problems and drift counteraction. IEEE Sensors Journal, 12(11):3215–3224, Nov 2012.
[6] Dario Floreano, Peter Dürr, and Claudio Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62, 2008.
[7] Faustino Gomez, Juergen Schmidhuber, and Risto Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, pages 937–965, 2008.
[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[9] Jan Koutník, Juergen Schmidhuber, and Faustino Gomez. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO '14, pages 541–548, New York, NY, USA, 2014. ACM.
[10] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[11] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 2012.
[12] Ilya Loshchilov and Frank Hutter. CMA-ES for hyperparameter optimization of deep neural networks. CoRR, abs/1604.07269, 2016.
[13] Tomas H. Maul, Andrzej Bargiela, Siang-Yew Chong, and Abdullahi S. Adamu. Towards evolutionary deep neural networks. In Flaminio Squazzoni, Fabio Baronio, Claudia Archetti, and Marco Castellani, editors, ECMS 2014 Proceedings. European Council for Modeling and Simulation, 2014.
[14] Risto Miikkulainen, Jason Zhi Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. CoRR, abs/1703.00548, 2017.
[15] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[16] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. ArXiv e-prints, March 2017.
[17] Kenneth O. Stanley, David B. D'Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, April 2009.
[18] Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.
[19] B. u. Islam, Z. Baharudin, M. Q. Raza, and P. Nallagownden. Optimization of neural network architecture using genetic algorithm for load forecasting. In 2014 5th International Conference on Intelligent and Advanced Systems (ICIAS), pages 1–6, June 2014.
[20] Petra Vidnerová. GAKeras. https://github.com/PetraVidnerova/GAKeras, 2017.
[21] S. De Vito, E. Massera, M. Piga, L. Martinotto, and G. Di Francia. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, 129(2):750–757, 2008.