<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evolutionary federated learning on EEG-data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gábor Szegedi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Péter Kiss</string-name>
          <email>peter.kiss@inf.elte.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Horváth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Data Science and Engineering, Faculty of Informatics, ELTE Eötvös Loránd University, Budapest</institution>
          ,
          <addr-line>H-1117 Budapest, Pázmány Péter sétány 1/C.</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <abstract>
<p>With the spread of digitalization across every aspect of society and the economy, the amount of data generated keeps increasing. In some domains, this generation happens in such a massively distributed fashion that even collecting the data to build machine learning (ML) models on it poses challenges, not to mention the processing power necessary for training. An important aspect of processing information generated at users is privacy: users may be unwilling to expose anything that would enable one to draw conclusions regarding confidential information they possess. In this work, we present an experiment with a genetic-algorithm-based federated learning (FL) algorithm that reduces the data transfer from individual users to the learner to a single fitness value.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        The paradigm of federated learning (FL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] addresses the increasingly timely scenario in which the data to be processed is generated in a massively distributed environment, where traditional approaches to building machine learning (ML) models become extremely challenging, mostly from a logistical point of view. That is, when data is generated at client devices such as mobile phones, tablets or smart watches, collecting, storing and processing all this information in data centers may be a difficult task (the aggregation problem) and, according to the idea of FL, not at all necessary.
      </p>
      <p>Another problem with the traditional data-center-based solution is privacy. Users of applications that build on centralized model training may be reluctant to share their possibly confidential data. We believe a particularly fitting scenario for this problem is the use case of medical applications. Each medical institute might have a lot of patient data, yet far from enough to train its own prediction models. Here, sharing data across a large number of institutes could greatly help in developing automated diagnostic tools. But given the private nature of these data, hospitals will likely decide not to share any of this information, either to protect their reputation or due to legal regulations.</p>
      <p>
        As summarized in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the characteristics of the data that FL is concerned with can be described formally as follows:
      </p>
      <p>Formally, we have K nodes and n data points, a set of indices P_k (k ∈ {1, ..., K}) of data stored at node k, and n_k = |P_k| is the number of data points at P_k. We assume that P_k ∩ P_l = ∅ whenever l ≠ k, thus Σ_{k=1}^{K} n_k = n.</p>
      <p>We can then define the local loss for node k as F_k(w) = (1/n_k) Σ_{i∈P_k} f_i(w), where f_i(w) is the loss of our model on the i-th training example, given parametrisation w. Thus the problem to be minimized becomes:</p>
      <p>min_{w∈R^d} f(w) = Σ_{k=1}^{K} (n_k/n) F_k(w). (1)</p>
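<p>To make Equation (1) concrete, the global objective is just a data-size-weighted average of the local losses. A minimal sketch (the function name and the toy values are ours, not from the paper):</p>

```python
def federated_loss(local_losses, local_counts):
    """Global objective f(w) = sum_k (n_k / n) * F_k(w) from Equation (1).

    local_losses: F_k(w), the average local loss at each node k
    local_counts: n_k, the number of data points stored at node k
    """
    n = sum(local_counts)
    return sum((nk / n) * Fk for Fk, nk in zip(local_losses, local_counts))

# Two nodes: 30 points with local loss 0.2, 70 points with local loss 0.6.
# f(w) = 0.3 * 0.2 + 0.7 * 0.6 = 0.48
print(federated_loss([0.2, 0.6], [30, 70]))
```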
      <p>
        To solve the learning problem (1) for neural networks (NNs), the mainstream approach is – starting from a common initial parametrisation – to train local models using some version of gradient descent, and then aggregate the local model updates (e.g. the gradients) or, equivalently, the local models themselves, to update the global model. The global model is then sent back to the worker nodes. Algorithm 1, FederatedAveraging [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], is one of the most successful algorithms for federated NN training.
      </p>
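<p>The aggregation step of FederatedAveraging can be sketched as follows; this is a simplified illustration of the weighted model averaging only (local training is stubbed out), not the authors' implementation:</p>

```python
import numpy as np

def fedavg_round(global_w, client_updates, counts):
    """One FederatedAveraging round (sketch): each client returns locally
    trained weights starting from the global ones; the server averages
    them, weighted by the clients' data sizes n_k / n."""
    n = sum(counts)
    local_ws = [update(global_w.copy()) for update in client_updates]
    agg = np.zeros_like(global_w)
    for wk, nk in zip(local_ws, counts):
        agg += (nk / n) * wk
    return agg

# Two toy "clients" whose local training moves the weight in opposite
# directions; the client with 3x more data dominates the average.
clients = [lambda w: w + 1.0, lambda w: w - 1.0]
w = fedavg_round(np.array([0.0]), clients, counts=[3, 1])
print(w)  # [0.5] = (3/4) * 1.0 + (1/4) * (-1.0)
```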
      <p>
        FederatedAveraging works well at solving the aggregation problem; however, using the gradients or, equivalently, the local models in the global aggregation step still exposes some information about users' data. To address privacy concerns, the usual solution is to apply results from differential privacy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] on top of the gradient-based learning process.
      </p>
      <p>In this paper we present a slightly different approach: we investigated whether it is possible to train NNs in a federated fashion without using gradients in any form. Evolutionary algorithms seemed a natural choice for this problem, and since a rich literature is already available on the evolutionary optimization of NNs, we only needed to transfer this knowledge into the federated environment.</p>
      <p>As the concrete task to be solved by our method we chose the classification of EEG signals using convolutional neural networks (CNNs).</p>
      <p>The main contributions of this paper are:
1. a proof of concept for the applicability of genetic algorithms to the federated training of NNs without using vulnerable gradients;
2. Federated Neuroevolution (FNE), a simple algorithm for federated training that applies a distributed fitness function.</p>
    </sec>
    <sec id="sec-2">
      <title>Neuroevolution</title>
      <p>Evolutionary algorithms (EAs) follow the pattern of evolution as observed by biologists in nature. In an endless cycle of life, the most apt individuals produce offspring possessing a potentially slightly changed (mutated) mixture of their genomes, which may result in an enhanced ability to face the challenges of their lives. The main assumption in biology is that individuals that are in some respects superior to the others survive and create descendants with a higher probability. The main structure of an EA is sketched in Algorithm 2.</p>
      <p>Algorithm 2 EA
1: generate an initial population G_0, i = 0
2: repeat
3:   ∀ individual_j ∈ G_i : f_j = fitness(individual_j)
4:   select parents from G_i based on their fitness
5:   produce offspring generation G_{i+1}
6:   ∀ individual_j ∈ G_{i+1} : individual_j ← mutate(individual_j)
7: until termination criterion is satisfied</p>
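<p>The loop of Algorithm 2 can be sketched generically; the toy fitness and operators below are ours for illustration, not the ones used later in the paper:</p>

```python
import random

def evolve(population, fitness, crossover, mutate, n_parents, generations):
    """Generic EA loop of Algorithm 2: evaluate, select, cross over, mutate."""
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        offspring = [crossover(random.choice(parents), random.choice(parents))
                     for _ in range(len(population))]
        population = [mutate(child) for child in offspring]
    return max(population, key=fitness)

# Toy run: maximize f(x) = -(x - 3)^2 over scalar "genomes".
random.seed(0)
best = evolve(
    population=[random.uniform(-10, 10) for _ in range(20)],
    fitness=lambda x: -(x - 3) ** 2,
    crossover=lambda a, b: (a + b) / 2,           # mean mix
    mutate=lambda x: x + random.uniform(-0.1, 0.1),
    n_parents=5,
    generations=100)
print(abs(best - 3.0) < 0.5)  # the population converges near x = 3
```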
      <p>EAs – like nature-inspired methods in general – are often used to explore very complex, high-dimensional and/or non-convex search problems; therefore, attempts to apply these methods to optimizing NNs have a long history.</p>
      <p>Recently, nature-inspired methods have been used in connection with NNs mostly for hyper-parameter tuning, which includes searching for an efficient architecture.</p>
      <p>
        A large part of this rich literature is concerned specifically with CNNs, which is what we apply to our problem. Genetic CNN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], hierarchical evolution [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], large-scale evolution [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], asynchronous CNN evolution [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and automatic CNN design [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] give graph-based methods to automatically design the stack of convolutional layers (potentially skipping the fully connected parts of the network) for image classification, through genetic evolution of subsequent layers with various innovative encoding techniques.
      </p>
      <p>In these scenarios, the learning itself is still based on calculating the gradient and updating the model accordingly (backpropagation).</p>
      <p>
        Backpropagation, being based on the calculation of gradients and on applying them to the weights of the network, is exactly what we want to avoid in our experiment. Before the dominance of derivative-based training algorithms, however, biology-inspired training methods were a rather popular research topic, so there is a rich, though somewhat dated, literature concerned with our constrained problem. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] give a summary of these early approaches to neuroevolution (NE).
      </p>
      <p>
        There is a very interesting branch of applications of NE for general NNs that includes techniques to train the architecture, along with the weights of the network, purely genetically. Among the most important algorithms that belong here are NeuroEvolution of Augmenting Topologies (NEAT) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Hypercube-based NEAT (HyperNEAT) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] with its specialization for modular evolution of NNs, HyperNEAT-LEO [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and Generative NE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Despite the power of HyperNEAT, we decided first to focus on training a predefined architecture, so our method is based on more "traditional" NE algorithms.
      </p>
      <p>To apply an evolutionary approach to a problem, one needs to specify an encoding of the problem, a selection method, a crossover method, a mutation method, and a fitness function.</p>
      <p>In the rest of this section we briefly describe the stages of an EA, along with examples of how these stages have been implemented in work on NE that inspired our algorithm. At the end of the section we also describe approaches aimed at handling overfitting, which has proven to be a serious problem in NE.</p>
      <sec id="sec-2-1">
        <title>Encoding</title>
        <p>Genetic algorithms work on sequences of features that are mixed or altered according to some granularity defined over them. Thus the first step in solving a problem genetically is to provide a description of the search space. We refer to this description as the encoding, which can be direct or indirect.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Direct and indirect encoding</title>
        <p>Direct encoding is the more traditional way of problem encoding, where sections of the genome more or less correspond to specific parameters. Some of the early methods also handle switches that control the connectivity of the individual perceptrons.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] proposes a system based on a parallel genetic algorithm, ANNA ELEONORA, for learning both the topology and the connection weights. It utilizes a binary representation of networks with granularity encoding: a one-bit flag determines connectivity, that is, whether the given edge is present in the current setup or not, followed by the substring of weights. These substrings are ordered so that connections into the same neuron are grouped together.
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] presents a variant of EA applied directly to float weights. The input of the EA is a vector x of variables: the parameters of the model (that is, the weights of the connections), the biases, and the newly introduced link switches. Link switches are variables that control the connectivity of the network; a negative value represents that the edge is switched off. The search space is constrained by upper and lower bounds on the variables (weights): x ∈ I_1 × I_2 × ... × I_d, where I_i = [l_i, u_i] and l_i, u_i ∈ R for i = 1, 2, ..., d.
        </p>
        <p>
          In theory, using the connectivity features of the encoding, the first method is able to evolve the architecture too. The issue with this approach to encoding is that the problem space grows very fast as the network is scaled up (which is needed for solving complex problems).
          Indirect encoding The scaling problem of direct encoding can be solved with indirect encoding, which, instead of a separate representation of each model parameter, uses generative information. In HyperNEAT [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], maybe the most important representative of this class, the genes of the genome define functions based on which the weights can be generated.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Fitness and crossover</title>
        <p>
          The fitness function specifies how well a given individual performs on the problem to be solved. A higher fitness value means a better solution to the problem, while a lower fitness value reports poor performance. Fitness is often normalized, so a function that produces a fitness value of 1 for a perfect solution and 0 for a completely wrong one works well. As an example of a normalized fitness in ML scenarios, [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposes a fitness function for NNs defined as f_norm = 1 / (1 + err), where err = (1/m) Σ_{k=1}^{m} Σ_{i=1}^{d} |y_i^(k) − ŷ_i^(k)|, with d denoting the output dimension and m the number of examples, i.e. applying the mean absolute error.
        </p>
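<p>The normalized fitness from [19] can be sketched as follows (our own minimal rendering of the formula above):</p>

```python
import numpy as np

def normalized_fitness(y_true, y_pred):
    """f_norm = 1 / (1 + err), where err is the absolute error summed
    over the d output dimensions and averaged over the m examples."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    m = y_true.shape[0]
    err = np.abs(y_true - y_pred).sum() / m
    return 1.0 / (1.0 + err)

print(normalized_fitness([[1, 0]], [[1, 0]]))  # 1.0 for a perfect model
print(normalized_fitness([[1, 0]], [[0, 1]]))  # 1/(1+2) for a fully wrong one
```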
        <p>
          Crossover defines how individuals of a generation are combined to create offspring for the next generation. One simple way is – as in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] – to combine parts of the parent individuals at some cutting points. Another approach is presented in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], where crossover takes the average of the corresponding weights of the two individuals: x^(t+1) = (x_1^(t) + x_2^(t)) / 2, where the x^(t) are individuals represented as vectors of parameters.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Mutation</title>
        <p>
          Mutation methods serve to add extra variance to the individual genomes, enabling them to discover a bigger part of the search space. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] provides a representation that translates different topologies and encoding lengths into a common string format, permitting compositionally different descriptions. For mutation, it applies three separate probabilities for flipping bits: one each for the granularity bits, the connectivity bit, and the weight bits. To explore the search space effectively, it uses the EA simplex [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], taking three populations and creating a fourth based on those.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], where the possible values of the genes are constrained, mutation is carried out according to the formula x^(t+1) = x^(t) + Bδ, where B is a diagonal matrix with diagonal entries B_ii ∈ {0, 1}, and l_i ≤ x_i^(t) + δ_i ≤ u_i. Based on this rule, the algorithm generates three individuals/chromosomes: in the first, only one element of the diagonal of B is one; in the second, a random number of diagonal elements are one; and in the last, B_ii = 1 for all i = 1, ..., d. Whichever of these three has the best fitness replaces the weakest individual in the next population.
        </p>
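<p>The three-candidate bounded mutation of [18] might be sketched like this (the variable names and the way δ is sampled are our assumptions; the rule only requires that mutated genes stay within their bounds):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def bounded_mutation_candidates(x, lower, upper):
    """x' = x + B*delta with diagonal B, B_ii in {0, 1}. Three candidates:
    mutate exactly one gene, a random subset of genes, or all genes.
    delta is drawn so that every mutated gene stays in [lower, upper]."""
    d = len(x)
    delta = rng.uniform(lower - x, upper - x)     # keeps x + delta in bounds
    one = np.zeros(d)
    one[rng.integers(d)] = 1.0                    # B with a single 1
    subset = rng.integers(0, 2, d).astype(float)  # B with a random 0/1 diagonal
    full = np.ones(d)                             # B = identity
    return [x + mask * delta for mask in (one, subset, full)]

x = np.array([0.5, -0.5, 0.0])
lo, hi = np.full(3, -1.0), np.full(3, 1.0)
candidates = bounded_mutation_candidates(x, lo, hi)
print(all(np.all((c >= lo) & (c <= hi)) for c in candidates))  # True
```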
      </sec>
      <sec id="sec-2-5">
        <title>Overfitting</title>
        <p>Using an EA usually involves a high computational demand, which can be reduced by decreasing the number of evaluations of the model, that is, the size of the training data on which we try out the models defined by a given generation of the genetic algorithm.</p>
        <p>
          Earlier applications of EAs usually did not separate the data into training and test sets (like [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], for example). Practitioners soon realized, however, that models trained this way perform poorly on unseen data points, revealing the tendency of evolutionary methods to strongly overfit the training problems. This issue moved to the center of interest when, as an attempt to reduce running time, subsets of the training data were used to evaluate individuals.
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] carried out comprehensive experiments showing that evolution is potentially able to extrapolate from randomly chosen test sets. A very promising direction for reducing overfitting is random sampling, where at each generation a random subset of the training data is chosen and evolution is performed based on the fitness on that sample. The Random Sampling Technique (RST) [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] was originally used for speeding up the GE runs in [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]; however, it has also been used for preventing overfitting. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] and [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] ran experiments on RST, testing two parameters: the Random Subset Size (RSS) and the Random Subset Reset (the frequency of changing the subset). Interestingly, they found that the technique performs best when both values are set to one, that is, when in each iteration the fitness is tested on a single, newly chosen random data point.
        </p>
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], the authors present versions of "interleaved sampling": instead of random subsets, fitness is evaluated in each round alternating between one training sample and all training samples, with various switching frequencies. They find that, on their test datasets, the best technique is to switch in every round between single-sample and all-sample evaluation.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The problem</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>
          For the experiment, we used the EEG Database Data Set [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. The dataset contains 120 EEG trials for each of 122 patients, who belong either to the alcoholic or to the control group. In each trial, the patients were shown one or two images from the Snodgrass and Vanderwart picture set [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. After the stimuli were shown, the patients' brain activity was measured for 1 second at 64 electrode positions at 256 Hz. The measurements are labelled according to the group they belong to, so the task of the model to be built is to predict which of the two classes a sample belongs to.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Network architecture</title>
        <p>
          For the network architecture to train, we decided to use the shallow convolutional network from [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], which was designed specifically for EEG-based multiclass prediction problems. The essence of the network is three convolutional layers intended to recognize specific patterns in the signals. After the first two convolutional layers there is a pooling layer, followed by the third convolutional layer. To the output of this layer we applied batch normalization, then added the output dense layer with sigmoid activation.
        </p>
        <p>
          For the control experiment, we used the AdaDelta optimizer [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] with the Categorical Cross-Entropy loss function. For training we used a batch size of 64, a learning rate of 1.0, ρ = 0.95, and ε = 10⁻⁷.
        </p>
        </p>
        <p>The control model achieved a validation accuracy of 95% after 100 epochs (see Figure 1).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The proposed methods</title>
      <p>The algorithm runs according to the process defined in Algorithm 2. To start, we create an initial generation in which the weights of each individual's model are initialized randomly. From the initial generation we then iterate the fitness-selection-crossover-mutation loop. In this section we describe the particular methods we used for the different stages.
Selection The candidate set of individuals for crossover is created by sorting the current generation's models by their fitness and selecting the n − 1 fittest models for crossover. The last parent selected for mating is not among the fittest ones but is chosen randomly from the rest, to add more variance.</p>
      <p>Crossover functions The crossover method defines the way new individuals are generated from the parent generation.</p>
      <p>In our method, we pick two parents randomly from the pool of parents to produce the required number of offspring.</p>
      <p>
        We ran experiments with four crossover methods. The first three require flattening the vector of weights and are rather popular in EA research.
        • Halving mix: in this approach, a simplified version of the one in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the offspring vector takes its first half from the first parent and its second half from the second parent. This was also the original approach in [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. Formally:
        offspring_i = a_i if i ≤ n/2, b_i if i > n/2, (2)
        where n is the length of the model vectors and a = (a_1, a_2, ..., a_n), b = (b_1, b_2, ..., b_n) are the parent vectors.
        • Interleave mix: here the offspring vector is created by interleaving the two parent vectors. Formally:
        offspring_i = a_i if i mod 2 = 0, b_i if i mod 2 = 1, (3)
        where n is the length of the model vectors and a, b are the parent vectors.
        • Mean mix: in this method, similarly to [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the offspring vector is created by taking the mean of the two parent vectors at each index. Formally:
        offspring_i = (a_i + b_i) / 2. (4)
        • Kernelwise mix: in this algorithm we used coarser units for the crossover. In each convolutional layer there are multiple kernels/filters that hold key pattern information; similarly, in fully connected layers, the input weights belonging to a single neuron describe some pattern of the previous layer. These information portions are kept intact during crossover: the offspring model is created by randomly mixing the kernels inside each layer.
      </p>
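<p>The four crossover operators can be sketched on flat weight vectors (and, for the kernelwise variant, on a list of kernels); this is our own minimal rendering of the formulas above:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def halving_mix(a, b):          # Equation (2)
    n = len(a)
    return np.concatenate([a[:n // 2], b[n // 2:]])

def interleave_mix(a, b):       # Equation (3)
    child = b.copy()
    child[::2] = a[::2]         # even indices from parent a, odd from b
    return child

def mean_mix(a, b):             # Equation (4)
    return (a + b) / 2.0

def kernelwise_mix(kernels_a, kernels_b):
    """Take each kernel wholesale from one parent or the other, keeping
    the pattern learned inside a kernel intact."""
    return [ka if rng.random() < 0.5 else kb
            for ka, kb in zip(kernels_a, kernels_b)]

a, b = np.zeros(6), np.ones(6)
print(halving_mix(a, b))     # [0. 0. 0. 1. 1. 1.]
print(interleave_mix(a, b))  # [0. 1. 0. 1. 0. 1.]
print(mean_mix(a, b))        # [0.5 0.5 0.5 0.5 0.5 0.5]
```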
      <p>In our experiments, the first three crossover methods did not converge. This may be because these approaches are very low level and take no account of the network structure or the patterns learned in the kernels.</p>
      <p>Kernelwise mixing is a higher-level approach, which we tried after looking at how genetics works in nature. In nature, heredity is also a higher-level mixing of genes rather than a low-level mix of organic molecules; traits of the parents are thus kept intact. The resemblance to genetics can be summarized as follows: the DNA is the network's weights, a gene is a filter, and an organic molecule is a float value. With this last method the evolutionary training converged, so we used it in our approach.</p>
      <sec id="sec-4-1">
        <title>Mutation functions</title>
        <p>Crossover on its own results in generations that are only combinations of the initial generation according to the defined rules. Using merely crossover thus restricts the space searched by the algorithm. To break this, random alterations of the offspring are applied in the form of mutation functions.</p>
        <p>To define a mutation function we must specify the number of mutated values and the scale of the mutation on these values. For the former we used a probability determining the chance of mutation for each value in the model. The latter is a float value determining how large the impact on each mutating value is.</p>
        <p>
          We tried the following two main approaches for mutating values in a network:
          • Mutate by offset (from [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]): here we add a random value to the selected weights. In our implementation, the offset was a random value in [−mutation_rate, mutation_rate].
          • Mutate by multiplication (from [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]): here we multiply the selected weights by a random value. In our implementation, the multiplication factor was a random value in [(100 − mutation_rate)/100, (100 + mutation_rate)/100].
        </p>
        <p>After experimenting, we found a much better convergence rate with the second approach.</p>
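<p>The two mutation operators can be sketched as follows (the mask sampling is our assumption; the intervals mirror those given above):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def mutate_offset(w, chance, rate):
    """Add a random offset in [-rate, rate] to each selected weight."""
    mask = rng.random(w.shape) < chance
    return w + mask * rng.uniform(-rate, rate, w.shape)

def mutate_multiply(w, chance, rate):
    """Multiply each selected weight by a random factor in
    [(100 - rate)/100, (100 + rate)/100] (the variant that converged)."""
    mask = rng.random(w.shape) < chance
    factors = rng.uniform((100 - rate) / 100, (100 + rate) / 100, w.shape)
    return np.where(mask, w * factors, w)

w = np.ones(1000)
m = mutate_multiply(w, chance=0.1, rate=5.0)
print(m.min() >= 0.95 and m.max() <= 1.05)  # True: factors stay within +/-5%
```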
      </sec>
      <sec id="sec-4-2">
        <title>Federated fitness function</title>
        <p>For fitness, which should be maximized during the evolutionary training, we have chosen the Negative Mean Squared Error (NMSE), defined as in Equation (5):
f_NMSE(w) = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{d} (ŷ_j^(i) − y_j^(i))², (5)
where ŷ is the predicted output vector using parameters w, y is the target output vector, d is the output dimension and n is the number of examples. This is a slightly different function from the normalized fitness example in Section 2, but its behaviour is the same ((∂/∂w_i) f_NMSE(w) · (∂/∂w_i) f_norm(w) > 0 for all w and i).</p>
        <p>Applying the NMSE fitness to the original optimization problem in Equation (1), our task becomes maximizing the NMSE with respect to w:
max_{w∈R^d} f_NMSE(w) = −(1/n) Σ_{i=1}^{n} ‖ŷ^(i) − y^(i)‖₂². (6)</p>
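<p>Equation (5) in code, as our own minimal sketch:</p>

```python
import numpy as np

def nmse_fitness(y_true, y_pred):
    """Negative Mean Squared Error, Equation (5):
    f_NMSE(w) = -(1/n) * sum_i sum_j (yhat_j^(i) - y_j^(i))^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = y_true.shape[0]
    return -np.sum((y_pred - y_true) ** 2) / n

print(nmse_fitness([[1, 0]], [[0.5, 0.5]]))                     # -0.5
print(nmse_fitness([[1, 0], [0, 1]], [[1, 0], [0, 1]]) == 0.0)  # True (perfect)
```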
      </sec>
      <sec id="sec-4-3">
        <title>Federated optimization and avoiding overfitting</title>
        <p>In our setup, the generation of individuals – that is, the selection, crossover and mutation – happens at a centralized location, at a parameter server. The connected nodes of the system participate in the optimization by evaluating the proposed candidate models. The fitness of an individual can, in theory, be calculated as a weighted average over the local fitness values; during training, though, as we will see, we should not use this measurement, in order to prevent overfitting.</p>
        <p>
          Avoiding overfitting has been studied in [
          <xref ref-type="bibr" rid="ref22 ref25 ref26 ref27">22, 25, 26, 27</xref>
          ], as discussed in Section 2.5. The main idea is that we must not use the entire training set for the whole duration of the training. Instead, most articles propose using subsets of the training data in each generation. The training subset can be changed every generation or kept intact for a few generations. Interestingly, studies show that randomly selecting a single training sample is also very effective, both for convergence and for avoiding overfitting. Another suggested tweak is to include the full dataset every once in a while.
        </p>
        <p>Due to the distributed nature of the problem, it was natural to incorporate the native data partitioning of the federated setup and to do the subset selection at a higher level, treating the nodes, rather than individual data points, as the units of subset creation. Thus, in each generation, the Federated Neuroevolution algorithm selects a subset of the nodes to evaluate the fitness of the current generation. To assemble the evaluation sets, we tried the following three approaches:
1. Random single element for each generation: in each iteration we ask a randomly selected node to evaluate the population's fitness on a single randomly selected training sample of its own.
2. Random subset for each generation: in this approach a random subset of nodes is selected to
evaluate. We found this method to be the most efficient.
3. Moving window subset for each generation: here we first order the nodes and then select a slice of the list of nodes. This is the window, and every n generations we move the window one position to the right.</p>
        <p>The second approach, randomly selecting a subset of nodes, performed best. Even though, according to the literature, method 1 works quite well, in our experiments the training did not converge at all with it. The third method seemed more promising: training converged, but more slowly than with the second method, and validation accuracy capped at around 75%.</p>
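<p>The three node-selection strategies can be sketched as follows (the function signature and the window mechanics are our reading of the description above):</p>

```python
import random

def select_nodes(nodes, method, generation, subset_size=1, window_step=1):
    """Pick which client nodes evaluate the current generation's fitness."""
    if method == "single":            # 1. one random node per generation
        return [random.choice(nodes)]
    if method == "random_subset":     # 2. random subset (worked best for us)
        return random.sample(nodes, subset_size)
    if method == "moving_window":     # 3. ordered slice shifted every n gens
        start = (generation // window_step) % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(subset_size)]
    raise ValueError(f"unknown method: {method}")

nodes = list(range(10))
print(select_nodes(nodes, "moving_window", generation=0, subset_size=3))
# [0, 1, 2]
print(select_nodes(nodes, "moving_window", generation=5, subset_size=3,
                   window_step=5))
# [1, 2, 3]
```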
        <p>The main algorithm Using the fitness evaluation methods described above, the main run of the optimization proceeds as follows:
• Validation: on the server, we retain a validation set, and in each generation we calculate and store the validation accuracy of the fittest model of the current generation. This is not far-fetched, as we can assume that in a federated setting the server driving the learning would already have a dataset of its own.
• Avoiding critical points: based on the history of validation accuracies, we check the last n entries for a match with the current validation accuracy. If there is a match, we conclude that the evolution has reached some critical point of the fitness function, such as a local maximum or a saddle point. That is, however we combine and mutate the individuals of subsequent generations, the fitness/accuracy does not increase. Our hypothesis is that in this case the population is stuck in a higher region of the fitness function, and in the neighbourhood defined by our mutation rate the offspring cannot find any direction of increase. In this case we start gradually increasing the mutation rate and the mutation chance multiplier, which is initially set to 1. Once the algorithm is out of the local maximum, we reset the mutation rate and mutation chance to their original values. There is an upper bound on the mutation multiplier.
• Early stopping: we save the fittest model of each generation as an additional means to stop before we overfit.</p>
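<p>The critical-point escape can be sketched as a small controller over the validation history (the exact matching rule and the step sizes are our assumptions):</p>

```python
def adjust_multiplier(history, multiplier, n=5, step=0.5, cap=5.0):
    """If the current validation accuracy matches any of the previous n
    entries, assume a critical point and raise the mutation multiplier
    (bounded by cap); otherwise reset it to its initial value of 1."""
    if len(history) >= 2 and history[-1] in history[-(n + 1):-1]:
        return min(multiplier + step, cap)
    return 1.0

history, mult, mults = [], 1.0, []
for acc in [0.60, 0.65, 0.65, 0.65, 0.70]:   # a plateau at 0.65, then escape
    history.append(acc)
    mult = adjust_multiplier(history, mult)
    mults.append(mult)
print(mults)  # [1.0, 1.0, 1.5, 2.0, 1.0] - raised on the plateau, reset after
```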
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We ran the described evolutionary algorithm for 5000 generations (Figure 2). In our setup, we observed that convergence was slow but steady overall.</p>
      <p>Table 1. Results of the evolutionary run. Validation accuracy: minimum 48.50%, maximum 85.28%. Fitness (NMSE): minimum −0.3297, maximum −0.0903.</p>
      <p>From a fully random state, the algorithm was able to reach 85% validation accuracy, as seen in Table 1. This is, of course, well below the baseline, but still a good result considering that neuroevolution is used for training the weights, which is not the strongest method for training NNs.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future work</title>
      <p>In this paper we described our experiments with a simple
method, which we call Federated Neuroevolution (FNE):
an application of EA adapted for FL of NNs.</p>
      <p>We found that our method is applicable to the studied
scenario, yielding some advantages over traditional FL
methods.</p>
      <p>
        An advantage of EA, compared to the gradient-based
algorithms originating from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] or [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], is that it
requires even less client-to-server data transfer. While
FedAvg exposes the client-side data distribution and the
gradients during learning, FNE only exposes the number of
data points at each client and an abstract fitness value of
the model.
      </p>
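      <p>To make the difference concrete, the per-round client uploads can be contrasted as in the following sketch; the function and field names are illustrative, not part of either protocol:

```python
def fedavg_upload(updated_weights, n_samples):
    # FedAvg-style round: the full locally updated weight vector (or its
    # update) leaves the client, alongside the local sample count.
    return {"weights": list(updated_weights), "n_samples": n_samples}

def fne_upload(candidate_weights, local_x, local_y, loss_fn):
    # FNE round: the candidate arrives from the server; only a scalar
    # fitness value (e.g. NMSE on the local data) and the sample count
    # are sent back, so the data and its distribution stay on the client.
    fitness = loss_fn(candidate_weights, local_x, local_y)
    return {"fitness": float(fitness), "n_samples": len(local_x)}
```

The FNE payload is constant-size regardless of model size, whereas the FedAvg payload grows with the number of weights.</p>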
      <p>The clear disadvantage is that convergence is much
slower. We needed 5000 iterations of the algorithm to
reach 85% accuracy, which is still less than the
baseline’s 95%. At this point, though, our purpose was merely
to demonstrate the feasibility of derivative-free learning of
NNs in an FL scenario.</p>
      <p>In summary, the technique we introduced trades off
learning speed for privacy gains. We may need many
communication rounds, which can be problematic in a real-world
setting of mobile users; but for some use cases, such as
data from medical institutions, the number of
communication rounds is not of primary importance, while keeping data
private is essential. Another interesting aspect of techniques similar
to FNE is that there is no traditional,
backpropagation-based learning, so on the client
side we can save this rather expensive stage of the
learning process.</p>
      <p>In the future, we see several possible
directions for developing FNE into a practical method. First, the
rather poor performance of the system might be improved
by experimenting with different sub-methods
(selection, crossover, etc.).</p>
      <p>Following the trends in genetic algorithms, the search
space could be extended to the network architecture as well.
This way we could reduce the bias and variance introduced
by the model architecture, which is chosen rather blindly in
the initialization phase of the learning.</p>
      <p>
        Bearing in mind the main purpose of the experiments,
that is, preventing the communication of gradients, a range of
derivative-free methods is available, such as Differential
Evolution [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], Particle Swarm Optimization [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] or other
biology-inspired methods like Artificial Bee Colony [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ].
Similarly, advanced optimization methods such as CMA-ES [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]
might be applied.
      </p>
      <p>It could also be interesting to experiment with more
efficient utilization of resources, since in the current setup,
in each round, the vast majority of nodes are idle.</p>
      <p>Acknowledgements
EFOP-3.6.3-VEKOP-16-201700001: Talent Management in Autonomous Vehicle
Control Technologies - The Project is supported by the
Hungarian Government and co-financed by the European
Social Fund.</p>
      <p>Supported by Telekom Innovation Laboratories
(TLabs), the Research and Development unit of Deutsche
Telekom.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Konečný</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Richtárik</surname>
          </string-name>
          , “
          <article-title>Federated optimization: Distributed machine learning for on-device intelligence</article-title>
          ,
          <source>” arXiv preprint arXiv:1610.02527</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H. B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hampson</surname>
          </string-name>
          et al., “
          <article-title>Communication-efficient learning of deep networks from decentralized data</article-title>
          ,
          <source>” arXiv preprint arXiv:1602.05629</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Monteleoni</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Sarwate</surname>
          </string-name>
          , “
          <article-title>Differentially private empirical risk minimization</article-title>
          ,
          <source>” Journal of Machine Learning Research</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>Mar</issue>
          , pp.
          <fpage>1069</fpage>
          -
          <lpage>1109</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roth</surname>
          </string-name>
          et al., “
          <article-title>The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science</article-title>
          , vol.
          <volume>9</volume>
          , no.
          <issue>3-4</issue>
          , pp.
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          , and L. Zhang, “
          <article-title>Deep learning with differential privacy,”</article-title>
          <source>in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Yuille</surname>
          </string-name>
          , “Genetic cnn,”
          <source>in Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1379</fpage>
          -
          <lpage>1388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fernando</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , “
          <article-title>Hierarchical representations for efficient architecture search</article-title>
          ,
          <source>” arXiv preprint arXiv:1711.00436</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Desell</surname>
          </string-name>
          , “
          <article-title>Large scale evolution of convolutional neural networks using volunteer computing,”</article-title>
          <source>in Proceedings of the Genetic and Evolutionary Computation Conference Companion. ACM</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vidnerová</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Neruda</surname>
          </string-name>
          , “
          <article-title>Asynchronous evolution of convolutional networks</article-title>
          ,”
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and G. G. Yen, “
          <article-title>Automatically designing cnn architectures using genetic algorithm for image classification</article-title>
          ,
          <source>” arXiv preprint arXiv:1808.03818</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Whitley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Starkweather</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bogart</surname>
          </string-name>
          , “
          <article-title>Genetic algorithms and neural networks: Optimizing connections and connectivity,” Parallel computing</article-title>
          , vol.
          <volume>14</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>347</fpage>
          -
          <lpage>361</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          , “
          <article-title>Evolving artificial neural networks</article-title>
          ,
          <source>” Proceedings of the IEEE</source>
          , vol.
          <volume>87</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>1423</fpage>
          -
          <lpage>1447</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Miikkulainen</surname>
          </string-name>
          , “
          <article-title>Efficient evolution of neural network topologies,” in Proceedings of the 2002 Congress on Evolutionary Computation</article-title>
          .
          <source>CEC'02 (Cat. No. 02TH8600)</source>
          ,
          <source>vol. 2</source>
          . IEEE,
          <year>2002</year>
          , pp.
          <fpage>1757</fpage>
          -
          <lpage>1762</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gauci</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , “
          <article-title>Generating large-scale neural networks through discovering geometric regularities,” in Proceedings of the 9th annual conference on Genetic and evolutionary computation</article-title>
          .
          <source>ACM</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>997</fpage>
          -
          <lpage>1004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Verbancsics</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          , “
          <article-title>Constraining connectivity to encourage modularity in hyperneat,” in Proceedings of the 13th annual conference on Genetic and evolutionary computation</article-title>
          .
          <source>ACM</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1483</fpage>
          -
          <lpage>1490</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Verbancsics</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Harguess</surname>
          </string-name>
          , “
          <article-title>Generative neuroevolution for deep learning</article-title>
          ,
          <source>” arXiv preprint arXiv:1312.5355</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Maniezzo</surname>
          </string-name>
          , “
          <article-title>Genetic evolution of the topology and weight distribution of neural networks</article-title>
          ,
          <source>” IEEE Transactions on neural networks</source>
          , vol.
          <volume>5</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Leung</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. K.-S.</given-names>
            <surname>Tam</surname>
          </string-name>
          , “
          <article-title>Tuning of the structure and parameters of neural network using an improved genetic algorithm,”</article-title>
          <source>in IECON'01. 27th Annual Conference of the IEEE Industrial Electronics Society (Cat. No. 37243)</source>
          , vol.
          <volume>1</volume>
          . IEEE,
          <year>2001</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.-T.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Chou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.-K.</given-names>
            <surname>Liu</surname>
          </string-name>
          , “
          <article-title>Tuning the structure and parameters of a neural network by using hybrid taguchi-genetic algorithm</article-title>
          ,
          <source>” IEEE Transactions on Neural Networks</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bersini</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Seront</surname>
          </string-name>
          , “
          <article-title>In search of a good crossover between evolution and optimization</article-title>
          ,” Männer and Manderick
          , vol.
          <volume>1503</volume>
          , pp.
          <fpage>479</fpage>
          -
          <lpage>488</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Koza</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Koza</surname>
          </string-name>
          ,
          <article-title>Genetic programming: on the programming of computers by means of natural selection</article-title>
          . MIT press,
          <year>1992</year>
          , vol.
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Langdon</surname>
          </string-name>
          , “
          <article-title>Minimising testing in genetic programming</article-title>
          ,
          <source>” RN</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>10</issue>
          , p.
          <fpage>1</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gathercole</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ross</surname>
          </string-name>
          , “
          <article-title>Dynamic training subset selection for supervised learning in genetic programming</article-title>
          ,
          <source>” in International Conference on Parallel Problem Solving from Nature</source>
          . Springer,
          <year>1994</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          , “
          <article-title>Reducing overfitting in genetic programming models for software quality classification</article-title>
          ,” in
          <source>Eighth IEEE International Symposium on High Assurance Systems Engineering</source>
          ,
          <year>2004</year>
          . Proceedings. IEEE,
          <year>2004</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Melo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Carreiras</surname>
          </string-name>
          , “
          <article-title>Random sampling technique for overfitting control in genetic programming</article-title>
          ,
          <source>” in European Conference on Genetic Programming</source>
          . Springer,
          <year>2012</year>
          , pp.
          <fpage>218</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Silva</surname>
          </string-name>
          , “
          <article-title>Experiments on controlling overfitting in genetic programming</article-title>
          ,
          <source>” in 15th Portuguese conference on artificial intelligence (EPIA</source>
          <year>2011</year>
          ),
          <year>2011</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] --, “
          <article-title>Balancing learning and overfitting in genetic programming with interleaved sampling of training data,”</article-title>
          <source>in European Conference on Genetic Programming</source>
          . Springer,
          <year>2013</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Begleiter</surname>
          </string-name>
          .
          <source>EEG Database Data Set</source>
          . [Online]. Available: https://archive.ics.uci.edu/ml/datasets/eeg+database
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Snodgrass</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Vanderwart</surname>
          </string-name>
          , “
          <article-title>A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity.” Journal of experimental psychology: Human learning and memory</article-title>
          , vol.
          <volume>6</volume>
          , no.
          <issue>2</issue>
          , p.
          <fpage>174</fpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Schirrmeister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Springenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D. J.</given-names>
            <surname>Fiederer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Glasstetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Eggensperger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tangermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Burgard</surname>
          </string-name>
          , and T. Ball, “
          <article-title>Deep learning with convolutional neural networks for eeg decoding and visualization,” Human brain mapping</article-title>
          , vol.
          <volume>38</volume>
          , no.
          <issue>11</issue>
          , pp.
          <fpage>5391</fpage>
          -
          <lpage>5420</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          , “
          <article-title>Adadelta: an adaptive learning rate method</article-title>
          ,
          <source>” arXiv preprint arXiv:1212.5701</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gad</surname>
          </string-name>
          . (
          <year>2019</year>
          )
          <article-title>Artificial Neural Networks Optimization using Genetic Algorithm with Python</article-title>
          . [Online]. Available: https://towardsdatascience.com/artificial-neural-networks-optimization-using-genetic-algorithm-with-python-1fe8ed17733e
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Oullette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Browne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hirasawa</surname>
          </string-name>
          , “
          <article-title>Genetic algorithm optimization of a convolutional neural network for autonomous crack detection,” in Proceedings of the 2004 congress on evolutionary computation (IEEE Cat</article-title>
          .
          <source>No. 04TH8753)</source>
          ,
          <source>vol. 1</source>
          . IEEE,
          <year>2004</year>
          , pp.
          <fpage>516</fpage>
          -
          <lpage>521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Jozefowicz</surname>
          </string-name>
          , “
          <article-title>Revisiting distributed synchronous sgd</article-title>
          ,
          <source>” arXiv preprint arXiv:1604.00981</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ilonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-K.</given-names>
            <surname>Kamarainen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lampinen</surname>
          </string-name>
          , “
          <article-title>Differential evolution training algorithm for feed-forward neural networks</article-title>
          ,
          <source>” Neural Processing Letters</source>
          , vol.
          <volume>17</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>105</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Garro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sossa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Vazquez</surname>
          </string-name>
          , “
          <article-title>Design of artificial neural networks using a modified particle swarm optimization algorithm</article-title>
          ,” in
          <source>2009 International Joint Conference on Neural Networks. IEEE</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>938</fpage>
          -
          <lpage>945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Garro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sossa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Vázquez</surname>
          </string-name>
          , “
          <article-title>Artificial neural network synthesis by means of artificial bee colony (ABC) algorithm</article-title>
          ,” in
          <source>2011 IEEE Congress of Evolutionary Computation (CEC). IEEE</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Koumoutsakos</surname>
          </string-name>
          , “
          <article-title>Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES)</article-title>
          ,”
          <source>Evolutionary Computation</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>