Evolutionary federated learning on EEG-data

                                             Gábor Szegedi, Péter Kiss, and Tomáš Horváth

                                             Department of Data Science and Engineering
                                       ELTE – Eötvös Loránd University, Faculty of Informatics
                                   Budapest, H-1117 Budapest, Pázmány Péter sétány 1/C., Hungary
                                wayasam@gmail.com, {peter.kiss, tomas.horvath}@inf.elte.hu

Abstract: With the spread of digitalization across every                     • Massively Distributed Data points are stored across
aspects of society and economy, the amount of data gen-                        a large number K of nodes. In particular, the number
erated keeps increasing. In some domains, this generation                      of nodes can be much bigger than the average number
happens in such a massively distributed fashion that poses                     of training examples stored on a single node (n/K).
challenges for even collecting the data to build machine
learning (ML) models on it, not to mention the process-                      • Non-IID Data on each node may be drawn from a dif-
ing power necessary for training. An important aspect of                       ferent distribution. That is, the data points available
processing information that has been generated at users is                     locally are far from being a representative sample of
privacy concerns, that is, users might be unwilling to ex-                     the overall distribution.
pose anything that would enable one to draw any conclu-                      • Unbalanced Different nodes may vary by orders of
sion regarding to confidential information they possess. In                    magnitude in the number of training examples they
this work, we present a experiment on a genetic algorithm                      hold.
based federated learning (FL) algorithm, that reduces the
data transfer from individual users to the learner to a single               Formally, we have K nodes and n data points, a set of
fitness value.                                                            indices Pk (k ∈ {1, . . . , K}) of data stored at node k, and
                                                                          nk = |Pk | is the number of data points at Pk . We assume
                                                                          that Pk ∩ Pl = 0/ whenever l 6= k, thus ∑Kk=1 nk = n.
1    Introduction                                                                                                                      def
                                                                             We can then define the local loss for node as Fk (w) =
                                                                           1
The paradigm of federated learning (FL) [1] addresses                     nk ∑i∈Pk f i (w), where f i (w) is the loss of our model at the
the more and more timely scenario in which the data to                    ith training example, given the parametrisation w. Thus
be processed is generated in a massively distributed en-                  the problem to be minimized will become:
vironment, where traditional approaches for building ma-                                                    K
chine learning (ML) models become extremely challeng-                                                          nk
                                                                                         minw∈Rd f (w) = ∑        Fk (w).             (1)
ing, mostly in logistical point of view. That is, when data                                                k=1 n
is generated at client devices as mobile phones, tablets or                  To solve the learning problem (6) for neural networks
smart watches, collecting, storing and processing all these               (NNs), the mainstream way is – starting from a common
information in data centers might be difficult task (aggre-               initial parametrisation – to train local models using some
gation problem) and, according to the idea of FL, not nec-                version of gradient descent methods, then aggregate the
essary by all means.                                                      local model updates (e.g. gradients), or equivalently the
   Another problem regarding the traditional data center                  local models themselves to update the global model. The
based solution is privacy concerns. It might happen that                  global model then will be sent back to the worker nodes.
users of the applications that build on centralized model                 Algorithm 1 is one of the most successful algorithms for
training are reluctant to share their possibly confidential               federated NN training, called FederatedAveraging [2].
data. We believe a particularly fitting scenario for this                    FederatedAveraging works pretty well solving the ag-
problem is the use case of medical applications. Each                     gregation problem, however using gradients or, equiva-
medical institute might have a lot of patient data, but that              lently, the local models for the global aggregation step still
may be far from enough to train their own prediction mod-                 exposes some information on users data. To address pri-
els. Here, sharing the data across a big number of institute              vacy concerns, the solution is usually to apply achieve-
can yield a great help in developing automated diagnostic                 ments of differential privacy[3][4][5] atop the gradient
tools. But being the private nature of these data, hospitals              based learning process.
probably decide not to share anything of this information                    In this paper we present a slightly different approach,
either to protect their reputation or due to legal regulations.           namely, we investigated whether it is possible to train NNs
   As it is summarized in [1], the characteristics of data                in a federated fashion without using gradient in any con-
that FL is concerned with can be described as follows:                    text. To approach the problem, it seemed to be a simple
     Copyright ©2019 for this paper by its authors. Use permitted under   choice to try evolutionary algorithms. Since a rich liter-
Creative Commons License Attribution 4.0 International (CC BY 4.0).       ature is already available on evolutionary optimization of
Algorithm 1 FederatedAveraging                                       EAs – as nature inspired methods in general – are often
 1: procedure S ERVER                                             used to discover very complex, high dimensional and/or
 2:    initialize w0                                              non-convex search problems, therefore, attempts to apply
 3:    for t = 0; 1; 2; ... do                                    these methods on optimizing NNs has a long history.
 4:         m ← max(C · K, 1)                                        Recently, nature-inspired methods in relation with NNs
 5:         St ← m client nodes randomly                          are used mostly for hyper-parameter tuning that includes
 6:         for all k ∈ St in parallel do                         searching for an efficient architecture.
 7:               k
                wt+1  ← ClientUpdate(k, wt )                         A big part of this rich literature is concerned specifi-
 8:         end for                                               cally CNN-s, what we apply for our problem. Methods of
 9:         wt+1 = ∑Kk=1 nnk wt+1
                               k                                  Genetic CNN [6], hierarchical evolution [7], large- scale
10:    end for                                                    evolution [8], asynchronous CNN evolution [9] and auto-
11: end procedure                                                 matic CNN design [10] give graph based methods to de-
12: procedure C LIENT U PDATE (k, w)                              sign automatically the stack convolutional layers (skipping
13:    B ← split Pk to set of batches                             potentially fully connected parts of the network) for image
14:    for all b ∈ B do                                           classification trough genetical evolution of subsequent lay-
15:         w ← w − η∇ f (w, b). gradient of loss on batch b      ers with various innovative encoding techniques.
16:    end for                                                       In these scenarios, the learning itself is still based on
17:    return W                                                   calculating the gradient and updating the model according
18: end procedure                                                 to that (backpropagation).
                                                                     Using backpropagation though being based on calcula-
                                                                  tion of gradients and on applying them on the weights of
NNs, we only transfer this knowledge into the federated
                                                                  the network is exactly what we want to avoid in our exper-
environment.
                                                                  iment. Before the monocracy of derivative based train-
   For the concrete task to be solved by our method we
                                                                  ing algorithms however biology-inspired training meth-
have chosen classfication of EEG-signals using convolu-
                                                                  ods was a rather popular research topic, thus there is a
tional neural networks (CNNs).
                                                                  rich, though a bit dated literature concerned with our con-
   The main contributions of this paper are
                                                                  strained problem. [11] and [12] give a summary of the
    1. a proof of concept for applicability of genetic algo-      these initial approaches to neuroevolution (NE).
       rithms to federated training of NNs without using vul-        There is a very interesting branch of applications of NE
       nerable gradients;                                         for general NN-s, that includes techniques to purely genet-
                                                                  ically train the architecture along with the weights of the
    2. presenting Federated Neuroevolution (FNE) a simple
                                                                  networks. Among the most important algorithms that be-
       algorithm for the federated training, applying a dis-
                                                                  long here it might be worth to mention NeuroEvolution of
       tributed fitness function.
                                                                  Augmenting Topoplogies (NEAT) [13], and Hypercube-
                                                                  based NEAT (HyperNEAT) [14] and its specializattion for
2     Neuroevolution                                              modular evolution of NN-s, HyperNEAT-LEO [15] and
                                                                  Generative NE [16]. Despite of the power of HyperNEAT,
Evolutionary algorithms (EAs) follow the pattern of evo-          we decided first to focus on training a predefined architec-
lution as it is observed by biologists in the nature. Ac-         ture, thus our method is based on more "traditional" NE
cording to this, in an infinite cycle of life, the most apt in-   algorithms.
dividuals can produce offsprings possessing a potentially            For applying evolutionary approach on an issue, one
slightly changed (mutated) mixture of their genoms that           need to specify an encoding of the problem, a selection, a
might result an enhanced ability to face challenges in their      crossover and a mutation method as well as a fitness func-
life. The main assumption in biology is that those individ-       tion.
uals survive and create descendants with a bigger chance,            In the rest of this section we shortly describe the stages
who, in some aspects are superior to the others. The main         of an EA along with a couple of examples of how these
structure of an EA is sketched in Algorithm 2                     stages have been implemented in some work on the field
                                                                  of NE, that gave inspiration to our algorithm. At the end of
Algorithm 2 EA                                                    the section we also describe approaches aiming at handle
 1: generate an initial population G0 , i = 0                     overfitting that has been proven a serious problem in NE.
 2: repeat
 3:    ∀individual j ∈ Gi : f j = f itness(individual j )
 4:    select parents from Gi based on their fitness              2.1   Encoding
 5:    produce offspring generation Gi+1                          Genetic algorithms work on sequence of features that
 6:    ∀individual j ∈ Gi+1 : individual j ←                      would be mixed, or altered according some granularity de-
    mutate(individual j )                                         fined over them. Thus the first step in solving a problem
 7: until termination criterion is satisfied
                                                                  genetically is to provide a description of the search space.
We refer to this description as encoding, that can be direct               2.3   Crossover
or indirect.
                                                                           Crossover is a method that defines, how we combine in-
                                                                           dividuals of a generation to create offsprings for the next
Direct encoding is the more traditional way of problem                     generations. One simple way is – as in [17] – to com-
encoding, where sections of the genom more or less corre-                  bine the parts of parent individuals at some cutting points.
spond to specific parameters. Some of the early methods                    Another approach is presented in [18], where crossover is
hanlde some switches as well, that control the connectivity                actually taking the average of the corresponding weights
of the specific perceptrons.
                                                                           of the two individuals: x(t+1) = x1 (t)+x
                                                                                                                  2
                                                                                                                     2 (t)
                                                                                                                           where x(t)s are
   [17] proposes a system based on a parallel genetic al-                  the individuals represented as vectors of parameters.
gorithm, ANNA ELEONORA, for learning both topology
and for connection weights. Topology Utilizes binary rep-
resentation of networks, with granularity encoding that is                 2.4   Mutation
handled through one bit flag to determine connectivity,
                                                                           Mutation methods serve for adding extra variance to the
that is, whether the given edge is present in the recent setup
                                                                           individual genoms to enable them to discover a bigger part
or not, followed by the substring of weights. These sub-
                                                                           of the search space. [17] provides a representation that
strings are ordered in a way that connections into the same
                                                                           translates different topologies and encoding length into a
neuron are grouped together.
                                                                           common string format granting compositionally different
   [18] presents a variant of EA applied immediately for                   descriptions. At mutations, it applies three separate prob-
float weights. The input of the EA is a vector x of variables              abilities for swapping bits such that granularity bits, con-
, that are the parameters of the model (that is the weights                nectivity bit, and weight bits. For effectively explore the
of the connections), the biases, and the newly invented                    search space, it uses EA simplex [20], instead of taking
link switches. Link switches are variables, that control the               three populations and creating a fourth based on those.
connectivity of the network, that is, negative value rep-                     In [18], where the possible values for genes are con-
resents that the edge is switched off. The search space                    strained, mutation is carried out according to the follow-
is constrained by upper and lower bounds on variables                      ing formula: x(t+1) = x(t) + Bδ , B ∈ R2 is a diagonal ma-
(weights): x ∈ I1 × I2 × · · · × Id , where Ii = [li , ui ], li , ui ∈ R                                                  (t+1)
                                                                           trix, with a diagonal Bii ∈ {0, 1}, and li ≤ xi      + δi ≤ u i .
for i = 1, 2, . . . , d.
                                                                           Based on this rule, the algorithm generates three individ-
   Theoretically using the connectivity features of the en-
                                                                           uals/chromosomes: at the first only one element of the di-
coding the first method is able to evolve the architecture
                                                                           agonal of B can be one, at the second one a random num-
too. The issue with this approach to encoding is that the
                                                                           ber of diagonal of elements, and at the last, Bii = 1 for all
problem space grows very fast as we scale up the network
                                                                           i = 1, . . . , d. The one with the best fitness of these three
(which we need if we want to solve complex problems).
                                                                           will replace the weakest one in the next population.

Indirect Encoding The scaling problem of Direct Encod-
ing can be solved with Indirect Encoding, that instead of                  2.5   Overfitting
separated representation of model parameters uses genera-                  Using EA usually involves a high computation demand,
tive information. In HyperNEAT [14], which is maybe the                    which can be reduced through decreasing the number of
most important representative of this class, genes of the                  evaluations of the model, that is the size of training data
genom are defining functions based on which weights can                    on which we want to try out the models defined by a given
be generated.                                                              generation of the genetic algorithms.
                                                                              Earlier applications of EA usually did not use separation
                                                                           of data into training and test set (like [21], for example).
2.2   Fitness
                                                                           Practitioners soon realized, however, that models trained
                                                                           this way perform poorly on not seen data points, revealing
The fitness function serves to specify how well a given in-
                                                                           the tendency of evolutionary methods to strongly overfit
dividual performs on the problem to be solved. A higher
                                                                           on the training problems. This issue got in the center of
value of the fitness function means a better solution for the
                                                                           interest, when as an attempt to reduce run time they tried
problem, while lower fitness value reports a poor perfor-
                                                                           to use subsets of the training data to evaluate individuals.
mance. Fitness is often normalized thus a function that
                                                                              [22] made comprehensive experiments proving that evo-
produces a fitness value 1 for a perfect solution, and 0 for
                                                                           lution is potentially able to extrapolate from the randomly
completely wrong setup can work well. As an example
                                                                           chosen test sets. A very promising direction to reduce
for a normalized fitness in ML scenarios, [19] proposes a
                                                 1                         overfitting is random sampling, where at each generation,
fitness function for NN defined as fnorm = 1+err    , where
                                                                           a random subset of the training data is chosen and evo-
               ∑d   |y −ŷ |
err = ∑m k=1
              i=1 i
                 md
                    i
                      , with d denoting the output dimen-                  lution is performed based on the fitness on that sample.
sion, and m the number of examples, applying mean abso-                    The Random Sampling Technique (RST) [23] was origi-
lute error.                                                                nally used for speeding up the GE runs in [24], however, it
was already used for preventing overfitting. [25] and [26]
drive some experiments on RST, where they were testing                        1.00
two parameters, the Random Subset Size (RSS) and the
                                                                              0.95
Random Subset Reset (frequency of changing the subset).
They have found, interestingly, that the techniques per-                      0.90
                                                                              0.85


                                                                   Accuracy
forms best when both these values are set to one, that is
in each iteration the fitness should be tested using a new                    0.80
randomly chosen data point.                                                   0.75
   In [27], the authors present versions of “interleaved                      0.70
sampling”, that means, instead of random subsets, fitness                                                                    train
                                                                              0.65                                           validation
at each round is evaluated alternating between one and all
                                                                              0.60
training samples, with various switching frequencies. As                             0 10 20 30 40 50 60 70 80 90 100
a result, they find that, on their test datasets, the best tech-                                 Epochs
nique would be to switch in each round between single
sample and all sample evaluation.
                                                                                 Figure 1: Baseline accuracy (backpropagation)
3     The problem
                                                                     Selection The candidate set of individuals for crossover is
3.1   Data
                                                                     created by sorting the current generation’s models based
For the experiment, we used the EEG Database Data Set                on their fitness and selecting the n − 1 fittest models for
[28]. The dataset contains 120 EEG trial data about 122              crossover. The last parent selected for mating is not among
patients who either belong to the alcoholic or to the control        the fittest ones, but chosen randomly from the rest, to add
group. In each trial, the patients were shown one or two             more variance.
images of the Snodgrass and Vanderwart picture set [29].
After showing them the stimuli, their brain activation was
measured for 1 second on 64 points at 256 Hertz. The                 Crossover functions Crossover method defines the way
measurements are then labelled according to which group              according to which new individuals will be generated from
they belong to, thus the task of the model to be built is to         the parent generation.
predict which class of the two does a sample belong to.                 In our method, we pick two parents randomly from the
                                                                     pool of parents to produce the required offspring amount.
                                                                        We run experiments with four crossover methods. The
3.2   Network architecture
                                                                     first three require flattening the vector of weights. These
For the network architecture to train, we decided to use             first three approaches are rather popular in EA research.
the shallow convolutional network from [30], that has been
designed specifically for EEG based multiclass prediction                     • Halving mix: In this approach, that is a simplified
problems. The essence of the network is three convolu-                          version of the one in [17], the vector of values from
tional layer that are intended to recognize specific patterns                   the parents are taken to create the offspring vector
in the signals. After two convolutional layers there is a                       by taking the first half of it from the first parent and
pooling layer and then comes the third convolutional layer.                     the second half from the second parent. This was the
On the output of this layer we applied batch normalization,                     original approach in [32] too. Formally:
then added the output dense layer with sigmoid activation.                                                     (
   For the control experiment, we used the AdaDelta opti-                                                        ai , if i ≤ n/2
mizer [31] with Categorical Cross-Entropy loss function.                                {o f f springi }ni=1 =                        (2)
                                                                                                                 bi , if i > n/2
At training we used a batch size of 64, 1.0 as learning rate,
ρ = 0.95, and ε = 10−7 .                                                        where n is the length of the model vectors and a =
   The control model after 100 epochs achieved a valida-                        (a1 , a2 , . . . , an ), b = (b1 , b2 , . . . , bn ) are the parent
tion accuracy of 95% (see Figure 1).                                            vectors.

                                                                              • Interleave mix: Here, the vector of values from the
4     The proposed methods
                                                                                parents are taken to create the offspring vector by in-
The algorithm runs according to the process defined in Al-                      terleaving the two parent vectors. Formally:
gorithm 2. For starting off we create an initial generation                                                (
in which for each individuals initialize the weights of the                                          n      ai , if i mod 2 = 0
                                                                                     {o f f springi }i=1 =                          (3)
models randomly. From the initial generation then we it-                                                    bi , if i mod 2 = 1
erate along the fitness-selection-crossover-mutation loop.
In this section we describe the particular methods we used                      where n is the length of the model vectors and a, b are
for the different stages.                                                       the parent vectors.
   • Mean mix: In this method, similarly to [18], the vec-          • Mutate by multiplication (from [33]): Here,
     tor of values from the parents are taken to create the           we multiply the selected values with a ran-
     offspring vector by taking the mean of the two parent            dom value.          In our implementation, the mul-
     vectors at each index. Formally:                                 tiplication factor was a random value between
                                                                      [ 100−mutation_rate
                                                                              100         , 100+mutation_rate
                                                                                                  100         ].
                                             ai + bi
                  {o f f springi }ni=1 = {           }    (4)
                                                2                After experimenting, we found a lot better convergence
                                                                 rate with the second approach.
      where n is the length of the model vectors and a, b are
      the parent vectors.
                                                                 4.2   Federated fitness function
   • Kernelwise mix: In this algorithms we used a more
     coarse units for the crossover. In each convolutional       For fitness, which should be maximized during the evo-
     layer there are multiple kernels/filters that hold key      lutionary training, we have chosen the Negative Mean
     pattern information. Similarly, in case of fully con-       Squared Error (NMSE), that is defined as in Equation (5).
     nected layers, input weights belonging to a single
     neurons describes some pattern in the previous lay-                                       1 n d (i)          (i)
                                                                           fNMSE (w) = −1 ∗      ∑   ∑   (ŷ j − y j )2    (5)
     ers. These information portions are kept intact dur-                                      n i=1 j=1
     ing crossover. The offspring model is created by ran-
     domly mixing the kernels inside each layer.                 where ŷ is the predicted output vector using parameters w,
                                                                 y is the target output vector, d is the output dimension and
   In our experiments, the first three crossover methods did     n is the number of examples. This a is slightly different
not converge. This could be because these approaches are         function, than the one in the example in section 2.2, but
very low level and do not care about the network structure       it’s behaviour is the same( ∂∂wi fNMSE (w) ∗ ∂∂wi fnorm (w) >
or the patterns learned in the kernels.                          0, ∀w, i )
   Kernelwise mixing is a higher level approach, what we            Applying the NMSE fitness for the original optimiza-
tried after taking a look at how genetics works in nature. In    tion problem in Equation (6) our task will be to maximize
nature, the heredity is also a higher level mixing of genes,     NMSE with respect to w:
instead of low level mix of organic molecules. Thus, traits
                                                                                                      n
of the parents are kept intact. The resemblance to genetics               maxw∈Rd fNMSE (w) = − ∑ kŷ(i) − y(i) k22 .      (6)
can be summarized as follows: the DNA is the network’s                                              i=1
weights, a gene is a filter and an organic molecule is a float
value. With this latest method, mixing the evolutionary
training was converging so we were applying this in our          4.3   Federated optimization and avoiding overfitting
approach.
                                                                 In our setup the generation of individuals, that is the selec-
                                                                 tion, the crossover and mutation happens at a centralized
4.1   Mutation functions                                         location at a parameter server. The connected nodes of the
                                                                 system participate in the optimization through evaluating
Crossover on its own results in generations that are only        the different proposed setup. The fitness of an individual
combinations of the initial generation according to the de-      can be calculated as a weighted average over the local fit-
fined rules. Thus using merely crossover restrict the space      ness values, in theory, during the training tough, as we will
searched by the algorithm. To break this random alter-           see we should not use this measurement to prevent overfit-
nations of the offspring are applied in form of mutation         ting.
functions.                                                          Avoiding overfitting has been studied in [25, 22, 26, 27],
   For defining a mutation function we must define the           as it is discussed in Section 2.5. The main idea is that we
number of mutated values and the scale of the mutation on        must not include the entire training set in the whole dura-
these values. For the former we used probabilistic value         tion of the training. Instead, what most articles propose, is
determining the chance of mutation for each value in the         to use subsets of the training data in each generation. The
model. The latter is a float value determining how much is       training subset can be changed every generation or kept in-
the impact on each mutating value.                               tact for a few generations. Studies interestingly show that
   There are the following two main approaches we tried          randomly selecting a single training sample is also very
for mutating values in a network:                                effective both for convergence and avoiding over-fitting.
                                                                 Another suggested tweak is to include the full dataset ev-
   • Mutate by offset (from [32]): Here, we add a ran-           ery once in a while.
     dom value to the selected values. In our imple-                Due the distributed nature of the problem, it was a rather
     mentation, the offset was a random value between            natural idea to incorporate the native data partitioning of
     [−mutation_rate, mutation_rate].                            the federated setup, and to do the subset selection at a
higher level and treat the nodes as units of the subset cre-                 • Early stopping: We save the fittest model of each
ation, instead of specific data points. Thus, in each gen-                     generation, as an additional means to stop before we
eration, the Federated Neuroevolution algorithm selects a                      overfit.
subset of the nodes for evaluating the fitness of the current
generation. To ensemble the evaluation sets, we have tried
                                                                       5      Results
the following three approaches:
 1. Random single element for each generation: Here,                   We have run the described evolution algorithm for 5000
    in each iteration we ask a randomly selected node to               generations (Figure 2). For our setup, we observed that
    evaluate the population’s fitness on a randomly se-                the convergence was slow but steady, overall.
    lected single training sample of it’s own.
                                                                                                  Minimum value      Maximum value
 2. Random subset for each generation: In this ap-                         Validation Accuracy       48.50%              85.28%
    proach a random subset of nodes are selected to eval-                  Fitness NMSE              −0.3297             −0.0903
    uate. We found this method the most efficient.
 3. Moving window subset for each generation: Here,                    Table 1: Federated Neuroevolution Performance on the
    we first order the nodes and then select a slice of the            EEG Dataset
    list of nodes. This is the window and in every n gen-
    eration we move the window to the right by 1.
   The second approach of randomly selecting a subset of
nodes had the best performance. Even if, according to the                  0.985
literature, method 1 works pretty well, in our experiments,                0.935                                     Accuracy
the training did not converge at all. The third method                     0.885
seemed to be more promising, training convergence was                      0.835
                                                                           0.785
                                                                 Fitness


slower with this method than in the case of the second
                                                                           0.735
method and the convergence also capped around 75% val-                     0.685
idation accuracy.                                                          0.635
                                                                           0.585
The main algorithm Using the fitness evaluation meth-                      0.535
ods we described in Section 4.3, the main run of the opti-                 0.485
mization looks like the following:                                                 0    1000      2000 3000       4000     5000
                                                                                                   Generations
   • Validation: On the server, we retain a validation set
     and in each generation we calculate and store the val-
     idation accuracy of the fittest model of the current
                                                                           −0.05
     generation. This is not far fetched as we can assume                               Fitness
     that in a Federated setting the server driving the learn-             −0.10
     ing would already have a dataset of it’s own.
                                                                           −0.15
   • Avoiding critical points: Based on the history of val-
                                                                 Fitness


     idation accuracies, we check the last n entries for a                 −0.20
     match with the current validation accuracy. If there is
                                                                           −0.25
     a match, we conclude that the evolution has reached
     some kind of critical point of the fitness function as                −0.30
     local maximum or saddle point. That is, however we
     try to combine and mutate the individuals of the sub-                 −0.35
                                                                                   0    1000      2000    3000    4000
     sequent generations, the fitness/accuracy does not in-                                        Generations
     crease. Our hypothesis is, that in this case the pop-
     ulation stuck in a higher region of the values of fit-
     ness function, and in the neighbourhood defined by                Figure 2: Running Federated Neuroevolution on the EEG
     our mutation rate the offsprings cannot find any in-              dataset for 5000 generations. Fitness is NMSE from equa-
     creasing directions. In this case we start gradually              tion 5, Accuracy is validation accuracy.
     increasing the mutation rate and the mutation chance
     multiplier which is initially set to 1. Once the algo-               From a fully random state, the algorithm was able to
     rithm is out of the local maximum, we reset the values            get to 85% validation accuracy as seen in Table 1. This
     of the mutation rate and mutation chance to the orig-             is, of course, a lot less than the baseline but still a good re-
     inal values. There is an upper bound on the mutation              sult considering using Neuroevolution for training weights
     multiplier.                                                       which is not the best method for training NNs.
6   Conclusion and future work                                    Hungarian Government and co-financed by the European
                                                                  Social Fund.
In this paper we described our experiments with a simple            Supported by Telekom Innovation Laboratories (T-
method, what we call Federated Neuroevolution (FNE),              Labs), the Research and Development unit of Deutsche
that is an application of EA adapted for FL of NNs.               Telekom.
   We found that our method is applicable on the studied
scenario yielding some advantages over the traditional FL
methods.                                                          References
   An advantage of EA, compared to the gradient based
algorithms originated from [1], [34] or [2], is that it re-        [1] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik,
quires even less client data transfer to the server. While             “Federated optimization: Distributed machine learning for
                                                                       on-device intelligence,” arXiv preprint arXiv:1610.02527,
FedAVG exposes the client side data distribution and the
                                                                       2016.
gradients during learning, FNE only expose the amount of
                                                                   [2] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al.,
data points of the clients and an abstract fitness number of
                                                                       “Communication-efficient learning of deep networks from
the model.
                                                                       decentralized data,” arXiv preprint arXiv:1602.05629,
   The clear disadvantage is that the convergence is a lot             2016.
slower. We needed 5000 iterations of the algorithm to              [3] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Dif-
get to an 85% accuracy which is still less then the base-              ferentially private empirical risk minimization,” Journal of
line’s 95%. At this point though our purpose was merely                Machine Learning Research, vol. 12, no. Mar, pp. 1069–
to demonstrate the feasibility of derivative-free learning of          1109, 2011.
NN-s in a FL scenario.                                             [4] C. Dwork, A. Roth et al., “The algorithmic foundations of
   In summary, the technique we introduced, trades off                 differential privacy,” Foundations and Trends® in Theoret-
learning speed for privacy gains. We may need a lot of                 ical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
communication rounds which can be bad in a real-world              [5] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan,
setting of mobile users, but for some use cases, like for              I. Mironov, K. Talwar, and L. Zhang, “Deep learning
data from medical institutions, the rounds of communi-                 with differential privacy,” in Proceedings of the 2016 ACM
cation is not of primary importance, while keeping data                SIGSAC Conference on Computer and Communications
privacy is essentials. Another aspect of techniques similar            Security. ACM, 2016, pp. 308–318.
to FNE that might be interesting, is that there is no tra-         [6] L. Xie and A. Yuille, “Genetic cnn,” in Proceedings of the
ditional, backpropagation based learning, that is at client            IEEE International Conference on Computer Vision, 2017,
side we can save this rather expensive stages of the learn-            pp. 1379–1388.
ing process.                                                       [7] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and
   In the future we think there are several possible di-               K. Kavukcuoglu, “Hierarchical representations for efficient
                                                                       architecture search,” arXiv preprint arXiv:1711.00436,
rections to develop FNE to make it practical. First the
                                                                       2017.
rather poor performance of the system might be improved
                                                                   [8] T. Desell, “Large scale evolution of convolutional neu-
through experimenting with different submethods (selec-
                                                                       ral networks using volunteer computing,” in Proceedings
tion , crossover, etc.)
                                                                       of the Genetic and Evolutionary Computation Conference
   Following the trends in genetic algorithms, the search              Companion. ACM, 2017, pp. 127–128.
space could be extended to the network architecture too.           [9] P. Vidnerová and R. Neruda, “Asynchronous evolution of
This way we could reduce the bias and variance introduced              convolutional networks,” 2018.
by the model architecture that is chosen rather blindly at        [10] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Automatically
the initiation phase of the learning.                                  designing cnn architectures using genetic algorithm for im-
   Bearing in mind the main purpose of the experiments,                age classification,” arXiv preprint arXiv:1808.03818, 2018.
that is prevent the communicating the gradients, a range of       [11] D. Whitley, T. Starkweather, and C. Bogart, “Genetic algo-
derivative free methods are available as Differential Evolu-           rithms and neural networks: Optimizing connections and
tion [35], Particle Swarm Optimization[36] or other biol-              connectivity,” Parallel computing, vol. 14, no. 3, pp. 347–
ogy inspired methods like Artifical Bee Colony [37]. Sim-              361, 1990.
ilarly, advanced optimization methods as CMA-ES[38]               [12] X. Yao, “Evolving artificial neural networks,” Proceedings
might be applied.                                                      of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.
   It could be also interesting to experiment with more ef-       [13] K. O. Stanley and R. Miikkulainen, “Efficient evolution
ficient utilization of resources, since in the current setup in        of neural network topologies,” in Proceedings of the 2002
each round the vast majority of nodes is idle.                         Congress on Evolutionary Computation. CEC’02 (Cat. No.
                                                                       02TH8600), vol. 2. IEEE, 2002, pp. 1757–1762.
                                                                  [14] J. Gauci and K. Stanley, “Generating large-scale neural net-
Acknowledgements        EFOP-3.6.3-VEKOP-16-2017-                      works through discovering geometric regularities,” in Pro-
00001: Talent Management in Autonomous Vehicle                         ceedings of the 9th annual conference on Genetic and evo-
Control Technologies - The Project is supported by the                 lutionary computation. ACM, 2007, pp. 997–1004.
[15] P. Verbancsics and K. O. Stanley, “Constraining connectiv-     [31] M. D. Zeiler, “Adadelta: an adaptive learning rate method,”
     ity to encourage modularity in hyperneat,” in Proceedings           arXiv preprint arXiv:1212.5701, 2012.
     of the 13th annual conference on Genetic and evolutionary      [32] A. Gad. (2019) Artificial Neural Networks Op-
     computation. ACM, 2011, pp. 1483–1490.                              timization using Genetic Algorithm with Python.
[16] P. Verbancsics and J. Harguess, “Generative neuroevolution          [Online].      Available:        https://towardsdatascience.
     for deep learning,” arXiv preprint arXiv:1312.5355, 2013.           com/artificial-neural-networks-optimization-\
[17] V. Maniezzo, “Genetic evolution of the topology and                 using-genetic-algorithm-with-python-1fe8ed17733e
     weight distribution of neural networks,” IEEE Transactions     [33] R. Oullette, M. Browne, and K. Hirasawa, “Genetic al-
     on neural networks, vol. 5, no. 1, pp. 39–53, 1994.                 gorithm optimization of a convolutional neural network
[18] H. Lam, S. Ling, F. H. Leung, and P. K.-S. Tam, “Tuning             for autonomous crack detection,” in Proceedings of the
     of the structure and parameters of neural network using an          2004 congress on evolutionary computation (IEEE Cat. No.
     improved genetic algorithm,” in IECON’01. 27th Annual               04TH8753), vol. 1. IEEE, 2004, pp. 516–521.
     Conference of the IEEE Industrial Electronics Society (Cat.    [34] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz,
     No. 37243), vol. 1. IEEE, 2001, pp. 25–30.                          “Revisiting distributed synchronous sgd,” arXiv preprint
[19] J.-T. Tsai, J.-H. Chou, and T.-K. Liu, “Tuning the struc-           arXiv:1604.00981, 2016.
     ture and parameters of a neural network by using hybrid        [35] J. Ilonen, J.-K. Kamarainen, and J. Lampinen, “Differential
     taguchi-genetic algorithm,” IEEE Transactions on Neural             evolution training algorithm for feed-forward neural net-
     Networks, vol. 17, no. 1, pp. 69–80, 2006.                          works,” Neural Processing Letters, vol. 17, no. 1, pp. 93–
[20] H. Bersini and G. Seront, “In search of a good crossover            105, 2003.
     between evolution and optimization,” M anner and Mand-         [36] B. A. Garro, H. Sossa, and R. A. Vazquez, “Design of arti-
     erick, vol. 1503, pp. 479–488, 1992.                                ficial neural networks using a modified particle swarm op-
[21] J. R. Koza and J. R. Koza, Genetic programming: on the              timization algorithm,” in 2009 International Joint Confer-
     programming of computers by means of natural selection.             ence on Neural Networks. IEEE, 2009, pp. 938–945.
     MIT press, 1992, vol. 1.                                       [37] B. A. Garro, H. Sossa, and R. A. Vázquez, “Artificial neural
[22] W. Langdon, “Minimising testing in genetic program-                 network synthesis by means of artificial bee colony (abc)
     ming,” RN, vol. 11, no. 10, p. 1, 2011.                             algorithm,” in 2011 IEEE Congress of Evolutionary Com-
[23] C. Gathercole and P. Ross, “Dynamic training subset se-             putation (CEC). IEEE, 2011, pp. 331–338.
     lection for supervised learning in genetic programming,” in    [38] N. Hansen, S. D. Müller, and P. Koumoutsakos, “Reducing
     International Conference on Parallel Problem Solving from           the time complexity of the derandomized evolution strategy
     Nature. Springer, 1994, pp. 312–321.                                with covariance matrix adaptation (cma-es),” Evolutionary
[24] Y. Liu and T. Khoshgoftaar, “Reducing overfitting in ge-            computation, vol. 11, no. 1, pp. 1–18, 2003.
     netic programming models for software quality classifica-
     tion,” in Eighth IEEE International Symposium on High As-
     surance Systems Engineering, 2004. Proceedings. IEEE,
     2004, pp. 56–65.
[25] I. Gonçalves, S. Silva, J. B. Melo, and J. M. Carreiras,
     “Random sampling technique for overfitting control in ge-
     netic programming,” in European Conference on Genetic
     Programming. Springer, 2012, pp. 218–229.
[26] I. Gonçalves and S. Silva, “Experiments on controlling
     overfitting in genetic programming,” in 15th Portuguese
     conference on artificial intelligence (EPIA 2011), 2011, pp.
     10–13.
[27] ——, “Balancing learning and overfitting in genetic pro-
     gramming with interleaved sampling of training data,” in
     European Conference on Genetic Programming. Springer,
     2013, pp. 73–84.
[28] H. Begleiter. EEG Database Data Set. [Online]. Available:
     https://archive.ics.uci.edu/ml/datasets/eeg+database
[29] J. G. Snodgrass and M. Vanderwart, “A standardized set of
     260 pictures: norms for name agreement, image agreement,
     familiarity, and visual complexity.” Journal of experimental
     psychology: Human learning and memory, vol. 6, no. 2, p.
     174, 1980.
[30] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer,
     M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter,
     W. Burgard, and T. Ball, “Deep learning with convolutional
     neural networks for eeg decoding and visualization,” Hu-
     man brain mapping, vol. 38, no. 11, pp. 5391–5420, 2017.