Evolutionary federated learning on EEG-data Gábor Szegedi, Péter Kiss, and Tomáš Horváth Department of Data Science and Engineering ELTE – Eötvös Loránd University, Faculty of Informatics Budapest, H-1117 Budapest, Pázmány Péter sétány 1/C., Hungary wayasam@gmail.com, {peter.kiss, tomas.horvath}@inf.elte.hu Abstract: With the spread of digitalization across every • Massively Distributed Data points are stored across aspects of society and economy, the amount of data gen- a large number K of nodes. In particular, the number erated keeps increasing. In some domains, this generation of nodes can be much bigger than the average number happens in such a massively distributed fashion that poses of training examples stored on a single node (n/K). challenges for even collecting the data to build machine learning (ML) models on it, not to mention the process- • Non-IID Data on each node may be drawn from a dif- ing power necessary for training. An important aspect of ferent distribution. That is, the data points available processing information that has been generated at users is locally are far from being a representative sample of privacy concerns, that is, users might be unwilling to ex- the overall distribution. pose anything that would enable one to draw any conclu- • Unbalanced Different nodes may vary by orders of sion regarding to confidential information they possess. In magnitude in the number of training examples they this work, we present a experiment on a genetic algorithm hold. based federated learning (FL) algorithm, that reduces the data transfer from individual users to the learner to a single Formally, we have K nodes and n data points, a set of fitness value. indices Pk (k ∈ {1, . . . , K}) of data stored at node k, and nk = |Pk | is the number of data points at Pk . We assume that Pk ∩ Pl = 0/ whenever l 6= k, thus ∑Kk=1 nk = n. 1 Introduction def We can then define the local loss for node as Fk (w) = 1 The paradigm of federated learning (FL) [1] addresses nk ∑i∈Pk f i (w), where f i (w) is the loss of our model at the the more and more timely scenario in which the data to ith training example, given the parametrisation w. Thus be processed is generated in a massively distributed en- the problem to be minimized will become: vironment, where traditional approaches for building ma- K chine learning (ML) models become extremely challeng- nk minw∈Rd f (w) = ∑ Fk (w). (1) ing, mostly in logistical point of view. That is, when data k=1 n is generated at client devices as mobile phones, tablets or To solve the learning problem (6) for neural networks smart watches, collecting, storing and processing all these (NNs), the mainstream way is – starting from a common information in data centers might be difficult task (aggre- initial parametrisation – to train local models using some gation problem) and, according to the idea of FL, not nec- version of gradient descent methods, then aggregate the essary by all means. local model updates (e.g. gradients), or equivalently the Another problem regarding the traditional data center local models themselves to update the global model. The based solution is privacy concerns. It might happen that global model then will be sent back to the worker nodes. users of the applications that build on centralized model Algorithm 1 is one of the most successful algorithms for training are reluctant to share their possibly confidential federated NN training, called FederatedAveraging [2]. data. We believe a particularly fitting scenario for this FederatedAveraging works pretty well solving the ag- problem is the use case of medical applications. Each gregation problem, however using gradients or, equiva- medical institute might have a lot of patient data, but that lently, the local models for the global aggregation step still may be far from enough to train their own prediction mod- exposes some information on users data. To address pri- els. Here, sharing the data across a big number of institute vacy concerns, the solution is usually to apply achieve- can yield a great help in developing automated diagnostic ments of differential privacy[3][4][5] atop the gradient tools. But being the private nature of these data, hospitals based learning process. probably decide not to share anything of this information In this paper we present a slightly different approach, either to protect their reputation or due to legal regulations. namely, we investigated whether it is possible to train NNs As it is summarized in [1], the characteristics of data in a federated fashion without using gradient in any con- that FL is concerned with can be described as follows: text. To approach the problem, it seemed to be a simple Copyright ©2019 for this paper by its authors. Use permitted under choice to try evolutionary algorithms. Since a rich liter- Creative Commons License Attribution 4.0 International (CC BY 4.0). ature is already available on evolutionary optimization of Algorithm 1 FederatedAveraging EAs – as nature inspired methods in general – are often 1: procedure S ERVER used to discover very complex, high dimensional and/or 2: initialize w0 non-convex search problems, therefore, attempts to apply 3: for t = 0; 1; 2; ... do these methods on optimizing NNs has a long history. 4: m ← max(C · K, 1) Recently, nature-inspired methods in relation with NNs 5: St ← m client nodes randomly are used mostly for hyper-parameter tuning that includes 6: for all k ∈ St in parallel do searching for an efficient architecture. 7: k wt+1 ← ClientUpdate(k, wt ) A big part of this rich literature is concerned specifi- 8: end for cally CNN-s, what we apply for our problem. Methods of 9: wt+1 = ∑Kk=1 nnk wt+1 k Genetic CNN [6], hierarchical evolution [7], large- scale 10: end for evolution [8], asynchronous CNN evolution [9] and auto- 11: end procedure matic CNN design [10] give graph based methods to de- 12: procedure C LIENT U PDATE (k, w) sign automatically the stack convolutional layers (skipping 13: B ← split Pk to set of batches potentially fully connected parts of the network) for image 14: for all b ∈ B do classification trough genetical evolution of subsequent lay- 15: w ← w − η∇ f (w, b). gradient of loss on batch b ers with various innovative encoding techniques. 16: end for In these scenarios, the learning itself is still based on 17: return W calculating the gradient and updating the model according 18: end procedure to that (backpropagation). Using backpropagation though being based on calcula- tion of gradients and on applying them on the weights of NNs, we only transfer this knowledge into the federated the network is exactly what we want to avoid in our exper- environment. iment. Before the monocracy of derivative based train- For the concrete task to be solved by our method we ing algorithms however biology-inspired training meth- have chosen classfication of EEG-signals using convolu- ods was a rather popular research topic, thus there is a tional neural networks (CNNs). rich, though a bit dated literature concerned with our con- The main contributions of this paper are strained problem. [11] and [12] give a summary of the 1. a proof of concept for applicability of genetic algo- these initial approaches to neuroevolution (NE). rithms to federated training of NNs without using vul- There is a very interesting branch of applications of NE nerable gradients; for general NN-s, that includes techniques to purely genet- ically train the architecture along with the weights of the 2. presenting Federated Neuroevolution (FNE) a simple networks. Among the most important algorithms that be- algorithm for the federated training, applying a dis- long here it might be worth to mention NeuroEvolution of tributed fitness function. Augmenting Topoplogies (NEAT) [13], and Hypercube- based NEAT (HyperNEAT) [14] and its specializattion for 2 Neuroevolution modular evolution of NN-s, HyperNEAT-LEO [15] and Generative NE [16]. Despite of the power of HyperNEAT, Evolutionary algorithms (EAs) follow the pattern of evo- we decided first to focus on training a predefined architec- lution as it is observed by biologists in the nature. Ac- ture, thus our method is based on more "traditional" NE cording to this, in an infinite cycle of life, the most apt in- algorithms. dividuals can produce offsprings possessing a potentially For applying evolutionary approach on an issue, one slightly changed (mutated) mixture of their genoms that need to specify an encoding of the problem, a selection, a might result an enhanced ability to face challenges in their crossover and a mutation method as well as a fitness func- life. The main assumption in biology is that those individ- tion. uals survive and create descendants with a bigger chance, In the rest of this section we shortly describe the stages who, in some aspects are superior to the others. The main of an EA along with a couple of examples of how these structure of an EA is sketched in Algorithm 2 stages have been implemented in some work on the field of NE, that gave inspiration to our algorithm. At the end of Algorithm 2 EA the section we also describe approaches aiming at handle 1: generate an initial population G0 , i = 0 overfitting that has been proven a serious problem in NE. 2: repeat 3: ∀individual j ∈ Gi : f j = f itness(individual j ) 4: select parents from Gi based on their fitness 2.1 Encoding 5: produce offspring generation Gi+1 Genetic algorithms work on sequence of features that 6: ∀individual j ∈ Gi+1 : individual j ← would be mixed, or altered according some granularity de- mutate(individual j ) fined over them. Thus the first step in solving a problem 7: until termination criterion is satisfied genetically is to provide a description of the search space. We refer to this description as encoding, that can be direct 2.3 Crossover or indirect. Crossover is a method that defines, how we combine in- dividuals of a generation to create offsprings for the next Direct encoding is the more traditional way of problem generations. One simple way is – as in [17] – to com- encoding, where sections of the genom more or less corre- bine the parts of parent individuals at some cutting points. spond to specific parameters. Some of the early methods Another approach is presented in [18], where crossover is hanlde some switches as well, that control the connectivity actually taking the average of the corresponding weights of the specific perceptrons. of the two individuals: x(t+1) = x1 (t)+x 2 2 (t) where x(t)s are [17] proposes a system based on a parallel genetic al- the individuals represented as vectors of parameters. gorithm, ANNA ELEONORA, for learning both topology and for connection weights. Topology Utilizes binary rep- resentation of networks, with granularity encoding that is 2.4 Mutation handled through one bit flag to determine connectivity, Mutation methods serve for adding extra variance to the that is, whether the given edge is present in the recent setup individual genoms to enable them to discover a bigger part or not, followed by the substring of weights. These sub- of the search space. [17] provides a representation that strings are ordered in a way that connections into the same translates different topologies and encoding length into a neuron are grouped together. common string format granting compositionally different [18] presents a variant of EA applied immediately for descriptions. At mutations, it applies three separate prob- float weights. The input of the EA is a vector x of variables abilities for swapping bits such that granularity bits, con- , that are the parameters of the model (that is the weights nectivity bit, and weight bits. For effectively explore the of the connections), the biases, and the newly invented search space, it uses EA simplex [20], instead of taking link switches. Link switches are variables, that control the three populations and creating a fourth based on those. connectivity of the network, that is, negative value rep- In [18], where the possible values for genes are con- resents that the edge is switched off. The search space strained, mutation is carried out according to the follow- is constrained by upper and lower bounds on variables ing formula: x(t+1) = x(t) + Bδ , B ∈ R2 is a diagonal ma- (weights): x ∈ I1 × I2 × · · · × Id , where Ii = [li , ui ], li , ui ∈ R (t+1) trix, with a diagonal Bii ∈ {0, 1}, and li ≤ xi + δi ≤ u i . for i = 1, 2, . . . , d. Based on this rule, the algorithm generates three individ- Theoretically using the connectivity features of the en- uals/chromosomes: at the first only one element of the di- coding the first method is able to evolve the architecture agonal of B can be one, at the second one a random num- too. The issue with this approach to encoding is that the ber of diagonal of elements, and at the last, Bii = 1 for all problem space grows very fast as we scale up the network i = 1, . . . , d. The one with the best fitness of these three (which we need if we want to solve complex problems). will replace the weakest one in the next population. Indirect Encoding The scaling problem of Direct Encod- ing can be solved with Indirect Encoding, that instead of 2.5 Overfitting separated representation of model parameters uses genera- Using EA usually involves a high computation demand, tive information. In HyperNEAT [14], which is maybe the which can be reduced through decreasing the number of most important representative of this class, genes of the evaluations of the model, that is the size of training data genom are defining functions based on which weights can on which we want to try out the models defined by a given be generated. generation of the genetic algorithms. Earlier applications of EA usually did not use separation of data into training and test set (like [21], for example). 2.2 Fitness Practitioners soon realized, however, that models trained this way perform poorly on not seen data points, revealing The fitness function serves to specify how well a given in- the tendency of evolutionary methods to strongly overfit dividual performs on the problem to be solved. A higher on the training problems. This issue got in the center of value of the fitness function means a better solution for the interest, when as an attempt to reduce run time they tried problem, while lower fitness value reports a poor perfor- to use subsets of the training data to evaluate individuals. mance. Fitness is often normalized thus a function that [22] made comprehensive experiments proving that evo- produces a fitness value 1 for a perfect solution, and 0 for lution is potentially able to extrapolate from the randomly completely wrong setup can work well. As an example chosen test sets. A very promising direction to reduce for a normalized fitness in ML scenarios, [19] proposes a 1 overfitting is random sampling, where at each generation, fitness function for NN defined as fnorm = 1+err , where a random subset of the training data is chosen and evo- ∑d |y −ŷ | err = ∑m k=1 i=1 i md i , with d denoting the output dimen- lution is performed based on the fitness on that sample. sion, and m the number of examples, applying mean abso- The Random Sampling Technique (RST) [23] was origi- lute error. nally used for speeding up the GE runs in [24], however, it was already used for preventing overfitting. [25] and [26] drive some experiments on RST, where they were testing 1.00 two parameters, the Random Subset Size (RSS) and the 0.95 Random Subset Reset (frequency of changing the subset). They have found, interestingly, that the techniques per- 0.90 0.85 Accuracy forms best when both these values are set to one, that is in each iteration the fitness should be tested using a new 0.80 randomly chosen data point. 0.75 In [27], the authors present versions of “interleaved 0.70 sampling”, that means, instead of random subsets, fitness train 0.65 validation at each round is evaluated alternating between one and all 0.60 training samples, with various switching frequencies. As 0 10 20 30 40 50 60 70 80 90 100 a result, they find that, on their test datasets, the best tech- Epochs nique would be to switch in each round between single sample and all sample evaluation. Figure 1: Baseline accuracy (backpropagation) 3 The problem Selection The candidate set of individuals for crossover is 3.1 Data created by sorting the current generation’s models based For the experiment, we used the EEG Database Data Set on their fitness and selecting the n − 1 fittest models for [28]. The dataset contains 120 EEG trial data about 122 crossover. The last parent selected for mating is not among patients who either belong to the alcoholic or to the control the fittest ones, but chosen randomly from the rest, to add group. In each trial, the patients were shown one or two more variance. images of the Snodgrass and Vanderwart picture set [29]. After showing them the stimuli, their brain activation was measured for 1 second on 64 points at 256 Hertz. The Crossover functions Crossover method defines the way measurements are then labelled according to which group according to which new individuals will be generated from they belong to, thus the task of the model to be built is to the parent generation. predict which class of the two does a sample belong to. In our method, we pick two parents randomly from the pool of parents to produce the required offspring amount. We run experiments with four crossover methods. The 3.2 Network architecture first three require flattening the vector of weights. These For the network architecture to train, we decided to use first three approaches are rather popular in EA research. the shallow convolutional network from [30], that has been designed specifically for EEG based multiclass prediction • Halving mix: In this approach, that is a simplified problems. The essence of the network is three convolu- version of the one in [17], the vector of values from tional layer that are intended to recognize specific patterns the parents are taken to create the offspring vector in the signals. After two convolutional layers there is a by taking the first half of it from the first parent and pooling layer and then comes the third convolutional layer. the second half from the second parent. This was the On the output of this layer we applied batch normalization, original approach in [32] too. Formally: then added the output dense layer with sigmoid activation. ( For the control experiment, we used the AdaDelta opti- ai , if i ≤ n/2 mizer [31] with Categorical Cross-Entropy loss function. {o f f springi }ni=1 = (2) bi , if i > n/2 At training we used a batch size of 64, 1.0 as learning rate, ρ = 0.95, and ε = 10−7 . where n is the length of the model vectors and a = The control model after 100 epochs achieved a valida- (a1 , a2 , . . . , an ), b = (b1 , b2 , . . . , bn ) are the parent tion accuracy of 95% (see Figure 1). vectors. • Interleave mix: Here, the vector of values from the 4 The proposed methods parents are taken to create the offspring vector by in- The algorithm runs according to the process defined in Al- terleaving the two parent vectors. Formally: gorithm 2. For starting off we create an initial generation ( in which for each individuals initialize the weights of the n ai , if i mod 2 = 0 {o f f springi }i=1 = (3) models randomly. From the initial generation then we it- bi , if i mod 2 = 1 erate along the fitness-selection-crossover-mutation loop. In this section we describe the particular methods we used where n is the length of the model vectors and a, b are for the different stages. the parent vectors. • Mean mix: In this method, similarly to [18], the vec- • Mutate by multiplication (from [33]): Here, tor of values from the parents are taken to create the we multiply the selected values with a ran- offspring vector by taking the mean of the two parent dom value. In our implementation, the mul- vectors at each index. Formally: tiplication factor was a random value between [ 100−mutation_rate 100 , 100+mutation_rate 100 ]. ai + bi {o f f springi }ni=1 = { } (4) 2 After experimenting, we found a lot better convergence rate with the second approach. where n is the length of the model vectors and a, b are the parent vectors. 4.2 Federated fitness function • Kernelwise mix: In this algorithms we used a more coarse units for the crossover. In each convolutional For fitness, which should be maximized during the evo- layer there are multiple kernels/filters that hold key lutionary training, we have chosen the Negative Mean pattern information. Similarly, in case of fully con- Squared Error (NMSE), that is defined as in Equation (5). nected layers, input weights belonging to a single neurons describes some pattern in the previous lay- 1 n d (i) (i) fNMSE (w) = −1 ∗ ∑ ∑ (ŷ j − y j )2 (5) ers. These information portions are kept intact dur- n i=1 j=1 ing crossover. The offspring model is created by ran- domly mixing the kernels inside each layer. where ŷ is the predicted output vector using parameters w, y is the target output vector, d is the output dimension and In our experiments, the first three crossover methods did n is the number of examples. This a is slightly different not converge. This could be because these approaches are function, than the one in the example in section 2.2, but very low level and do not care about the network structure it’s behaviour is the same( ∂∂wi fNMSE (w) ∗ ∂∂wi fnorm (w) > or the patterns learned in the kernels. 0, ∀w, i ) Kernelwise mixing is a higher level approach, what we Applying the NMSE fitness for the original optimiza- tried after taking a look at how genetics works in nature. In tion problem in Equation (6) our task will be to maximize nature, the heredity is also a higher level mixing of genes, NMSE with respect to w: instead of low level mix of organic molecules. Thus, traits n of the parents are kept intact. The resemblance to genetics maxw∈Rd fNMSE (w) = − ∑ kŷ(i) − y(i) k22 . (6) can be summarized as follows: the DNA is the network’s i=1 weights, a gene is a filter and an organic molecule is a float value. With this latest method, mixing the evolutionary training was converging so we were applying this in our 4.3 Federated optimization and avoiding overfitting approach. In our setup the generation of individuals, that is the selec- tion, the crossover and mutation happens at a centralized 4.1 Mutation functions location at a parameter server. The connected nodes of the system participate in the optimization through evaluating Crossover on its own results in generations that are only the different proposed setup. The fitness of an individual combinations of the initial generation according to the de- can be calculated as a weighted average over the local fit- fined rules. Thus using merely crossover restrict the space ness values, in theory, during the training tough, as we will searched by the algorithm. To break this random alter- see we should not use this measurement to prevent overfit- nations of the offspring are applied in form of mutation ting. functions. Avoiding overfitting has been studied in [25, 22, 26, 27], For defining a mutation function we must define the as it is discussed in Section 2.5. The main idea is that we number of mutated values and the scale of the mutation on must not include the entire training set in the whole dura- these values. For the former we used probabilistic value tion of the training. Instead, what most articles propose, is determining the chance of mutation for each value in the to use subsets of the training data in each generation. The model. The latter is a float value determining how much is training subset can be changed every generation or kept in- the impact on each mutating value. tact for a few generations. Studies interestingly show that There are the following two main approaches we tried randomly selecting a single training sample is also very for mutating values in a network: effective both for convergence and avoiding over-fitting. Another suggested tweak is to include the full dataset ev- • Mutate by offset (from [32]): Here, we add a ran- ery once in a while. dom value to the selected values. In our imple- Due the distributed nature of the problem, it was a rather mentation, the offset was a random value between natural idea to incorporate the native data partitioning of [−mutation_rate, mutation_rate]. the federated setup, and to do the subset selection at a higher level and treat the nodes as units of the subset cre- • Early stopping: We save the fittest model of each ation, instead of specific data points. Thus, in each gen- generation, as an additional means to stop before we eration, the Federated Neuroevolution algorithm selects a overfit. subset of the nodes for evaluating the fitness of the current generation. To ensemble the evaluation sets, we have tried 5 Results the following three approaches: 1. Random single element for each generation: Here, We have run the described evolution algorithm for 5000 in each iteration we ask a randomly selected node to generations (Figure 2). For our setup, we observed that evaluate the population’s fitness on a randomly se- the convergence was slow but steady, overall. lected single training sample of it’s own. Minimum value Maximum value 2. Random subset for each generation: In this ap- Validation Accuracy 48.50% 85.28% proach a random subset of nodes are selected to eval- Fitness NMSE −0.3297 −0.0903 uate. We found this method the most efficient. 3. Moving window subset for each generation: Here, Table 1: Federated Neuroevolution Performance on the we first order the nodes and then select a slice of the EEG Dataset list of nodes. This is the window and in every n gen- eration we move the window to the right by 1. The second approach of randomly selecting a subset of nodes had the best performance. Even if, according to the 0.985 literature, method 1 works pretty well, in our experiments, 0.935 Accuracy the training did not converge at all. The third method 0.885 seemed to be more promising, training convergence was 0.835 0.785 Fitness slower with this method than in the case of the second 0.735 method and the convergence also capped around 75% val- 0.685 idation accuracy. 0.635 0.585 The main algorithm Using the fitness evaluation meth- 0.535 ods we described in Section 4.3, the main run of the opti- 0.485 mization looks like the following: 0 1000 2000 3000 4000 5000 Generations • Validation: On the server, we retain a validation set and in each generation we calculate and store the val- idation accuracy of the fittest model of the current −0.05 generation. This is not far fetched as we can assume Fitness that in a Federated setting the server driving the learn- −0.10 ing would already have a dataset of it’s own. −0.15 • Avoiding critical points: Based on the history of val- Fitness idation accuracies, we check the last n entries for a −0.20 match with the current validation accuracy. If there is −0.25 a match, we conclude that the evolution has reached some kind of critical point of the fitness function as −0.30 local maximum or saddle point. That is, however we try to combine and mutate the individuals of the sub- −0.35 0 1000 2000 3000 4000 sequent generations, the fitness/accuracy does not in- Generations crease. Our hypothesis is, that in this case the pop- ulation stuck in a higher region of the values of fit- ness function, and in the neighbourhood defined by Figure 2: Running Federated Neuroevolution on the EEG our mutation rate the offsprings cannot find any in- dataset for 5000 generations. Fitness is NMSE from equa- creasing directions. In this case we start gradually tion 5, Accuracy is validation accuracy. increasing the mutation rate and the mutation chance multiplier which is initially set to 1. Once the algo- From a fully random state, the algorithm was able to rithm is out of the local maximum, we reset the values get to 85% validation accuracy as seen in Table 1. This of the mutation rate and mutation chance to the orig- is, of course, a lot less than the baseline but still a good re- inal values. There is an upper bound on the mutation sult considering using Neuroevolution for training weights multiplier. which is not the best method for training NNs. 6 Conclusion and future work Hungarian Government and co-financed by the European Social Fund. In this paper we described our experiments with a simple Supported by Telekom Innovation Laboratories (T- method, what we call Federated Neuroevolution (FNE), Labs), the Research and Development unit of Deutsche that is an application of EA adapted for FL of NNs. Telekom. We found that our method is applicable on the studied scenario yielding some advantages over the traditional FL methods. References An advantage of EA, compared to the gradient based algorithms originated from [1], [34] or [2], is that it re- [1] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, quires even less client data transfer to the server. While “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv preprint arXiv:1610.02527, FedAVG exposes the client side data distribution and the 2016. gradients during learning, FNE only expose the amount of [2] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., data points of the clients and an abstract fitness number of “Communication-efficient learning of deep networks from the model. decentralized data,” arXiv preprint arXiv:1602.05629, The clear disadvantage is that the convergence is a lot 2016. slower. We needed 5000 iterations of the algorithm to [3] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Dif- get to an 85% accuracy which is still less then the base- ferentially private empirical risk minimization,” Journal of line’s 95%. At this point though our purpose was merely Machine Learning Research, vol. 12, no. Mar, pp. 1069– to demonstrate the feasibility of derivative-free learning of 1109, 2011. NN-s in a FL scenario. [4] C. Dwork, A. Roth et al., “The algorithmic foundations of In summary, the technique we introduced, trades off differential privacy,” Foundations and Trends® in Theoret- learning speed for privacy gains. We may need a lot of ical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014. communication rounds which can be bad in a real-world [5] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, setting of mobile users, but for some use cases, like for I. Mironov, K. Talwar, and L. Zhang, “Deep learning data from medical institutions, the rounds of communi- with differential privacy,” in Proceedings of the 2016 ACM cation is not of primary importance, while keeping data SIGSAC Conference on Computer and Communications privacy is essentials. Another aspect of techniques similar Security. ACM, 2016, pp. 308–318. to FNE that might be interesting, is that there is no tra- [6] L. Xie and A. Yuille, “Genetic cnn,” in Proceedings of the ditional, backpropagation based learning, that is at client IEEE International Conference on Computer Vision, 2017, side we can save this rather expensive stages of the learn- pp. 1379–1388. ing process. [7] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and In the future we think there are several possible di- K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” arXiv preprint arXiv:1711.00436, rections to develop FNE to make it practical. First the 2017. rather poor performance of the system might be improved [8] T. Desell, “Large scale evolution of convolutional neu- through experimenting with different submethods (selec- ral networks using volunteer computing,” in Proceedings tion , crossover, etc.) of the Genetic and Evolutionary Computation Conference Following the trends in genetic algorithms, the search Companion. ACM, 2017, pp. 127–128. space could be extended to the network architecture too. [9] P. Vidnerová and R. Neruda, “Asynchronous evolution of This way we could reduce the bias and variance introduced convolutional networks,” 2018. by the model architecture that is chosen rather blindly at [10] Y. Sun, B. Xue, M. Zhang, and G. G. Yen, “Automatically the initiation phase of the learning. designing cnn architectures using genetic algorithm for im- Bearing in mind the main purpose of the experiments, age classification,” arXiv preprint arXiv:1808.03818, 2018. that is prevent the communicating the gradients, a range of [11] D. Whitley, T. Starkweather, and C. Bogart, “Genetic algo- derivative free methods are available as Differential Evolu- rithms and neural networks: Optimizing connections and tion [35], Particle Swarm Optimization[36] or other biol- connectivity,” Parallel computing, vol. 14, no. 3, pp. 347– ogy inspired methods like Artifical Bee Colony [37]. Sim- 361, 1990. ilarly, advanced optimization methods as CMA-ES[38] [12] X. Yao, “Evolving artificial neural networks,” Proceedings might be applied. of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999. It could be also interesting to experiment with more ef- [13] K. O. Stanley and R. Miikkulainen, “Efficient evolution ficient utilization of resources, since in the current setup in of neural network topologies,” in Proceedings of the 2002 each round the vast majority of nodes is idle. Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), vol. 2. IEEE, 2002, pp. 1757–1762. [14] J. Gauci and K. Stanley, “Generating large-scale neural net- Acknowledgements EFOP-3.6.3-VEKOP-16-2017- works through discovering geometric regularities,” in Pro- 00001: Talent Management in Autonomous Vehicle ceedings of the 9th annual conference on Genetic and evo- Control Technologies - The Project is supported by the lutionary computation. ACM, 2007, pp. 997–1004. [15] P. Verbancsics and K. O. Stanley, “Constraining connectiv- [31] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” ity to encourage modularity in hyperneat,” in Proceedings arXiv preprint arXiv:1212.5701, 2012. of the 13th annual conference on Genetic and evolutionary [32] A. Gad. (2019) Artificial Neural Networks Op- computation. ACM, 2011, pp. 1483–1490. timization using Genetic Algorithm with Python. [16] P. Verbancsics and J. Harguess, “Generative neuroevolution [Online]. Available: https://towardsdatascience. for deep learning,” arXiv preprint arXiv:1312.5355, 2013. com/artificial-neural-networks-optimization-\ [17] V. Maniezzo, “Genetic evolution of the topology and using-genetic-algorithm-with-python-1fe8ed17733e weight distribution of neural networks,” IEEE Transactions [33] R. Oullette, M. Browne, and K. Hirasawa, “Genetic al- on neural networks, vol. 5, no. 1, pp. 39–53, 1994. gorithm optimization of a convolutional neural network [18] H. Lam, S. Ling, F. H. Leung, and P. K.-S. Tam, “Tuning for autonomous crack detection,” in Proceedings of the of the structure and parameters of neural network using an 2004 congress on evolutionary computation (IEEE Cat. No. improved genetic algorithm,” in IECON’01. 27th Annual 04TH8753), vol. 1. IEEE, 2004, pp. 516–521. Conference of the IEEE Industrial Electronics Society (Cat. [34] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz, No. 37243), vol. 1. IEEE, 2001, pp. 25–30. “Revisiting distributed synchronous sgd,” arXiv preprint [19] J.-T. Tsai, J.-H. Chou, and T.-K. Liu, “Tuning the struc- arXiv:1604.00981, 2016. ture and parameters of a neural network by using hybrid [35] J. Ilonen, J.-K. Kamarainen, and J. Lampinen, “Differential taguchi-genetic algorithm,” IEEE Transactions on Neural evolution training algorithm for feed-forward neural net- Networks, vol. 17, no. 1, pp. 69–80, 2006. works,” Neural Processing Letters, vol. 17, no. 1, pp. 93– [20] H. Bersini and G. Seront, “In search of a good crossover 105, 2003. between evolution and optimization,” M anner and Mand- [36] B. A. Garro, H. Sossa, and R. A. Vazquez, “Design of arti- erick, vol. 1503, pp. 479–488, 1992. ficial neural networks using a modified particle swarm op- [21] J. R. Koza and J. R. Koza, Genetic programming: on the timization algorithm,” in 2009 International Joint Confer- programming of computers by means of natural selection. ence on Neural Networks. IEEE, 2009, pp. 938–945. MIT press, 1992, vol. 1. [37] B. A. Garro, H. Sossa, and R. A. Vázquez, “Artificial neural [22] W. Langdon, “Minimising testing in genetic program- network synthesis by means of artificial bee colony (abc) ming,” RN, vol. 11, no. 10, p. 1, 2011. algorithm,” in 2011 IEEE Congress of Evolutionary Com- [23] C. Gathercole and P. Ross, “Dynamic training subset se- putation (CEC). IEEE, 2011, pp. 331–338. lection for supervised learning in genetic programming,” in [38] N. Hansen, S. D. Müller, and P. Koumoutsakos, “Reducing International Conference on Parallel Problem Solving from the time complexity of the derandomized evolution strategy Nature. Springer, 1994, pp. 312–321. with covariance matrix adaptation (cma-es),” Evolutionary [24] Y. Liu and T. Khoshgoftaar, “Reducing overfitting in ge- computation, vol. 11, no. 1, pp. 1–18, 2003. netic programming models for software quality classifica- tion,” in Eighth IEEE International Symposium on High As- surance Systems Engineering, 2004. Proceedings. IEEE, 2004, pp. 56–65. [25] I. Gonçalves, S. Silva, J. B. Melo, and J. M. Carreiras, “Random sampling technique for overfitting control in ge- netic programming,” in European Conference on Genetic Programming. Springer, 2012, pp. 218–229. [26] I. Gonçalves and S. Silva, “Experiments on controlling overfitting in genetic programming,” in 15th Portuguese conference on artificial intelligence (EPIA 2011), 2011, pp. 10–13. [27] ——, “Balancing learning and overfitting in genetic pro- gramming with interleaved sampling of training data,” in European Conference on Genetic Programming. Springer, 2013, pp. 73–84. [28] H. Begleiter. EEG Database Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/eeg+database [29] J. G. Snodgrass and M. Vanderwart, “A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity.” Journal of experimental psychology: Human learning and memory, vol. 6, no. 2, p. 174, 1980. [30] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, “Deep learning with convolutional neural networks for eeg decoding and visualization,” Hu- man brain mapping, vol. 38, no. 11, pp. 5391–5420, 2017.