<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Optimization in Federated Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Vukasin</forename><surname>Felbab</surname></persName>
							<email>vukasindfelbab@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Péter</forename><surname>Kiss</surname></persName>
							<email>peter.kiss@inf.elte.hu</email>
						</author>
						<author>
							<persName><forename type="first">Tomáš</forename><surname>Horváth</surname></persName>
							<email>tomas.horvath@inf.elte.hu</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Department of Data Science and Engineering ELTE</orgName>
								<orgName type="department" key="dep2">Faculty of Informatics Budapest</orgName>
								<orgName type="institution">Eötvös Loránd University</orgName>
								<address>
									<postCode>H-1117</postCode>
									<settlement>Budapest</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Pázmány Péter sétány 1/C</orgName>
								<address>
									<country key="HU">Hungary</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Optimization in Federated Learning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A28D6A5DB7169A42066225CE9CB21497</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T23:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Federated learning (FL) is an emerging branch of machine learning (ML) research, that is examining the methods for scenarios, where individual nodes possess parts of the data, and the task is to form a single common model that fits to the whole distribution. In FL, we generally use mini batch gradient descent for optimizing weights of the models that appears to work very well for federated scenarios. For traditional machine learning setups, a number of modifications has been proposed to accelerate the learning process and to help to get over challenges posed by the high dimensionality and nonconvexity of search spaces of the parameters. In this paper we present our experiments on applying different popular optimization methods for training neural networks in a federated manner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Federated Learning</head><p>Federated learning (FL) <ref type="bibr" target="#b0">[1]</ref> is a new paradigm in Machine Learning (ML), that is dealing with an increasingly important distributed optimization setting, that came into view with the spread of small user devices and applications written for them that can profit from ML. The domain of ML models is often the data collected on the devices, thus, to train these models, one should incorporate the knowledge contained into the learning process. The traditional way for this would be to transfer the information gathered at the users to data centers, where the training takes place, then send back the trained models to the users. That, apart from the obvious privacy concerns, can incur a huge communication overhead along with the need for significant storage and computational resources at the place of centralized training.</p><p>The idea proposed in <ref type="bibr" target="#b0">[1]</ref> is that, instead of moving the training data to centralized location, one could exploit the computational power residing at the user devices and distribute the training process across the participating nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Distributed Optimization</head><p>In ML, the goal is to find a model for the training data that minimizes a loss function f that defines how our learned model distribution differs from the empirical distribution.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This measure in general case can be formalized as a negative log likelihood.</p><formula xml:id="formula_0">f = −E x∼p data [log p model (x)]<label>(1)</label></formula><p>That is, if a given example x is drawn from the training data distribution, what is the probability that it will be present in the same form in the model distribution as well.</p><p>If the model is for predicting some value(s) y based on a vector of some attributes x this can be rewritten as f (x, y, w) = − log p(y|x; w)</p><p>, The problem we want to solve is to minimize the loss function f with respect to model parameters w, that is an aggregation of the losses over all available data point as follows: min</p><formula xml:id="formula_2">w∈R d f (w), where f (w) = 1 n n ∑ i=1 f i (w),<label>(3)</label></formula><p>where f i (w) def = f (x i , y i , w) denotes the loss on data point i given the parametrization w.</p><p>In the setup of FL the characteristics of data distribution from which our training examples (x i , y i ) will be drawn, are the following:</p><p>1. Massively Distributed. Data points are stored across a large number K of nodes. In particular, the number of nodes can be much bigger than the average number of training examples stored on a given node (n/K).</p><p>2. Non-IID. Data on each node may be drawn from a different distribution, i.e. the data points available locally are far from being a representative sample of the overall distribution. If we adapt the objective function (see Eq. 3) to these characteristics, our problem can be defined as introduced in the following paragraphs.</p><p>We have K nodes and n data points, a set of indices P k (k ∈ {1, . . . , K}) of data stored at node k, and n k = |P k | is the number of data points at P k . We assume that P k ∩ P l = / 0 whenever l = k, thus ∑ K k=1 n k = n.</p><p>We can then define the local loss for node as F k (w) def =</p><p>1 n k ∑ i∈P k f i (w) Thus the problem to be minimized will become:</p><formula xml:id="formula_3">min w∈R d f (w) = K ∑ k=1 n k n F k (w).<label>(4)</label></formula><p>To solve the problem in (4) the simplest algorithm is FederatedSGD introduced in <ref type="bibr" target="#b1">[2]</ref>, that is equivalent to minibatch gradient descent over all data, and it is a simple application of distributed synchronous Stochastic Gradient Descent (SGD) <ref type="bibr" target="#b2">[3]</ref> for the described setup. for t = 0; 1; 2; ... do</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4:</head><p>for all k in the K nodes in parallel do 5:</p><formula xml:id="formula_4">w k t+1 ← ClientUpdate(k, w t ) 6:</formula><p>end for 7:</p><formula xml:id="formula_5">w t+1 = ∑ K k=1 n k n w k t+1 8:</formula><p>end for 9: end procedure 10: procedure CLIENTUPDATE(k, w) In neural network (NN) optimization, due to the non convexity of the loss functions, the most used methods for optimization of network parameters are gradient based, more specifically the versions of SGD <ref type="bibr" target="#b3">[4]</ref>. Gradient descent methods take derivatives of loss function according to the parameters of the model, then move the parameter values in the negative of the gradient.</p><p>The pure form of SGD samples a random function (e.g a random training data point) i t ∈ 1, 2, ..., n in iteration t and performs the update:</p><formula xml:id="formula_6">w t+1 = w t − η t ∇ f i t (w t ),<label>(5)</label></formula><p>where η t denotes the learning rate, which is, in the base case, decaying during the learning to enforce convergence. Intuitively, SGD works because evaluating the gradient at a single training example gives an unbiased estimation of derivative of the error function over all the training examples:</p><formula xml:id="formula_7">E[∇ f i t (w)] = ∇ f (w).</formula><p>In practice, instead of applying the gradient for w at each example, usually an average of gradients over b randomly chosen examples is used, that are evaluated at the same w. This method is called minibatch gradient descent (MBGD), that better exploits parallel computational capabilities of the hardware. (MBGD is still commonly referred to as SGD) Though SGD/MBGD in the above form is very popular in optimization, the basic approach can sometimes result in very slow learning. To tackle the challenges incurred by high curvature and noisy gradients of the loss function of NN, a range of method has been proposed based on exponentially decaying average of the gradients or on adapting learning rates. <ref type="bibr" target="#b4">[5]</ref> In this paper, we investigate the effects of these methods on the performance , of federated training of artificial neural networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Momentum techniques</head><p>Momentum based techniques <ref type="bibr" target="#b5">[6]</ref> use a velocity term during learning, that is an exponentially weighted average over the gradients of the past.</p><formula xml:id="formula_8">v ← β v − ∇ w ( 1 m m ∑ i=1 f i (w)) w ← w + v<label>(6)</label></formula><p>This term, on one hand, accelerates the learning process and, on the other hand, helps to get over noisy gradient and local minima or flat points of surface defined by the error function f . A variant of momentum algorithm is introduced in <ref type="bibr" target="#b6">[7]</ref> and is based on the Nesterov's accelerated gradient method, that differs from the standard momentum of <ref type="bibr" target="#b5">(6)</ref> in the place of the evaluation of the gradient.</p><formula xml:id="formula_9">v ← β v − ∇ w ( 1 m m ∑ i=1 f i (w + αv)) w ← w + v<label>(7)</label></formula><p>In Nesterov's momentum, the gradients are evaluated incorporating the velocity. This can be interpreted as adding a correction factor to the standard momentum algorithm <ref type="bibr" target="#b4">[5]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Adaptive Learning Rates</head><p>Setting up learning rates is one of the most important factor in the learning process and deeply influences the performance. Thus, finding methods to adapt the learning rate might yield a substantial increase in speed of the learning. The AdaGrad algorithm <ref type="bibr" target="#b7">[8]</ref> adjusts the learning rates individually for each parameter, taking into account the whole history of the parameters, following the assumption, that if the magnitude of the gradients is big than it should be increased:</p><formula xml:id="formula_10">η t = η ∑ t−1 τ=1 g 2 τ ,<label>(8)</label></formula><p>where g = ∂ f ∂ w j for some parameter w j , and thus η t will be the learning rate belonging to w j a timestep t.</p><p>It has been found empirically that aggregating the gradients from the beginning of the optimization can lead to too fast decay in the learning rate, that, in some cases, leads to weak performance.</p><p>To remedy this problem RMSProp <ref type="bibr" target="#b8">[9]</ref> (the same time proposed by the authors of AdaDelta <ref type="bibr" target="#b9">[10]</ref>) replaces this aggregation with a decaying average, in the form:</p><formula xml:id="formula_11">v t = ρv t−1 + (1 − ρ)g 2 t η t = η √ v t + ε<label>(9)</label></formula><p>RMSProp has been proven very effective in non-convex optimization problems of NN, thus, it is the most often used technique in practice.</p><p>According to the explanation in <ref type="bibr" target="#b4">[5]</ref>, AdaGrad is designed to converge rapidly when applied to convex functions. In non-convex cases it should pass many structures before arriving at a convex bowl, and, since it accumulates the entire history of the squared gradient, it can shrink prematurely and eventually vanish. In contrast, discarding the old gradients in the RMSProp case enables learning to proceed rapidly after finding the convex bowl, equivalently as if AdaGrad would have been initialized within that convex area.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Adam</head><p>The name of the Adam algorithm <ref type="bibr" target="#b10">[11]</ref> comes from "adaptive momentum", and can be viewed as the combination of adaptive learning rates and momentums.</p><formula xml:id="formula_12">g t ← ∇ w ( 1 m m ∑ i=1 f i (w t−1 ))<label>(10)</label></formula><formula xml:id="formula_13">m t ← β 1 m t−1 + (1 − β 1 )g t 1 − β t 1 (<label>11</label></formula><formula xml:id="formula_14">)</formula><formula xml:id="formula_15">v t ← β 2 v t−1 + (1 − β 2 )g 2 t 1 − β t 2 (12) w t ← w t−1 − ηm t √ v t + ε (element-wise)<label>(13)</label></formula><p>In Adam, the weight update is given by applying the RMSProp learning rate (12) on the momentum <ref type="bibr" target="#b10">(11)</ref>. (In Equation ( <ref type="formula">12</ref>) and ( <ref type="formula" target="#formula_13">11</ref>) the denominator is applied bias correction on the estimates.) We are not aware of clear theoretical understanding why this is advantageous, however, it seems to work very well in practice and became a de facto default optimization technique for a lot of ML practitioners.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Experimental setup</head><p>For analyzing thew performance of the optimizer algorithms, we implemented a simulation environment that trains multiple local NN models that would be aggregated into one common model, according to the Algorithm 1.</p><p>Compared to Algorithm 1, we have been varying the CLIENTUPDATE(k, w) method, where the local updates have been calculated. (Except for first experiment, since it describes exactly the MBGD method).</p><p>The new CLIENTUPDATE(k, w) method is introduced in the Algorithm 2. return W 8: end procedure Naturally, all the optimizers have their own hyperparameters which should be tuned to get the best possible result. However, for this experiment we used only the recommended values for them (that are in fact the default values in Keras/TensorFlow libraries).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Topology</head><p>The model we used is a simple multilayer perceptron. The input layer consists of 784 input units that is the flatten representation of the input images of size 28 × 28 pixels. The input is connected to one hidden layer of 128 neurons with ReLU activation. The output layer corresponds to the 10 output classes, thus it has 10 neurons with softmax activation.</p><p>In the implementation of the network, we relied on Keras NN API on a TensorFlow backend.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Data</head><p>For the experiment, we have chosen the Fashion MNIST dataset <ref type="bibr" target="#b11">[12]</ref> that was planned to replace the MNIST benchmark database.</p><p>From the characteristics of the FL scenario, in this experiment, we focused on non-iid nature of the data. That is, we have created local datasets of a highly skewed manner. Namely, training data at a given node contains exclusively, or almost exclusively, instances from the same class.</p><p>For these experiment, we have not taken into account the unbalanced nature, and each node have been assigned the same amount of data. Our idea here was that if something works in this simple setup then it might work in use cases closer to the real world problem.</p><p>Due to the lack of computational resources we also ignored the "massively distributed" condition and set the number of nodes to 10.</p><p>The distributions of the local datasets we tried in the experiments are the following: 99% non-iid The training data has been split into two parts in the ratio of 99%-1%, where the parts are independent and identically distributed, as best as possible. The 99% part will be assorted accorded to classes and then one class assigned to one of the nodes. The 1% part will be equally split into 10 parts and then added to the dataset of the particular nodes.</p><p>full-non-iid In the second test case, we assorted fully the training data and each node receives a dataset consisting of instances belonging purely to one single class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Hyperparameters</head><p>To measure general applicability of the examined algorithms on the problem of FL, we executed the learning process multiple times, using different parametrisations.</p><p>In setting the hyperparameters we followed the Method of GridSearch, that is we defined a set of possible values for each hyperparameter, then run the algorithms with all the combinations. At defining the set of the possible values we tried to include extremities and generally recommended values. In addition to the parameters described in Section 1 we also included experimenting with the decay of the learning rate. Here we only tried tried nevertheless two case at each configuration of the other parameters, the one without decay, and the one with time based decay, where the learning rate a time t will be η t = η 0 1+φ * t with the decay rate parameter φ = η 0 max{t} .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results</head><p>FedSGD -Simple Minibatch Gradient Descent As a baseline we run the experiment with the standard Minibatch Gradient Descent optimiation. The experiment result is shown in Figure <ref type="figure" target="#fig_4">1</ref>. It is clear that, as it often happens, the most simple algorithm, MBGD places the baseline rather high for the more sophisticated optimizer algorithm. It performs very well for the 99% non-iid datasets and surprisingly well with the full-noniid datasets, achieving an accuracy close to 75% in the 30 iterations with the best configuration of hyperparameters (η = 0.001, no decay). Moreover, in results it seems like on both distribution too high learning rate without decay leads to a poor performance.</p><p>FedSGD + Nesterov momentum In Figure <ref type="figure" target="#fig_3">2</ref>, the results using the Stochastic Gradient Descent with Nesterov momentum can be seen. We found that incorporating local momentum into computing the partial directions of the updates has a strong positive effect both on performance and convergence rate of the aggregated model at both data distributions.</p><p>The best performing configurations reached in the first couple of iterations the highest accuracy achieved by the baseline during the entire experiment. According to our results the higher the value β (that is the past directions influence stronger the update) is generally the better performance, apparently independently from η and decay rate.</p><p>FedSGD + AdaGrad The AdaGrad algorithm yields an even better performance 3, on the 99% non-iid datasets. Using this method results in the fastest convergence until the 70% of the baseline. In the first 30 rounds, though AdaGrad's performarnce drops dramatically full-non-iid datasets, reaching at most a 25% accuracy without obvious perspective further improvement. It might be interesting to check how many random training examples need to be put into the full-non-iid datasets to achieve the very good performance of the AdaGrad which is measured with the 99% non-iid datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>FedSGD + RMSProp</head><p>The RMSProp optimizer is used in the experiment which has produced the statistics in Figure <ref type="figure" target="#fig_6">4</ref>. We experienced that this optimizer algorithm apart from the stronger variance of performance seems to approach the accuracy of the AdaGrad and Nesterov momentum methods, and outperforming MBGD baseline as well on the 99% non-iid distribution. One the other hand though the accuracy on the full-non-iid still achieves a significantly worse performance compared to MBGD and Nesterov momentum, however leraning curve is much more promising than the one of AdaGrad. It reaches in the best performing setups about 50% accuracy, and shows an emerging tendency as well.</p><p>FedSGD + Adam The last experiment, we were applying Adam. The method is one of the most popular and often default optimizer <ref type="bibr" target="#b10">(11)</ref>, thanks to fast convergence, high accuracy in the traditional NN learning, and to its robustness to hyperparameter settings. In our experiment however, Adam worked with a very similar effectiveness to RMSProp, and has been definitely outperformed by MBGD and Nesterov momentum (Figure <ref type="figure">5</ref>) regarding to performance and smoothness of learning on the full-noniid datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In our experiment we found, that the best performing optimizer algorithms for both the distribution are Minibatch Gradient Descent without and with Nesterov momentum, whilst Adadelta and RMSProp is promising despite their poor performance on fully non-IID datasets. As one could have assumed, the presence of the non-iid part of the training data has a very strong regularizing effect even if its weight seems to negligible compared to the dominating class. Accuracy lr:0.001 decay:0.0 momentum:0.5 lr:0.01 decay:0.0 momentum:0.5 lr:1.0 decay:0.0 momentum:0.5 lr:0.001 decay:0.001 momentum:0.5 lr:0.01 decay:0.01 momentum:0.5 lr:1.0 decay:1.0 momentum:0.5 lr:0.001 decay:0.0 momentum:0.9 lr:0.01 decay:0.0 momentum:0.9 lr:1.0 decay:0.0 momentum:0.9 lr:0.001 decay:0.001 momentum:0.9 lr:0.01 decay:0.01 momentum:0.9 lr:1.0 decay:1.0 momentum:0.9 lr:0.001 decay:0.0 momentum:0.95 lr:0.01 decay:0.0 momentum:0.95 lr:1.0 decay:0.0 momentum:0.95 lr:0.001 decay:0.001 momentum:0.95 lr:0.01 decay:0.01 momentum:0. Accuracy lr:0.001 decay:0.0 epsilon:1e-08 rho:0.5 lr:0.01 decay:0.0 epsilon:1e-08 rho:0.5 lr:1.0 decay:0.0 epsilon:1e-08 rho:0.5 lr:0.001 decay:0.001 epsilon:1e-08 rho:0.5 lr:0.01 decay:0.01 epsilon:1e-08 rho:0.5 lr:1.0 decay:1.0 epsilon:1e-08 rho:0.5 lr:0.001 decay:0.0 epsilon:1e-08 rho:0.9 lr:0.01 decay:0.0 epsilon:1e-08 rho:0.9 lr:1.0 decay:0.0 epsilon:1e-08 rho:0.9 lr:0.001 decay:0.001 epsilon:1e-08 rho:0.9 lr:0.01 decay:0.01 epsilon:1e-08 rho:0.9 lr:1.0 decay:1.0 epsilon:1e-08 rho:0.9 lr:0.001 decay:0.0 epsilon:1e-08 rho:0.95 lr:0.01 decay:0.0 epsilon:1e-08 rho:0.95 lr:1.0 decay:0.0 epsilon:1e-08 rho:0.95 lr:0.001 decay:0.001 epsilon:1e-08 rho:0.95 lr:0.01 decay:0.01 epsilon:1e-08 rho:0.95 lr:1.0 decay:1.0 epsilon:1e-08 rho:0.95 Accuracy lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.5 lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.5 lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.5 lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.5 beta2:0.5 lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.5 beta2:0.5 lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.5 beta2:0.5 lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.9 lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.9 lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.9 lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.5 beta2:0.9 lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.5 beta2:0.9 lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.5 beta2:0.9 lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.95 lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.95 lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.95 lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.5 beta2:0.95 lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.5 beta2:0.95 lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.5 beta2:0.95 lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.5 lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.5 lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.5 lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.9 beta2:0.5 lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.9 beta2:0.5 lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.9 beta2:0.5 lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.9 lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.9 lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.9 lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.9 beta2:0.9 lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.9 beta2:0.9 lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.9 beta2:0.9 lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.95 lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.95 lr:1.0 decay:0.0 epsilon:1e-08 
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RMSprop</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Adam</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 5: Federated Averaging -Adam</head><p>In general we experienced that methods that are intended to reduce the variance of the gradient direction works actually quite well for our specific scenario (1). This can be because momentum techniques can be seen as an averaging over the subsequent gradients, leading to a less and less biased estimate of the optimal update direction. The fact that strong momentum (high β ) seems to help in the big majority of configurations of the other parameters supports this idea.</p><p>On the other hand methods that aim at adapting the magnitude of the gradients seem to harm the learning process (2) in the full-non-iid case. The reason behind this phenomenon is most probably, that the local optimisers update their inner state -which here corresponds to η learning rate -based on the local gradients. In each local training round according to our intuition the magnitude of gradients start growing, since the aggregation of the local models moved the model away from the locally optimal model, but as we approaching again the local optima, they start shrinking again, resulting in slower start (smaller η) of optimization in the next iteration.</p><p>The extremely poor performance of AdaGrad on the full-non-iid dataset can support this intuition, since it prevents even the described fluctuation of the learning rate, instead it decreases it continuously.</p><p>The good performance of these algorithms on the 99% non-iid might be explainable with the presence of gradients of really big magnitude in the decaying average that controls the learning rate keeping it at an effective level.</p><p>Another interesting phenomenon is that in case of Adam -where momentum and adaptive learning rates are both applied -the strong deccelerating effect of learning rate adaption apparently overrides the help of the momentum. However looking at magnitude of performance differences it might be understandable.</p><p>Although according to our results these optimizers are not clearly beneficial in perspective of finding the best global model, they still could be useful for optimizing the global model at clients. One can argue, that in the end the goal of the entire federated optimization is to provide clients with a model performs well on their own data.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>3 .</head><label>3</label><figDesc>Unbalanced. Different nodes may vary by orders of magnitude in the number of training examples they hold.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>11 :B</head><label>11</label><figDesc>← split P k to set of batches 12: for all b ∈ B do 13: w ← w − η∇ f (w, b)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Algorithm 2</head><label>2</label><figDesc>ClientUpdate 1: procedure CLIENTUPDATE(k, w) 2: B ← split P k to set of batches 3: for all b ∈ B do 4: ∆w = Optimizer(w, b)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: FedSGD baseline(simple minibatch updates )</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 2 :Figure 3</head><label>23</label><figDesc>Figure 2: FedSGD + Nesterov momentum</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: FedSGD with RMSProp</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle Control Technologies -The Project is supported by the Hungarian Government and co-financed by the European Social Fund. Supported by Telekom Innovation Laboratories (T-Labs), the Research and Development unit of Deutsche Telekom.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Federated optimization: Distributed machine learning for on-device intelligence</title>
		<author>
			<persName><forename type="first">J</forename><surname>Konečný</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">B</forename><surname>Mcmahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Richtárik</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1610.02527</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Communication-efficient learning of deep networks from decentralized data</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">B</forename><surname>Mcmahan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hampson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1602.05629</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Revisiting distributed synchronous sgd</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Monga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jozefowicz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1604.00981</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Online learning and stochastic approximations</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">On-line learning in neural networks</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page">142</biblScope>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Deep learning</title>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>MIT press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Some methods of speeding up the convergence of iteration methods</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">T</forename><surname>Polyak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">USSR Computational Mathematics and Mathematical Physics</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="17" />
			<date type="published" when="1964">1964</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">On the importance of initialization and momentum in deep learning</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Martens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Dahl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1139" to="1147" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Adaptive subgradient methods for online learning and stochastic optimization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Duchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hazan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Singer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2121" to="2159" />
			<date type="published" when="2011-07">Jul. 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Neural networks for machine learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Swersky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Coursera, video lectures</title>
		<imprint>
			<biblScope unit="volume">264</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Adadelta: an adaptive learning rate method</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Zeiler</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1212.5701</idno>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms</title>
		<author>
			<persName><forename type="first">H</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rasul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.07747</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
