=Paper=
{{Paper
|id=Vol-2473/paper13
|storemode=property
|title=Optimization in Federated Learning
|pdfUrl=https://ceur-ws.org/Vol-2473/paper13.pdf
|volume=Vol-2473
|authors=Vukasin Felbab,Péter Kiss,Tomáš Horváth
|dblpUrl=https://dblp.org/rec/conf/itat/FelbabKH19
}}
==Optimization in Federated Learning==
<pdf width="1500px">https://ceur-ws.org/Vol-2473/paper13.pdf</pdf>
<pre>
                                       Optimization in Federated Learning

                                            Vukasin Felbab, Péter Kiss, and Tomáš Horváth

                                            Department of Data Science and Engineering
                                     ELTE – Eötvös Loránd University, Faculty of Informatics
                                  Budapest, H-1117 Budapest, Pázmány Péter sétány 1/C., Hungary
                           vukasindfelbab@gmail.com, {peter.kiss, tomas.horvath}@inf.elte.hu

Abstract: Federated learning (FL) is an emerging branch                   This measure in general case can be formalized as a nega-
of machine learning (ML) research, that is examining the                  tive log likelihood.
methods for scenarios, where individual nodes possess
parts of the data, and the task is to form a single com-                                     f = −Ex∼pdata [log pmodel (x)]            (1)
mon model that fits to the whole distribution. In FL,
we generally use mini batch gradient descent for optimiz-                 That is, if a given example x is drawn from the training
ing weights of the models that appears to work very well                  data distribution, what is the probability that it will be
for federated scenarios. For traditional machine learning                 present in the same form in the model distribution as well.
setups, a number of modifications has been proposed to                    If the model is for predicting some value(s) y based on a
accelerate the learning process and to help to get over                   vector of some attributes x this can be rewritten as
challenges posed by the high dimensionality and non-
                                                                                             f (x, y, w) = − log p(y|x; w)             (2)
convexity of search spaces of the parameters. In this paper
we present our experiments on applying different popu-                    ,
lar optimization methods for training neural networks in a                   The problem we want to solve is to minimize the loss
federated manner.                                                         function f with respect to model parameters w, that is an
                                                                          aggregation of the losses over all available data point as
1     Federated Learning                                                  follows:

                                                                                                                    1 n
Federated learning (FL) [1] is a new paradigm in Machine                               min f (w), where f (w) =       ∑ fi (w),        (3)
                                                                                      w∈Rd                          n i=1
Learning (ML), that is dealing with an increasingly impor-
tant distributed optimization setting, that came into view                              def
with the spread of small user devices and applications writ-              where fi (w) = f (xi , yi , w) denotes the loss on data point i
ten for them that can profit from ML. The domain of ML                    given the parametrization w.
models is often the data collected on the devices, thus, to                  In the setup of FL the characteristics of data distribution
train these models, one should incorporate the knowledge                  from which our training examples (xi , yi ) will be drawn,
contained into the learning process. The traditional way                  are the following:
for this would be to transfer the information gathered at                     1. Massively Distributed. Data points are stored across
the users to data centers, where the training takes place,                       a large number K of nodes. In particular, the number
then send back the trained models to the users. That, apart                      of nodes can be much bigger than the average number
from the obvious privacy concerns, can incur a huge com-                         of training examples stored on a given node (n/K).
munication overhead along with the need for significant
storage and computational resources at the place of cen-                      2. Non-IID. Data on each node may be drawn from a
tralized training.                                                               different distribution, i.e. the data points available lo-
   The idea proposed in [1] is that, instead of moving the                       cally are far from being a representative sample of the
training data to centralized location, one could exploit the                     overall distribution.
computational power residing at the user devices and dis-
tribute the training process across the participating nodes.                  3. Unbalanced. Different nodes may vary by orders of
                                                                                 magnitude in the number of training examples they
                                                                                 hold.
1.1   Distributed Optimization
                                                                             If we adapt the objective function (see Eq. 3) to these
In ML, the goal is to find a model for the training data that             characteristics, our problem can be defined as introduced
minimizes a loss function f that defines how our learned                  in the following paragraphs.
model distribution differs from the empirical distribution.                  We have K nodes and n data points, a set of indices Pk
                                                                          (k ∈ {1, . . . , K}) of data stored at node k, and nk = |Pk | is
     Copyright ©2019 for this paper by its authors. Use permitted under   the number of data points at Pk . We assume that Pk ∩
Creative Commons License Attribution 4.0 International (CC BY 4.0).       Pl = 0/ whenever l 6= k, thus ∑Kk=1 nk = n.
                                                           def   form is very popular in optimization, the basic approach
   We can then define the local loss for node as Fk (w) =
1                                                                can sometimes result in very slow learning. To tackle the
nk ∑i∈Pk f i (w) Thus the problem to be minimized will be-
come:                                                            challenges incurred by high curvature and noisy gradients
                                                                 of the loss function of NN, a range of method has been
                                  K
                                     nk                          proposed based on exponentially decaying average of the
              minw∈Rd f (w) = ∑         Fk (w).           (4)    gradients or on adapting learning rates. [5]
                                 k=1 n
                                                                    In this paper, we investigate the effects of these meth-
   To solve the problem in (4) the simplest algorithm is         ods on the performance , of federated training of artificial
FederatedSGD introduced in [2], that is equivalent to mini-      neural networks.
batch gradient descent over all data, and it is a simple ap-
plication of distributed synchronous Stochastic Gradient
Descent (SGD) [3] for the described setup.                       1.2   Momentum techniques

                                                                 Momentum based techniques [6] use a velocity term dur-
Algorithm 1 FederatedSGD                                         ing learning, that is an exponentially weighted average
 1: procedure S ERVER                                            over the gradients of the past.
 2:    initialize w0
 3:    for t = 0; 1; 2; ... do                                                                    1 m
 4:         for all k in the K nodes in parallel do                              v ← β v − ∇w (     ∑ fi (w))
                                                                                                  m i=1
 5:               k
                wt+1  ← ClientUpdate(k, wt )
 6:         end for                                                             w ← w+v                                  (6)
 7:         wt+1 = ∑Kk=1 nnk wt+1
                               k
                                                                 This term, on one hand, accelerates the learning process
 8:    end for
                                                                 and, on the other hand, helps to get over noisy gradient
 9: end procedure
                                                                 and local minima or flat points of surface defined by the
10: procedure C LIENT U PDATE (k, w)
                                                                 error function f .
11:    B ← split Pk to set of batches
                                                                    A variant of momentum algorithm is introduced in
12:    for all b ∈ B do
                                                                 [7] and is based on the Nesterov’s accelerated gradient
13:         w ← w − η∇ f (w, b)
                                                                 method, that differs from the standard momentum of (6)
14:    end for
                                                                 in the place of the evaluation of the gradient.
15:    return W
16: end procedure                                                                              1 m
                                                                              v ← β v − ∇w (     ∑ fi (w + αv))
                                                                                               m i=1
   In neural network (NN) optimization, due to the non                       w ← w+v                                     (7)
convexity of the loss functions, the most used methods
for optimization of network parameters are gradient based,       In Nesterov’s momentum, the gradients are evaluated in-
more specifically the versions of SGD [4]. Gradient de-          corporating the velocity. This can be interpreted as adding
scent methods take derivatives of loss function according        a correction factor to the standard momentum algorithm
to the parameters of the model, then move the parameter          [5].
values in the negative of the gradient.
   The pure form of SGD samples a random function (e.g
a random training data point) it ∈ 1, 2, ..., n in iteration t   1.3   Adaptive Learning Rates
and performs the update:
                                                                 Setting up learning rates is one of the most important fac-
                 wt+1 = wt − ηt ∇ fit (wt ),              (5)    tor in the learning process and deeply influences the per-
                                                                 formance. Thus, finding methods to adapt the learning rate
where ηt denotes the learning rate, which is, in the base        might yield a substantial increase in speed of the learning.
case, decaying during the learning to enforce convergence.       The AdaGrad algorithm [8] adjusts the learning rates indi-
Intuitively, SGD works because evaluating the gradient at        vidually for each parameter, taking into account the whole
a single training example gives an unbiased estimation of        history of the parameters, following the assumption, that
derivative of the error function over all the training exam-     if the magnitude of the gradients is big than it should be
ples: E[∇ fit (w)] = ∇ f (w).                                    increased:
                                                                                                η
   In practice, instead of applying the gradient for w at                             ηt = q           ,                  (8)
                                                                                                t−1 2
each example, usually an average of gradients over b ran-                                      ∑τ=1 gτ
domly chosen examples is used, that are evaluated at the
same w. This method is called minibatch gradient de-               where g = ∂∂wf j for some parameter w j , and thus ηt will
scent (MBGD), that better exploits parallel computational        be the learning rate belonging to w j a timestep t.
capabilities of the hardware. (MBGD is still commonly              It has been found empirically that aggregating the gradi-
referred to as SGD) Though SGD/MBGD in the above                 ents from the beginning of the optimization can lead to too
fast decay in the learning rate, that, in some cases, leads to   have been calculated. (Except for first experiment, since
weak performance.                                                it describes exactly the MBGD method).
   To remedy this problem RMSProp [9] (the same time                The new C LIENT U PDATE(k, w) method is introduced in
proposed by the authors of AdaDelta [10]) replaces this          the Algorithm 2.
aggregation with a decaying average, in the form:
                                                                 Algorithm 2 ClientUpdate
                  vt = ρvt−1 + (1 − ρ)gt2
                                                                  1: procedure C LIENT U PDATE (k, w)
                          η
                  ηt = √                                  (9)     2:    B ← split Pk to set of batches
                         vt + ε                                   3:    for all b ∈ B do
RMSProp has been proven very effective in non-convex              4:        ∆w = Optimizer(w, b)
optimization problems of NN, thus, it is the most often           5:        w ← w − ∆w
used technique in practice.                                       6:    end for
   According to the explanation in [5], AdaGrad is de-            7:    return W
signed to converge rapidly when applied to convex func-           8: end procedure
tions. In non-convex cases it should pass many structures
before arriving at a convex bowl, and, since it accumulates
                                                                    Naturally, all the optimizers have their own hyper-
the entire history of the squared gradient, it can shrink pre-
                                                                 parameters which should be tuned to get the best possi-
maturely and eventually vanish. In contrast, discarding the
                                                                 ble result. However, for this experiment we used only the
old gradients in the RMSProp case enables learning to pro-
                                                                 recommended values for them (that are in fact the default
ceed rapidly after finding the convex bowl, equivalently as
                                                                 values in Keras/TensorFlow libraries).
if AdaGrad would have been initialized within that convex
area.
                                                                 2.1   Topology
1.4   Adam
                                                                 The model we used is a simple multilayer perceptron. The
The name of the Adam algorithm [11] comes from “adap-            input layer consists of 784 input units that is the flatten
tive momentum”, and can be viewed as the combination of          representation of the input images of size 28 × 28 pixels.
adaptive learning rates and momentums.                           The input is connected to one hidden layer of 128 neurons
                                                                 with ReLU activation. The output layer corresponds to
                       1 m                                       the 10 output classes, thus it has 10 neurons with softmax
           gt ← ∇w (     ∑ fi (wt−1 ))                   (10)
                       m i=1                                     activation.
                  β1 mt−1 + (1 − β1 )gt                            In the implementation of the network, we relied on
          mt ←                                           (11)
                         1 − β1t                                 Keras NN API on a TensorFlow backend.
                β2 vt−1 + (1 − β2 )gt2
           vt ←                                          (12)
                       1 − β2t                                   2.2   Data
                          ηmt
           wt ← wt−1 − √         (element-wise)          (13)
                          vt + ε                                 For the experiment, we have chosen the Fashion MNIST
                                                                 dataset [12] that was planned to replace the MNIST bench-
  In Adam, the weight update is given by applying the            mark database.
RMSProp learning rate (12) on the momentum (11). (In
                                                                    From the characteristics of the FL scenario, in this ex-
Equation (12) and (11) the denominator is applied bias
                                                                 periment, we focused on non-iid nature of the data. That
correction on the estimates.) We are not aware of clear
                                                                 is, we have created local datasets of a highly skewed man-
theoretical understanding why this is advantageous, how-
                                                                 ner. Namely, training data at a given node contains ex-
ever, it seems to work very well in practice and became
                                                                 clusively, or almost exclusively, instances from the same
a de facto default optimization technique for a lot of ML
                                                                 class.
practitioners.
                                                                    For these experiment, we have not taken into account
                                                                 the unbalanced nature, and each node have been assigned
2     Experimental setup                                         the same amount of data. Our idea here was that if some-
                                                                 thing works in this simple setup then it might work in use
For analyzing thew performance of the optimizer algo-            cases closer to the real world problem.
rithms, we implemented a simulation environment that                Due to the lack of computational resources we also ig-
trains multiple local NN models that would be aggregated         nored the “massively distributed” condition and set the
into one common model, according to the Algorithm 1.             number of nodes to 10.
   Compared to Algorithm 1, we have been varying the                The distributions of the local datasets we tried in the
C LIENT U PDATE(k, w) method, where the local updates            experiments are the following:
99% non-iid The training data has been split into two              The best performing configurations reached in the first
parts in the ratio of 99%-1%, where the parts are indepen-      couple of iterations the highest accuracy achieved by the
dent and identically distributed, as best as possible. The      baseline during the entire experiment. According to our
99% part will be assorted accorded to classes and then one      results the higher the value β (that is the past directions in-
class assigned to one of the nodes. The 1% part will be         fluence stronger the update) is generally the better perfor-
equally split into 10 parts and then added to the dataset of    mance, apparently independently from η and decay rate.
the particular nodes.
                                                                FedSGD + AdaGrad The AdaGrad algorithm yields an
full-non-iid In the second test case, we assorted fully the     even better performance 3, on the 99% non-iid datasets.
training data and each node receives a dataset consisting       Using this method results in the fastest convergence until
of instances belonging purely to one single class.              the 70% of the baseline. In the first 30 rounds, though
                                                                AdaGrad’s performarnce drops dramatically full-non-iid
                                                                datasets, reaching at most a 25% accuracy without ob-
2.3   Hyperparameters                                           vious perspective further improvement. It might be inter-
                                                                esting to check how many random training examples need
To measure general applicability of the examined algo-          to be put into the full-non-iid datasets to achieve the very
rithms on the problem of FL, we executed the learning           good performance of the AdaGrad which is measured with
process multiple times, using different parametrisations.       the 99% non-iid datasets.
In setting the hyperparameters we followed the Method of
GridSearch, that is we defined a set of possible values for
each hyperparameter, then run the algorithms with all the       FedSGD + RMSProp The RMSProp optimizer is used in
combinations. At defining the set of the possible values we     the experiment which has produced the statistics in Figure
tried to include extremities and generally recommended          4. We experienced that this optimizer algorithm apart from
values. In addition to the parameters described in Section      the stronger variance of performance seems to approach
1 we also included experimenting with the decay of the          the accuracy of the AdaGrad and Nesterov momentum
learning rate. Here we only tried tried nevertheless two        methods, and outperforming MBGD baseline as well on
case at each configuration of the other parameters, the one     the 99% non-iid distribution. One the other hand though
without decay, and the one with time based decay, where         the accuracy on the full-non-iid still achieves a signifi-
                                         η0
the learning rate a time t will be ηt = 1+φ                     cantly worse performance compared to MBGD and Nes-
                                            ∗t with the decay
                       η0
rate parameter φ = max{t} .                                     terov momentum, however leraning curve is much more
                                                                promising than the one of AdaGrad. It reaches in the
                                                                best performing setups about 50% accuracy, and shows an
3     Results                                                   emerging tendency as well.

FedSGD - Simple Minibatch Gradient Descent As a                 FedSGD + Adam The last experiment, we were apply-
baseline we run the experiment with the standard Mini-          ing Adam. The method is one of the most popular and
batch Gradient Descent optimiation. The experiment re-          often default optimizer(11), thanks to fast convergence,
sult is shown in Figure 1. It is clear that, as it of-          high accuracy in the traditional NN learning, and to its
ten happens, the most simple algorithm, MBGD places             robustness to hyperparameter settings. In our experiment
the baseline rather high for the more sophisticated op-         however, Adam worked with a very similar effectiveness
timizer algorithm. It performs very well for the 99%            to RMSProp, and has been definitely outperformed by
non-iid datasets and surprisingly well with the full-non-       MBGD and Nesterov momentum (Figure 5) regarding to
iid datasets, achieving an accuracy close to 75% in the 30      performance and smoothness of learning on the full-non-
iterations with the best configuration of hyperparameters       iid datasets.
(η = 0.001, no decay).
   Moreover, in results it seems like on both distribution
too high learning rate without decay leads to a poor per-       4   Conclusion
formance.
                                                                In our experiment we found, that the best performing op-
                                                                timizer algorithms for both the distribution are Minibatch
FedSGD + Nesterov momentum In Figure 2, the results             Gradient Descent without and with Nesterov momentum,
using the Stochastic Gradient Descent with Nesterov mo-         whilst Adadelta and RMSProp is promising despite their
mentum can be seen. We found that incorporating local           poor performance on fully non-IID datasets. As one could
momentum into computing the partial directions of the up-       have assumed, the presence of the non-iid part of the train-
dates has a strong positive effect both on performance and      ing data has a very strong regularizing effect even if its
convergence rate of the aggregated model at both data dis-      weight seems to negligible compared to the dominating
tributions.                                                     class.
                                                           SGD

                                                                                                                       lr:0.001 decay:0.0 momentum:0.0
                                                                                                                       lr:0.001 decay:0.0 momentum:0.0
                                                                                                                       lr:0.01 decay:0.0 momentum:0.0
       0.7                                                              0.7                                            lr:1.0 decay:0.0 momentum:0.0
                                                                                                                       lr:0.001 decay:0.001 momentum:0.0
                                                                                                                       lr:0.01 decay:0.01 momentum:0.0
                                                                                                                       lr:1.0 decay:1.0 momentum:0.0


       0.6                                                              0.6


       0.5                                                              0.5
Accuracy


                                                             Accuracy
       0.4                                                              0.4


       0.3                                                              0.3


       0.2                                                              0.2


       0.1                                                              0.1

             0   5   10        15       20     25     30                      0   5   10        15      20   25   30
                          Iterations                                                       Iterations


                                       Figure 1: FedSGD baseline(simple minibatch updates )


                                                       Nesterov

       0.8                                                          0.8                                                lr:0.001 decay:0.0 momentum:0.5
                                                                                                                       lr:0.01 decay:0.0 momentum:0.5
                                                                                                                       lr:1.0 decay:0.0 momentum:0.5
                                                                                                                       lr:0.001 decay:0.001 momentum:0.5
                                                                                                                       lr:0.01 decay:0.01 momentum:0.5
                                                                                                                       lr:1.0 decay:1.0 momentum:0.5
                                                                                                                       lr:0.001 decay:0.0 momentum:0.9
       0.7                                                          0.7                                                lr:0.01 decay:0.0 momentum:0.9
                                                                                                                       lr:1.0 decay:0.0 momentum:0.9
                                                                                                                       lr:0.001 decay:0.001 momentum:0.9
                                                                                                                       lr:0.01 decay:0.01 momentum:0.9
                                                                                                                       lr:1.0 decay:1.0 momentum:0.9
                                                                                                                       lr:0.001 decay:0.0 momentum:0.95
                                                                                                                       lr:0.01 decay:0.0 momentum:0.95
       0.6                                                          0.6                                                lr:1.0 decay:0.0 momentum:0.95
                                                                                                                       lr:0.001 decay:0.001 momentum:0.95
                                                                                                                       lr:0.01 decay:0.01 momentum:0.95
                                                                                                                       lr:1.0 decay:1.0 momentum:0.95


       0.5                                                          0.5
Accuracy


                                                             Accuracy


       0.4                                                          0.4


       0.3                                                          0.3


       0.2                                                          0.2


       0.1                                                          0.1
             0   5   10        15       20     25     30                      0   5   10        15      20   25   30
                          Iterations                                                       Iterations


                                             Figure 2: FedSGD + Nesterov momentum
                                                          Adagrad


       0.8                                                              0.8                                                                              lr:0.001 decay:0.0 epsilon:1e-08
                                                                                                                                                         lr:0.01 decay:0.0 epsilon:1e-08
                                                                                                                                                         lr:1.0 decay:0.0 epsilon:1e-08
                                                                                                                                                         lr:0.001 decay:0.001 epsilon:1e-08
                                                                                                                                                         lr:0.01 decay:0.01 epsilon:1e-08
                                                                                                                                                         lr:1.0 decay:1.0 epsilon:1e-08

       0.7                                                              0.7


       0.6                                                              0.6


       0.5                                                              0.5
Accuracy


                                                                 Accuracy
       0.4                                                              0.4


       0.3                                                              0.3


       0.2                                                              0.2


       0.1                                                              0.1


             0   5   10        15      20    25       30                          0       5        10          15           20        25        30
                          Iterations                                                                      Iterations


                                                 Figure 3: FedSGD with AdaGrad


                                                      RMSprop

                                                                                                                                                 lr:0.001 decay:0.0 epsilon:1e-08 rho:0.5
       0.8                                                          0.8                                                                          lr:0.01 decay:0.0 epsilon:1e-08 rho:0.5
                                                                                                                                                 lr:1.0 decay:0.0 epsilon:1e-08 rho:0.5
                                                                                                                                                 lr:0.001 decay:0.001 epsilon:1e-08 rho:0.5
                                                                                                                                                 lr:0.01 decay:0.01 epsilon:1e-08 rho:0.5
                                                                                                                                                 lr:1.0 decay:1.0 epsilon:1e-08 rho:0.5
                                                                                                                                                 lr:0.001 decay:0.0 epsilon:1e-08 rho:0.9
                                                                                                                                                 lr:0.01 decay:0.0 epsilon:1e-08 rho:0.9
       0.7                                                          0.7                                                                          lr:1.0 decay:0.0 epsilon:1e-08 rho:0.9
                                                                                                                                                 lr:0.001 decay:0.001 epsilon:1e-08 rho:0.9
                                                                                                                                                 lr:0.01 decay:0.01 epsilon:1e-08 rho:0.9
                                                                                                                                                 lr:1.0 decay:1.0 epsilon:1e-08 rho:0.9
                                                                                                                                                 lr:0.001 decay:0.0 epsilon:1e-08 rho:0.95
                                                                                                                                                 lr:0.01 decay:0.0 epsilon:1e-08 rho:0.95
                                                                                                                                                 lr:1.0 decay:0.0 epsilon:1e-08 rho:0.95
       0.6                                                          0.6                                                                          lr:0.001 decay:0.001 epsilon:1e-08 rho:0.95
                                                                                                                                                 lr:0.01 decay:0.01 epsilon:1e-08 rho:0.95
                                                                                                                                                 lr:1.0 decay:1.0 epsilon:1e-08 rho:0.95


       0.5                                                          0.5
Accuracy


                                                             Accuracy


       0.4                                                          0.4


       0.3                                                          0.3


       0.2                                                          0.2


       0.1                                                          0.1


             0   5   10        15      20   25       30                       0       5       10             15        20        25        30
                          Iterations                                                                    Iterations


                                                 Figure 4: FedSGD with RMSProp
                                                        Adam


       0.8                                                        0.8                                                   lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.5
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.5
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.5
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.5 beta2:0.5
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.5 beta2:0.5
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.5 beta2:0.5
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.9

       0.7                                                        0.7                                                   lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.9
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.9
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.5 beta2:0.9
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.5 beta2:0.9
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.5 beta2:0.9
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.95
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.95

       0.6                                                        0.6
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.5 beta2:0.95
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.5 beta2:0.95
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.5 beta2:0.95
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.5 beta2:0.95
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.5
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.5
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.5
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.9 beta2:0.5
       0.5                                                        0.5                                                   lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.9 beta2:0.5
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.9 beta2:0.5
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.9
Accuracy


                                                           Accuracy
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.9
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.9
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.9 beta2:0.9
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.9 beta2:0.9
       0.4                                                        0.4                                                   lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.9 beta2:0.9
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.95
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.95
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.9 beta2:0.95
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.9 beta2:0.95
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.9 beta2:0.95
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.9 beta2:0.95
       0.3                                                        0.3                                                   lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.5
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.5
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.5
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.95 beta2:0.5
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.95 beta2:0.5
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.95 beta2:0.5
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.9
       0.2                                                        0.2                                                   lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.9
                                                                                                                        lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.9
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.95 beta2:0.9
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.95 beta2:0.9
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.95 beta2:0.9
                                                                                                                        lr:0.001 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.95
                                                                                                                        lr:0.01 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.95
       0.1                                                        0.1                                                   lr:1.0 decay:0.0 epsilon:1e-08 beta1:0.95 beta2:0.95
                                                                                                                        lr:0.001 decay:0.001 epsilon:1e-08 beta1:0.95 beta2:0.95
                                                                                                                        lr:0.01 decay:0.01 epsilon:1e-08 beta1:0.95 beta2:0.95
                                                                                                                        lr:1.0 decay:1.0 epsilon:1e-08 beta1:0.95 beta2:0.95


             0   5   10        15      20   25     30                   0     5      10        15      20   25     30
                          Iterations                                                      Iterations


                                             Figure 5: Federated Averaging - Adam


    In general we experienced that methods that are in-                     However looking at magnitude of performance differences
 tended to reduce the variance of the gradient direction                    it might be understandable.
 works actually quite well for our specific scenario (1).                      Although according to our results these optimizers are
 This can be because momentum techniques can be seen                        not clearly beneficial in perspective of finding the best
 as an averaging over the subsequent gradients, leading to                  global model, they still could be useful for optimizing the
 a less and less biased estimate of the optimal update di-                  global model at clients. One can argue, that in the end
 rection. The fact that strong momentum (high β ) seems                     the goal of the entire federated optimization is to provide
 to help in the big majority of configurations of the other                 clients with a model performs well on their own data.
 parameters supports this idea.
    On the other hand methods that aim at adapting the mag-
 nitude of the gradients seem to harm the learning process                  Acknowledgements        EFOP-3.6.3-VEKOP-16-2017-
 (2) in the full-non-iid case. The reason behind this phe-                  00001: Talent Management in Autonomous Vehicle
 nomenon is most probably, that the local optimisers update                 Control Technologies - The Project is supported by the
 their inner state – which here corresponds to η learning                   Hungarian Government and co-financed by the European
 rate – based on the local gradients. In each local training                Social Fund.
 round according to our intuition the magnitude of gradi-                     Supported by Telekom Innovation Laboratories (T-
 ents start growing, since the aggregation of the local mod-                Labs), the Research and Development unit of Deutsche
 els moved the model away from the locally optimal model,                   Telekom.
 but as we approaching again the local optima, they start
 shrinking again, resulting in slower start (smaller η) of
 optimization in the next iteration.                                        References
    The extremely poor performance of AdaGrad on the
 full-non-iid dataset can support this intuition, since it pre-
 vents even the described fluctuation of the learning rate,                  [1] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik,
                                                                                 “Federated optimization: Distributed machine learning for
 instead it decreases it continuously.
                                                                                 on-device intelligence,” arXiv preprint arXiv:1610.02527,
    The good performance of these algorithms on the 99%                          2016.
 non-iid might be explainable with the presence of gradi-                    [2] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al.,
 ents of really big magnitude in the decaying average that                       “Communication-efficient learning of deep networks from
 controls the learning rate keeping it at an effective level.                    decentralized data,” arXiv preprint arXiv:1602.05629,
    Another interesting phenomenon is that in case of Adam                       2016.
 – where momentum and adaptive learning rates are both                       [3] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz,
 applied – the strong deccelerating effect of learning rate                      “Revisiting distributed synchronous sgd,” arXiv preprint
 adaption apparently overrides the help of the momentum.                         arXiv:1604.00981, 2016.
 [4] L. Bottou, “Online learning and stochastic approxima-
     tions,” On-line learning in neural networks, vol. 17, no. 9,
     p. 142, 1998.
 [5] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.
     MIT press, 2016.
 [6] B. T. Polyak, “Some methods of speeding up the conver-
     gence of iteration methods,” USSR Computational Mathe-
     matics and Mathematical Physics, vol. 4, no. 5, pp. 1–17,
     1964.
 [7] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On
     the importance of initialization and momentum in deep
     learning,” in International conference on machine learning,
     2013, pp. 1139–1147.
 [8] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient
     methods for online learning and stochastic optimization,”
     Journal of Machine Learning Research, vol. 12, no. Jul,
     pp. 2121–2159, 2011.
 [9] G. Hinton, N. Srivastava, and K. Swersky, “Neural net-
     works for machine learning,” Coursera, video lectures, vol.
     264, 2012.
[10] M. D. Zeiler, “Adadelta: an adaptive learning rate method,”
     arXiv preprint arXiv:1212.5701, 2012.
[11] D. P. Kingma and J. Ba, “Adam: A method for stochastic
     optimization,” arXiv preprint arXiv:1412.6980, 2014.
[12] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel
     image dataset for benchmarking machine learning algo-
     rithms,” arXiv preprint arXiv:1708.07747, 2017.

</pre>