<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization in Federated Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vukasin Felbab</string-name>
          <email>vukasindfelbab@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Péter Kiss</string-name>
          <email>peter.kiss@inf.elte.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Horváth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Data Science and Engineering ELTE - Eötvös Loránd University, Faculty of Informatics Budapest</institution>
          ,
          <addr-line>H-1117 Budapest, Pázmány Péter sétány 1/C.</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Federated learning (FL) is an emerging branch of machine learning (ML) research that examines methods for scenarios where individual nodes possess parts of the data and the task is to form a single common model that fits the whole distribution. In FL, minibatch gradient descent is generally used for optimizing the weights of the models, and it appears to work very well in federated scenarios. For traditional machine learning setups, a number of modifications have been proposed to accelerate the learning process and to help overcome the challenges posed by the high dimensionality and non-convexity of the parameter search spaces. In this paper we present our experiments on applying different popular optimization methods for training neural networks in a federated manner.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Federated learning (FL) [1] is a new paradigm in Machine
Learning (ML) that deals with an increasingly
important distributed optimization setting, which came into view
with the spread of small user devices and of applications
written for them that can profit from ML. The domain of the ML
models is often the data collected on the devices; thus, to
train these models, one should incorporate the knowledge
contained in this data into the learning process. The traditional way
to do this would be to transfer the information gathered at
the users to data centers, where the training takes place,
and then send the trained models back to the users. That, apart
from the obvious privacy concerns, can incur a huge
communication overhead along with the need for significant
storage and computational resources at the place of
centralized training.</p>
      <p>The idea proposed in [1] is that, instead of moving the
training data to a centralized location, one could exploit the
computational power residing at the user devices and
distribute the training process across the participating nodes.</p>
    </sec>
    <sec id="sec-2">
      <title>Distributed Optimization</title>
      <p>In ML, the goal is to find a model for the training data that
minimizes a loss function f that defines how our learned
model distribution differs from the empirical distribution.</p>
      <p>This measure, in the general case, can be formalized as the
negative log-likelihood.</p>
      <p>f = −E_{x∼p_data}[log p_model(x)]
(1)
That is, it expresses how probable it is that an example x drawn
from the training data distribution is also produced by the model
distribution. If the model predicts some value(s) y based on a
vector of attributes x, this can be rewritten as</p>
      <p>f(x, y; w) = −log p(y | x; w)
(2)</p>
      <p>The problem we want to solve is to minimize the loss
function f with respect to the model parameters w, where f is an
aggregation of the losses over all available data points, as
follows:</p>
      <p>min_{w∈R^d} f(w), where f(w) = (1/n) Σ_{i=1}^{n} f_i(w),
(3)
where f_i(w) := f(x_i, y_i; w) denotes the loss on data point i
given the parametrization w.</p>
      <p>In the setup of FL, the characteristics of the data distribution
from which our training examples (x_i, y_i) will be drawn
are the following:
1. Massively Distributed. Data points are stored across
a large number K of nodes. In particular, the number
of nodes can be much bigger than the average number
of training examples stored on a given node (n/K).
2. Non-IID. Data on each node may be drawn from a
different distribution, i.e., the data points available
locally are far from being a representative sample of the
overall distribution.
3. Unbalanced. Different nodes may vary by orders of
magnitude in the number of training examples they
hold.</p>
      <p>If we adapt the objective function (see Eq. 3) to these
characteristics, our problem can be defined as introduced
in the following paragraphs.</p>
      <p>We have K nodes and n data points, a set of indices P_k
(k ∈ {1, ..., K}) of the data stored at node k, and n_k = |P_k| is
the number of data points at P_k. We assume that P_k ∩ P_l = ∅
whenever l ≠ k, thus Σ_{k=1}^{K} n_k = n.</p>
      <p>We can then define the local loss for node k as
F_k(w) := (1/n_k) Σ_{i∈P_k} f_i(w). Thus, the problem to be
minimized becomes:</p>
      <p>min_{w∈R^d} f(w) = Σ_{k=1}^{K} (n_k/n) F_k(w).
(4)</p>
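      <p>To make the notation concrete, the following minimal sketch
(our own illustration; node_data and local_loss are assumed to be
given) evaluates the objective of Eq. (4) as the n_k/n-weighted
average of the local losses F_k:</p>
      <preformat>
import numpy as np

def global_loss(w, node_data, local_loss):
    """Eq. (4): f(w) = sum_k (n_k / n) * F_k(w), where F_k(w) is the
    average loss over the n_k examples stored on node k."""
    n = sum(len(examples) for examples in node_data)
    f = 0.0
    for examples in node_data:              # one list of (x, y) pairs per node
        n_k = len(examples)
        F_k = np.mean([local_loss(w, x, y) for x, y in examples])
        f = f + (n_k / n) * F_k
    return f
      </preformat>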
      <p>To solve the problem in (4) the simplest algorithm is
FederatedSGD introduced in [2], that is equivalent to
minibatch gradient descent over all data, and it is a simple
application of distributed synchronous Stochastic Gradient
Descent (SGD) [3] for the described setup.</p>
      <sec id="sec-2-1">
        <p>Algorithm 1 FederatedSGD
1: procedure SERVER
2:   initialize w_0
3:   for t = 0, 1, 2, ... do
4:     for all k in the K nodes in parallel do
5:       w_{t+1}^k ← ClientUpdate(k, w_t)
6:     end for
7:     w_{t+1} = Σ_{k=1}^{K} (n_k/n) w_{t+1}^k
8:   end for
9: end procedure
10: procedure CLIENTUPDATE(k, w)
11:   B ← split P_k into a set of batches
12:   for all b ∈ B do
13:     w ← w − η ∇f(w; b)
14:   end for
15:   return w
16: end procedure</p>
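        <p>A compact NumPy sketch of Algorithm 1 (our own illustration,
not the original implementation; grad(w, batch) is an assumed
gradient oracle for the local loss, and the hyperparameter names
are ours):</p>
        <preformat>
import numpy as np

def client_update(w, data, grad, eta=0.01, batch_size=32):
    """One local pass: minibatch gradient steps on the node's data."""
    w = w.copy()
    idx = np.random.permutation(len(data))
    for start in range(0, len(data), batch_size):
        batch = [data[i] for i in idx[start:start + batch_size]]
        w = w - eta * grad(w, batch)
    return w

def federated_sgd(w0, node_data, grad, rounds=30):
    """Server loop of Algorithm 1: broadcast w, collect the client
    updates, and aggregate them weighted by n_k / n."""
    w = w0.copy()
    n = sum(len(d) for d in node_data)
    for _ in range(rounds):
        updates = [client_update(w, d, grad) for d in node_data]
        w = sum((len(d) / n) * w_k for w_k, d in zip(updates, node_data))
    return w
        </preformat>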
        <p>In neural network (NN) optimization, due to the
non-convexity of the loss functions, the most used methods
for optimizing the network parameters are gradient based,
more specifically versions of SGD [4]. Gradient
descent methods take the derivatives of the loss function with
respect to the parameters of the model, then move the parameter
values in the direction of the negative gradient.</p>
        <p>The pure form of SGD samples a random function (e.g.,
a random training data point) i_t ∈ {1, 2, ..., n} in iteration t
and performs the update:
w_{t+1} = w_t − η_t ∇f_{i_t}(w_t),
(5)
where η_t denotes the learning rate, which is, in the base
case, decaying during the learning to enforce convergence.
Intuitively, SGD works because evaluating the gradient at
a single training example gives an unbiased estimate of the
derivative of the error function over all the training
examples: E[∇f_{i_t}(w)] = ∇f(w).</p>
        <p>In practice, instead of applying the gradient for w at
each example, usually an average of gradients over b
randomly chosen examples is used, all evaluated at the
same w. This method is called minibatch gradient
descent (MBGD), which better exploits the parallel computational
capabilities of the hardware. (MBGD is still commonly
referred to as SGD.) Though SGD/MBGD in the above
form is very popular in optimization, the basic approach
can sometimes result in very slow learning. To tackle the
challenges incurred by the high curvature and noisy gradients
of the loss functions of NNs, a range of methods has been
proposed, based on exponentially decaying averages of the
gradients or on adapting the learning rates [5].</p>
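        <p>For completeness, one MBGD step under these definitions, as a
small sketch (grad_i(w, x, y) denotes the assumed per-example
gradient):</p>
        <preformat>
import numpy as np

def mbgd_step(w, batch, grad_i, eta=0.01):
    """One minibatch gradient descent step: average the per-example
    gradients, all evaluated at the same w, and step against it."""
    g = np.mean([grad_i(w, x, y) for x, y in batch], axis=0)
    return w - eta * g
        </preformat>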
        <p>In this paper, we investigate the effects of these
methods on the performance of federated training of artificial
neural networks.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Momentum techniques</title>
      <p>Momentum based techniques [6] use a velocity term
during learning, that is an exponentially weighted average
over the gradients of the past.</p>
      <p>v ← βv − η ∇_w((1/m) Σ_{i=1}^{m} f_i(w))
(6)
w ← w + v
(7)</p>
      <p>This term, on one hand, accelerates the learning process
and, on the other hand, helps to get over noisy gradients
and local minima or flat regions of the surface defined by the
error function f.</p>
      <p>A variant of the momentum algorithm, introduced in
[7], is based on Nesterov's accelerated gradient
method; it differs from the standard momentum update (6)
in the place of the evaluation of the gradient.</p>
      <p>v ← βv − η ∇_w((1/m) Σ_{i=1}^{m} f_i(w + αv))
(8)
w ← w + v</p>
      <p>In Nesterov's momentum, the gradients are evaluated
incorporating the velocity. This can be interpreted as adding
a correction factor to the standard momentum algorithm
[5].</p>
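      <p>The two update rules side by side, as a small sketch (our own
illustration; grad denotes the minibatch gradient of f, and the
momentum coefficient β is used for the look-ahead point, as is common):</p>
      <preformat>
def momentum_step(w, v, grad, eta=0.01, beta=0.9):
    """Classical momentum, Eqs. (6)-(7): accumulate a velocity, then move."""
    v = beta * v - eta * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, eta=0.01, beta=0.9):
    """Nesterov momentum, Eq. (8): the gradient is evaluated at the
    look-ahead point w + beta * v before the velocity is updated."""
    v = beta * v - eta * grad(w + beta * v)
    return w + v, v
      </preformat>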
    </sec>
    <sec id="sec-4">
      <title>Adaptive Learning Rates</title>
      <p>Setting up the learning rate is one of the most important
factors in the learning process and deeply influences the
performance. Thus, finding methods to adapt the learning rate
might yield a substantial increase in the speed of learning.
The AdaGrad algorithm [8] adjusts the learning rates
individually for each parameter, taking into account the whole
history of its gradients, following the assumption that
parameters whose gradients have accumulated a large magnitude
should take smaller steps:</p>
      <p>η_t = η / √(Σ_{τ=1}^{t−1} g_τ²),
where g_τ = ∂f/∂w_j for some parameter w_j, and thus η_t will
be the learning rate belonging to w_j at timestep t.</p>
      <p>It has been found empirically that aggregating the
gradients from the beginning of the optimization can lead to a too
fast decay of the learning rate, which, in some cases, leads to
weak performance.</p>
      <p>To remedy this problem, RMSProp [9] (proposed at the same
time by the authors of AdaDelta [10]) replaces this
aggregation with a decaying average, in the form:</p>
      <p>v_t = ρ v_{t−1} + (1 − ρ) g_t²,
η_t = η / (√v_t + ε).</p>
      <p>RMSProp has proven very effective in the non-convex
optimization problems of NNs; thus, it is one of the most often
used techniques in practice.</p>
      <p>According to the explanation in [5], AdaGrad is
designed to converge rapidly when applied to convex
functions. In non-convex cases it may pass through many structures
before arriving at a convex bowl, and, since it accumulates
the entire history of the squared gradients, the learning rate can
shrink prematurely and eventually vanish. In contrast, discarding
the old gradients in the RMSProp case enables learning to
proceed rapidly after finding the convex bowl, as if AdaGrad
had been initialized within that convex region.</p>
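      <p>The difference between the two accumulators in a small sketch
(our own illustration; eps is the usual numerical-stability constant):</p>
      <preformat>
import numpy as np

def adagrad_step(w, acc, g, eta=0.01, eps=1e-8):
    """AdaGrad: accumulate the entire history of squared gradients."""
    acc = acc + g ** 2
    return w - eta * g / (np.sqrt(acc) + eps), acc

def rmsprop_step(w, acc, g, eta=0.001, rho=0.9, eps=1e-8):
    """RMSProp: an exponentially decaying average of squared gradients."""
    acc = rho * acc + (1.0 - rho) * g ** 2
    return w - eta * g / (np.sqrt(acc) + eps), acc
      </preformat>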
    </sec>
    <sec id="sec-5">
      <title>Adam</title>
      <p>The name of the Adam algorithm [11] comes from
“adaptive moment estimation”, and it can be viewed as a combination
of adaptive learning rates and momentum.</p>
      <p>g_t = ∇_w((1/m) Σ_{i=1}^{m} f_i(w_{t−1}))
(9)
m_t = β_1 m_{t−1} + (1 − β_1) g_t
(10)
m̂_t = m_t / (1 − β_1^t)
(11)
v̂_t = (β_2 v_{t−1} + (1 − β_2) g_t²) / (1 − β_2^t)
(12)
w_t = w_{t−1} − η m̂_t / (√v̂_t + ε)   (element-wise)
(13)</p>
      <p>In Adam, the weight update (13) is given by applying the
RMSProp-style learning rate of (12) to the momentum
estimate of (11). (In Equations (11) and (12) the denominators
apply a bias correction to the estimates.) We are not aware of a
clear theoretical understanding of why this is advantageous;
however, it seems to work very well in practice and has become
a de facto default optimization technique for a lot of ML
practitioners.</p>
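      <p>One Adam step following Eqs. (9)-(13), as a sketch with the
usual default hyperparameters (our own illustration):</p>
      <preformat>
import numpy as np

def adam_step(w, m, v, g, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates (t starts at 1)."""
    m = b1 * m + (1.0 - b1) * g                      # first moment, Eq. (10)
    v = b2 * v + (1.0 - b2) * g ** 2                 # raw second moment
    m_hat = m / (1.0 - b1 ** t)                      # bias correction, Eq. (11)
    v_hat = v / (1.0 - b2 ** t)                      # bias correction, Eq. (12)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)     # element-wise step, Eq. (13)
    return w, m, v
      </preformat>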
      <sec id="sec-5-1">
        <title>Experimental setup</title>
        <p>For analyzing the performance of the optimizer
algorithms, we implemented a simulation environment that
trains multiple local NN models that are then aggregated
into one common model, according to Algorithm 1.</p>
        <p>Compared to Algorithm 1, we have been varying the
CLIENTUPDATE(k, w) method, in which the local updates
are calculated (except for the first experiment, since
it corresponds exactly to the MBGD method). The new
CLIENTUPDATE(k, w) method is introduced in Algorithm 2.</p>
        <p>Algorithm 2 ClientUpdate
1: procedure CLIENTUPDATE(k, w)
2:   B ← split P_k into a set of batches
3:   for all b ∈ B do
4:     Δw ← Optimizer(w, b)
5:     w ← w − Δw
6:   end for
7:   return w
8: end procedure</p>
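        <p>A sketch of how such a client update can be realized with the
Keras API (our own illustration, not the original implementation;
the function and variable names are ours):</p>
        <preformat>
import tensorflow as tf

def client_update(global_weights, x_k, y_k, optimizer, batch_size=32):
    """Local step of Algorithm 2 for node k with a pluggable optimizer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.set_weights(global_weights)
    model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
    model.fit(x_k, y_k, batch_size=batch_size, epochs=1, verbose=0)
    return model.get_weights()

# any Keras optimizer can be plugged in with its recommended defaults, e.g.
# tf.keras.optimizers.SGD(momentum=0.9, nesterov=True),
# tf.keras.optimizers.Adagrad(), tf.keras.optimizers.RMSprop(),
# or tf.keras.optimizers.Adam()
        </preformat>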
        <p>Naturally, all the optimizers have their own
hyperparameters, which should be tuned to get the best
possible results. However, for this experiment we used only the
recommended values (which are in fact the default
values in the Keras/TensorFlow libraries).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Topology</title>
      <p>The model we used is a simple multilayer perceptron. The
input layer consists of 784 input units, that is, the flattened
representation of the input images of size 28 × 28 pixels.
The input is connected to one hidden layer of 128 neurons
with ReLU activation. The output layer corresponds to
the 10 output classes, thus it has 10 neurons with softmax
activation.</p>
      <p>In the implementation of the network, we relied on the
Keras NN API on a TensorFlow backend.</p>
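      <p>For reference, a sketch of the described topology in Keras (the
optimizer and loss are configured separately for each experiment):</p>
      <preformat>
import tensorflow as tf

# 28 x 28 image flattened to 784 units, one 128-unit ReLU hidden layer,
# 10 softmax outputs for the 10 classes
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
      </preformat>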
    </sec>
    <sec id="sec-8">
      <title>Data</title>
      <p>For the experiment, we have chosen the Fashion MNIST
dataset [12] that was planned to replace the MNIST
benchmark database.</p>
      <p>Of the characteristics of the FL scenario, in this
experiment we focused on the non-IID nature of the data. That
is, we created local datasets in a highly skewed
manner: the training data at a given node contains
exclusively, or almost exclusively, instances from the same
class.</p>
      <p>For these experiments, we have not taken into account
the unbalanced nature, and each node has been assigned
the same amount of data. Our idea here was that if
something works in this simple setup, then it might work in use
cases closer to real-world problems.</p>
      <p>Due to the lack of computational resources we also
ignored the “massively distributed” condition and set the
number of nodes to 10.</p>
      <p>The distributions of the local datasets we tried in the
experiments are the following:
99% non-iid The training data has been split into two
parts in the ratio of 99%-1%, where the parts are kept
independent and identically distributed, as far as possible. The
99% part is sorted according to the classes and then each
class is assigned to one of the nodes. The 1% part is
equally split into 10 parts and then added to the datasets of
the particular nodes.
full-non-iid In the second test case, we fully sorted the
training data, and each node receives a dataset consisting
of instances belonging purely to one single class.</p>
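      <p>A sketch of how the 99% non-iid split can be produced for 10
nodes (our own illustration; y is the NumPy array of class labels):</p>
      <preformat>
import numpy as np

def split_99_noniid(y, num_nodes=10, iid_fraction=0.01, seed=0):
    """Per-node index lists: roughly one class per node plus a small IID share."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_iid = int(iid_fraction * len(y))
    iid_idx, skew_idx = idx[:n_iid], idx[n_iid:]
    # 99% part: sort by class label so each node gets (almost) one class
    skew_idx = skew_idx[np.argsort(y[skew_idx], kind="stable")]
    skew_parts = np.array_split(skew_idx, num_nodes)
    # 1% part: an equal IID slice for every node
    iid_parts = np.array_split(iid_idx, num_nodes)
    return [np.concatenate([a, b]) for a, b in zip(skew_parts, iid_parts)]

# node_indices = split_99_noniid(train_labels)
      </preformat>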
    </sec>
    <sec id="sec-9">
      <title>Hyperparameters</title>
      <p>To measure the general applicability of the examined
algorithms to the problem of FL, we executed the learning
process multiple times, using different parametrisations.
In setting the hyperparameters we followed the method of
grid search, that is, we defined a set of possible values for
each hyperparameter and then ran the algorithms with all the
combinations. When defining the sets of possible values we
tried to include extremities as well as generally recommended
values. In addition to the parameters described in Section
1, we also experimented with the decay of the
learning rate. Here we nevertheless tried only two
cases for each configuration of the other parameters: one
without decay, and one with time-based decay, where
the learning rate at time t is η_t = η_0 / (1 + φ t) with the decay
rate parameter φ = η_0 / max{t}.</p>
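      <p>A small sketch of this schedule (the values are illustrative,
not the full grid):</p>
      <preformat>
def eta(t, eta0=0.01, max_t=30):
    """Time-based decay: eta_t = eta0 / (1 + phi * t), with phi = eta0 / max_t."""
    phi = eta0 / max_t
    return eta0 / (1.0 + phi * t)

# learning rates over the 30 rounds of one experiment
schedule = [eta(t) for t in range(30)]
      </preformat>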
      <sec id="sec-9-1">
        <title>Results</title>
        <p>FedSGD - Simple Minibatch Gradient Descent As a
baseline we ran the experiment with the standard
Minibatch Gradient Descent optimization. The experiment
result is shown in Figure 1. It is clear that, as
often happens, the simplest algorithm, MBGD, places
the baseline rather high for the more sophisticated
optimizer algorithms. It performs very well on the 99%
non-iid datasets and surprisingly well on the
full-non-iid datasets, achieving an accuracy close to 75% in the 30
iterations with the best configuration of hyperparameters
(η = 0.001, no decay).</p>
        <p>Moreover, the results suggest that, on both distributions,
a too high learning rate without decay leads to poor
performance.</p>
        <p>FedSGD + Nesterov momentum In Figure 2, the results
using Stochastic Gradient Descent with Nesterov
momentum can be seen. We found that incorporating local
momentum into computing the partial directions of the
updates has a strong positive effect both on the performance and
on the convergence rate of the aggregated model, on both data
distributions.</p>
        <p>The best performing configurations reached, within the first
couple of iterations, the highest accuracy achieved by the
baseline during the entire experiment. According to our
results, the higher the value of β (that is, the more strongly the
past directions influence the update), the better the
performance, apparently independently of η and the decay rate.
FedSGD + AdaGrad The AdaGrad algorithm yields an
even better performance (Figure 3) on the 99% non-iid datasets.
Using this method results in the fastest convergence, up to
about 70% of the baseline, in the first 30 rounds. Though,
AdaGrad's performance drops dramatically on the full-non-iid
datasets, reaching at most a 25% accuracy without an
obvious prospect of further improvement. It might be
interesting to check how many random training examples need
to be put into the full-non-iid datasets to achieve the very
good performance of AdaGrad that is measured with
the 99% non-iid datasets.</p>
        <p>FedSGD + RMSProp The RMSProp optimizer is used in
the experiment which produced the statistics in Figure
4. We experienced that this optimizer, apart from
a stronger variance of performance, seems to approach
the accuracy of the AdaGrad and Nesterov momentum
methods, and outperforms the MBGD baseline as well on
the 99% non-iid distribution. On the other hand, the
accuracy on the full-non-iid datasets is still
significantly worse compared to MBGD and
Nesterov momentum; however, the learning curve is much more
promising than that of AdaGrad. It reaches, in the
best performing setups, about 50% accuracy, and shows an
emerging tendency as well.</p>
        <p>FedSGD + Adam In the last experiment, we
applied Adam. The method is one of the most popular and
often the default optimizer [11], thanks to its fast convergence,
high accuracy in traditional NN learning, and its
robustness to hyperparameter settings. In our experiment,
however, Adam worked with a very similar effectiveness
to RMSProp, and has been definitely outperformed by
MBGD and Nesterov momentum (Figure 5) regarding
performance and smoothness of learning on the
full-non-iid datasets.</p>
      </sec>
      <sec id="sec-9-2">
        <title>Conclusion</title>
        <p>In our experiments we found that the best performing
optimizer algorithms for both distributions are Minibatch
Gradient Descent without and with Nesterov momentum,
whilst Adam and RMSProp are promising despite their
poor performance on fully non-IID datasets. As one could
have assumed, the presence of the iid part of the
training data has a very strong regularizing effect even if its
weight seems to be negligible compared to the dominating
class.</p>
        <p>[Figures: accuracy (0.1 to 0.8) versus iterations (0 to 30)
for the examined optimizers; the visible panel titles are Nesterov
and RMSprop.]</p>
        <p>Learning rate adaption apparently is not clearly beneficial
according to our results; however, looking at the magnitude of the
performance differences, it might be understandable. Although these
optimizers do not seem to help from the perspective of finding the
best global model, they still could be useful for optimizing the
model at the clients. One can argue that the goal of the entire
federated optimization is, in the end, to provide the clients with
a model that performs well on their own data.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgements</title>
      <p>EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in
Autonomous Vehicle Control Technologies. The Project is
supported by the Hungarian Government and co-financed by
the European Social Fund. Supported by Telekom Innovation
Laboratories (T-Labs), the Research and Development unit of
Deutsche Telekom.</p>
    </sec>
    <sec id="sec-11">
      <title>References</title>
      <p>[1] J. Konečný, … , 2016.
[2] … , 2016.
[3] …</p>
        <p>[4] L. Bottou, “Online learning and stochastic
approximations,” On-line learning in neural networks, vol. 17, no. 9,
p. 142, 1998.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.
MIT press, 2016.
[6] B. T. Polyak, “Some methods of speeding up the
convergence of iteration methods,” USSR Computational
Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17,
1964.
[7] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On
the importance of initialization and momentum in deep
learning,” in International conference on machine learning,
2013, pp. 1139–1147.
[8] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient
methods for online learning and stochastic optimization,”
Journal of Machine Learning Research, vol. 12, no. Jul,
pp. 2121–2159, 2011.
[9] G. Hinton, N. Srivastava, and K. Swersky, “Neural
networks for machine learning,” Coursera, video lectures, vol.
264, 2012.
[10] M. D. Zeiler, “Adadelta: an adaptive learning rate method,”
arXiv preprint arXiv:1212.5701, 2012.
[11] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” arXiv preprint arXiv:1412.6980, 2014.
[12] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel
image dataset for benchmarking machine learning
algorithms,” arXiv preprint arXiv:1708.07747, 2017.</p>
      </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>