=Paper=
{{Paper
|id=Vol-2604/paper61
|storemode=property
|title=First-Order Optimization (Training) Algorithms in Deep Learning
|pdfUrl=https://ceur-ws.org/Vol-2604/paper61.pdf
|volume=Vol-2604
|authors=Oleg Rudenko,Oleksandr Bezsonov,Kyrylo Oliinyk
|dblpUrl=https://dblp.org/rec/conf/colins/RudenkoBO20
}}
==First-Order Optimization (Training) Algorithms in Deep Learning==
Oleg Rudenko1[0000-0003-0859-2015], Oleksandr Bezsonov1[0000-0001-6104-4275], Kyrylo Oliinyk1[0000-0001-8536-5217]
1 Kharkiv National University of Radio Electronics, Nauky Ave. 14, Kharkiv, 61166, Ukraine
oleh.rudenko@nure.ua
oleksandr.bezsonov@nure.ua
kirill.oleynik.olegovich@gmail.com
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract. The use of artificial neural networks (ANN) requires solving structural and parametric identification problems corresponding to the choice of the optimal network topology and its training (parameter settings). In contrast to the problem of determining the structure, which is a discrete (combinatorial) optimization, the search for optimal parameters is carried out in continuous space using some optimization methods. The most widely used optimization methods in deep learning are first-order algorithms based on gradient descent (GD). This paper provides a comparative analysis of convolutional neural network training algorithms used in image recognition tasks. The training algorithms were compared on the Oxford 17 Category Flower dataset using the TensorFlow framework. Studies show that a simple gradient descent algorithm is quite effective for this task. At the same time, however, the problem of selecting the optimal values of the algorithms' parameters that provide the highest learning speed remains open.
Keywords: Convolution, Optimization, Neural Network, Algorithm, Gradient, Training, Image Recognition.
1 Introduction
Deep Learning is a class of Artificial Neural Networks (ANN) that have many processing layers. There is a huge number of ANN architecture variants in the literature. ANNs can be used as a very effective technology for solving a wide class of problems [1-5]. After the breakthrough result in the ImageNet classification challenge [2], different kinds of neural networks, i.e. convolutional NN (CNN) architectures, have been proposed, and their performance has improved year by year [6, 7].
The use of ANN requires solving structural and parametric identification problems corresponding to the choice of the optimal network topology and its training (parameter settings). In contrast to the problem of determining the structure, which is a discrete (combinatorial) optimization, the search for optimal parameters is carried out in continuous space using classical optimization methods. To train feedforward networks in a supervised manner, algorithms that optimize some objective function are usually used. There are a lot of works that aim to improve ANN in different aspects (architecture design, choice and optimization of training algorithms and so on).
The most widely used optimization methods in deep learning are first-order algorithms based on gradient descent (GD). The backpropagation (BP) algorithm, which uses GD, is the standard training method for ANN. These methods can be split into the following categories: the batch gradient descent method, the mini-batch gradient descent method, and the stochastic gradient descent method (SGD) [8, 9]. The GD method is the earliest optimization method, and it often converges slowly. The batch gradient descent method has high computational complexity for large-scale data.
The use of SGD is the predominant methodology for training deep networks (CNN).
2 The Structure of the Convolution Neural Network
Initially, the convolutional neural network structure was created taking into account the structural features of some parts of the human brain responsible for vision. The development of such networks is based on three mechanisms:
- local perception;
- forming a set of layers in the shape of feature maps (shared weights);
- sub-sampling (pooling).
Local perception means that an input neuron receives not the whole picture but only some part of it. This helps to preserve the image configuration during the transition from layer to layer.
The idea of shared weights means that a large number of connections use a small set of weights, i.e. each area into which the image is divided will be processed by the same set of weights. Such artificial limitation of weights improves the network's generalization property.
A CNN consists of convolution layers, sub-sampling layers and fully connected neural network layers.
3 Convolution Neural Network Layers
CNN got its name from the "convolution" operator. The main purpose of convolution in the CNN case is to extract features from the input image.
Convolution preserves the spatial relations among pixels, learning the features of the image from small patches of the input data.
Each neuron in a plane of the convolutional layer receives its inputs from a certain region of the previous layer (local receptive field).
A subsampling layer shrinks the feature planes by local averaging of the neurons' output values. Subsequent layers extract more general characteristics that are robust to image distortion.
Each convolutional layer is followed by a subsampling (computational) layer, which reduces the image dimension by local averaging of the neurons' output values.
The architecture of the convolutional network assumes that evidence of a feature's existence is more important information than its exact location. Therefore, from a group of neighboring neurons in a feature map, the neuron with the maximum value is chosen for the feature map of smaller dimension.
The difference between a subsampling layer and a convolution layer is that in the convolution layer the neighboring receptive fields overlap, which does not occur in the subsampling layer.
Thus, a CNN is constructed by alternating convolution and subsampling layers. At the output of the network several layers of a fully connected neural network are usually installed. The input for these layers is the final feature map. Each neuron of the output layer is a perceptron with a non-linear activation function.
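The paper does not publish its exact network configuration, so the following is only an illustrative sketch of the alternating convolution / subsampling / fully connected structure described above, written with the TensorFlow Keras API used later in the experiments. The layer sizes, the 224x224x3 input and the 17-class softmax output are assumptions (the output size matches the flower dataset used in Section 5).

```python
import tensorflow as tf

# Illustrative sketch only: not the authors' exact architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',
                           input_shape=(224, 224, 3)),  # convolution: local receptive fields, shared weights
    tf.keras.layers.MaxPooling2D((2, 2)),               # subsampling: keep the strongest local response
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                          # final feature map feeds the fully connected part
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(17, activation='softmax'),    # one output per flower class
])
model.summary()
```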
4 Training Methods of Convolution Neural Network
For convolutional neural network training, a standard backpropagation algorithm and its various modifications can be used. The basis of this method is the stochastic gradient descent algorithm.
4.1 Training Based on Stochastic Gradient
Stochastic gradient descent (SGD) and its variants are the most widely studied algorithms for optimization problems in machine learning and stochastic approximation.
The usual gradient descent is described by the following relation
$$\theta(k+1) = \theta(k) - \eta\,\nabla J(\theta(k)), \qquad (1)$$
where $\theta$ is the $N \times 1$ vector of network parameters, $J(\theta(k))$ is the loss function, and $\eta$ is the training speed parameter (learning rate).
Convergence of algorithm (1) is generally not guaranteed, but it is proved [9, 10] that in the case of a convex function $J(\theta(k))$ and under the following conditions:
$$\lim_{k \to \infty} \eta(k) = 0; \qquad \sum_{k=1}^{\infty} \eta(k) = \infty; \qquad \sum_{k=1}^{\infty} \eta^{2}(k) < \infty, \qquad (2)$$
the gradient descent process will converge.
There are two main approaches to implementing gradient descent (a minimal sketch of both follows this list):
- Batch, when the whole training sample is viewed at each iteration, and only after that is $\theta$ changed. This requires a large computational cost.
- Stochastic (online), where at each iteration of the algorithm some (random) object is selected from the training set. Thus, the vector $\theta$ is adjusted for each newly selected object.
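As a minimal sketch of the two approaches (not code from the paper), the Python/NumPy fragment below contrasts a batch update of (1) with the stochastic (online) variant for an illustrative squared-error loss of a linear model; the choice of loss, the data shapes, the learning rate and the epoch count are assumptions.

```python
import numpy as np

def grad_J(theta, X, y):
    """Gradient of a squared-error loss for a linear model (illustrative choice of J)."""
    return X.T @ (X @ theta - y) / len(y)

def batch_gd(theta, X, y, lr=0.1, epochs=100):
    # Batch variant: the whole training sample is used for every update of theta.
    for _ in range(epochs):
        theta = theta - lr * grad_J(theta, X, y)
    return theta

def stochastic_gd(theta, X, y, lr=0.1, epochs=100, seed=0):
    # Stochastic (online) variant: one randomly chosen object per update.
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            theta = theta - lr * grad_J(theta, X[i:i + 1], y[i:i + 1])
    return theta
```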
The following disadvantages are inherent to this algorithm:
- getting stuck in local minima and saddle points of the minimized functional;
- slow convergence due to the difficult terrain of the objective function, when plateau regions alternate with strong nonlinearity (the derivative on the plateau is almost zero, while a sudden drop, on the contrary, can strongly change the parameter estimate);
- some parameters are updated less often than others, especially when informative but rare features occur in the data, which harms the finer points of the network's generalization; on the other hand, giving too much importance to all rarely seen features can lead to overtraining;
- a too small value of $\eta$ leads to slow convergence and getting stuck in local minima, while a too large value of $\eta$ leads to "overshooting" narrow global minima or even to a complete lack of convergence.
Using second-order methods requires calculating the Hessian, the matrix of derivatives for each pair of parameters, and, for Newton's method, additionally its inverse, i.e. the implementation of these methods involves considerable computing effort.
Therefore, in practice, widespread methods are based on the stochastic gradient method, which has a number of advantages [11].
Let us consider these methods in more detail.
4.2 Stochastic Average Gradient (SAG)
The stochastic average gradient (SAG) algorithm [12] is a variance reduction strategy proposed to increase the speed of convergence.
The SAG iterations take the form
$$\theta(k+1) = \theta(k) - \frac{\eta(k)}{n}\sum_{i=1}^{n} y_i(k), \qquad y_i(k) = \begin{cases}\nabla J_i(\theta(k)) & \text{if } i = i_k,\\ y_i(k-1) & \text{otherwise,}\end{cases} \qquad (3)$$
where $\eta(k)$ is the learning rate and $i_k$ is a random index selected at each iteration.
Essentially, SAG maintains in memory the gradient with respect to each function in the sum of functions being optimized. The gradient value of only one such function is computed and updated in memory at each iteration. The SAG method estimates the overall gradient by averaging the gradient values stored in memory.
However, the SAG technique can be utilized only with a smooth loss function and a convex objective function. SAG has better convergence compared to SGD in tasks like convex linear prediction problems.
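Below is a minimal NumPy sketch of iteration (3); it is an illustration rather than the authors' implementation, and the callable interface grads(i, theta), the constant step size and the iteration count are assumptions.

```python
import numpy as np

def sag(theta, grads, n, n_iter, lr, seed=0):
    """Minimal SAG sketch for minimizing (1/n) * sum_i J_i(theta).

    `grads(i, theta)` is assumed to return the gradient of J_i at theta.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros((n, theta.size))       # stored gradient y_i for every J_i
    y_sum = np.zeros(theta.size)        # running sum of the stored gradients
    for _ in range(n_iter):
        i = rng.integers(n)             # random index i_k
        g_new = grads(i, theta)
        y_sum += g_new - y[i]           # keep sum_i y_i(k) up to date
        y[i] = g_new                    # only the i_k-th gradient is refreshed, eq. (3)
        theta = theta - lr * y_sum / n  # step along the average stored gradient
    return theta
```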
4.3 Stochastic Variance Reduction Gradient (SVRG)
The SVRG algorithm [13] calculates the gradient at the snapshot point $\tilde{\theta}$ in the following way:
$$\tilde{g}(\tilde{\theta}) = \frac{1}{N}\sum_{i=1}^{N} g_i(\tilde{\theta}), \qquad (4)$$
where $\tilde{\theta}$ is the interval update parameter. SVRG performs gradient updates by using the following equation:
$$\theta(k+1) = \theta(k) - \eta\left(g_{i_k}(\theta(k)) - g_{i_k}(\tilde{\theta}) + \tilde{g}(\tilde{\theta})\right). \qquad (5)$$
The gradient is calculated up to two times during each update. After $w$ iterations, the parameter $\tilde{\theta}$ is updated and the next $w$ iterations start. Through these updates, $\theta(k+1)$ and the interval update parameter $\tilde{\theta}$ converge to the optimum $\theta^{*}$, then $\tilde{g}(\theta^{*}) \to 0$, and
$$g_{i_k}(\theta(k)) - g_{i_k}(\tilde{\theta}) + \tilde{g}(\tilde{\theta}) \to g_{i_k}(\theta^{*}) - g_{i_k}(\theta^{*}) + \tilde{g}(\theta^{*}) = 0. \qquad (6)$$
There are also many variants of such linearly convergent stochastic optimization algorithms, such as the SAGA algorithm [13, 14].
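The following NumPy sketch illustrates the snapshot/inner-loop structure of (4)-(5); as with the SAG sketch, the grads(i, theta) interface, the inner-loop length m and the step size are assumptions.

```python
import numpy as np

def svrg(theta, grads, n, n_epochs, m, lr, seed=0):
    """Minimal SVRG sketch; `grads(i, theta)` is assumed to return the gradient of J_i at theta."""
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        theta_snap = theta.copy()
        # Full gradient at the snapshot ("interval update parameter"), eq. (4).
        g_full = np.mean([grads(i, theta_snap) for i in range(n)], axis=0)
        for _ in range(m):              # w inner iterations per snapshot
            i = rng.integers(n)
            # Variance-reduced step, eq. (5).
            theta = theta - lr * (grads(i, theta) - grads(i, theta_snap) + g_full)
    return theta
```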
4.4 Momentum
Instead of depending on the current gradient only for updating the weights, the gradient descent algorithm with momentum [15] replaces the current gradient with $v(k+1)$ ($v$ stands for velocity), an exponential moving average of the current and past gradients (i.e., up to time $k+1$):
$$\theta(k+1) = \theta(k) - \eta\, v(k+1); \qquad (7)$$
$$v(k+1) = \beta\, v(k) + (1-\beta)\,\nabla J(\theta(k)), \qquad (8)$$
where $\beta \approx 0.9$.
Later this momentum update became a standard part of gradient-based parameter updates.
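A one-step sketch of the momentum update (7)-(8) is shown below; the placement of the learning rate in (7) follows the reconstruction above, and the default values are only illustrative.

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One update of gradient descent with momentum as in eqs. (7)-(8)."""
    v = beta * v + (1.0 - beta) * grad   # exponential moving average of gradients, eq. (8)
    theta = theta - lr * v               # move along the accumulated velocity, eq. (7)
    return theta, v
```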
4.5 NAG (Nesterov Accelerated Gradient)
This algorithm [16] implements the idea of momentum accumulation by using information about the change of each parameter in the form of an exponential moving average
$$v(k+1) = \beta\, v(k) + (1-\beta)\, x. \qquad (9)$$
The momentum algorithm accumulates the gradient of the network's objective function
$$v(k+1) = \alpha\, v(k) + \eta\,\nabla J(\theta(k)), \qquad (10)$$
which is used during the parameter correction
$$\theta(k+1) = \theta(k) - v(k+1). \qquad (11)$$
A more precise correction algorithm for NAG has the form
$$\theta(k+1) = \theta(k) - v(k+1); \qquad (12)$$
$$v(k+1) = \alpha\, v(k) + \eta\,\nabla J(\theta(k) - \alpha\, v(k)). \qquad (13)$$
4.6 SNM (Simplified Nesterov Momentum)
In [17] a new formulation of Nesterov momentum differing from (8) and (9) was offered. The main difference lies in committing to the "peeked-ahead" parameters $\Theta(k) = \theta(k) + \mu(k)\,v(k)$ and backtracking by the same amount before each update. The updates of these new parameters $\Theta(k+1)$ become [17]:
$$v(k+1) = \mu(k)\, v(k) - \varepsilon(k)\, \nabla J(\Theta(k)); \qquad (14)$$
$$\Theta(k+1) = \Theta(k) - \mu(k)\,v(k) + \mu(k+1)\,v(k+1) + v(k+1) = \Theta(k) + \mu(k)\,\mu(k+1)\,v(k) - (1 + \mu(k+1))\,\varepsilon(k)\,\nabla J(\Theta(k)), \qquad (15)$$
where $\mu(k)$ is the momentum coefficient and $\varepsilon(k)$ is the learning rate.
4.7 Adagrad
Adagrad (adaptive gradient) [18] takes into account the frequency of neuron activations by storing, for each network parameter, the sum of its squared updates:
$$G(k+1) = G(k) + g^{2}(k), \qquad (16)$$
and the parameter correction is carried out according to the rule
$$\theta(k+1) = \theta(k) - \frac{\eta}{\sqrt{G(k)} + \epsilon}\, g(k), \qquad (17)$$
where $g(k) = \nabla J(\theta(k))$, $G(k)$ is the sum of the squared updates, and $\epsilon$ is a smoothing parameter required in order to avoid division by 0. For a frequently updated parameter, $G(k)$ is large (a large denominator in (17)), i.e. the parameter will change only slightly. Rarely changed parameters will change substantially. The parameter $\epsilon$ is generally selected in the range $10^{-6}$ - $10^{-8}$.
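A one-step sketch of (16)-(17) is given below; the default learning rate is only illustrative.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """One Adagrad update as in eqs. (16)-(17)."""
    G = G + grad ** 2                                  # per-parameter sum of squared gradients, eq. (16)
    theta = theta - lr * grad / (np.sqrt(G) + eps)     # rarely updated parameters get larger steps, eq. (17)
    return theta, G
```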
4.8 RMSProp
A disadvantage of Adagrad is that $G(k)$ in (17) can grow without any limitation. As a result, after a short time the update becomes too small, which leads to paralysis of the algorithm. RMSProp and Adadelta are designed to solve this problem [19].
Like Adagrad, RMSProp modifies the parameters in such a way that frequently updated weights are adjusted less. To do this, instead of the full sum of the updates, the squared gradient averaged over history is used, i.e. a moving average of the following form
$$M\{g^{2}(k+1)\} = \gamma\, M\{g^{2}(k)\} + (1-\gamma)\, g^{2}(k); \qquad (18)$$
then instead of (17) we obtain
$$\theta(k+1) = \theta(k) - \frac{\eta}{\sqrt{M\{g^{2}(k)\}} + \epsilon}\, g(k). \qquad (19)$$
Since the denominator is the root of the mean squared gradient,
$$RMS\{g(k)\} = \sqrt{M\{g^{2}(k)\}}, \qquad (20)$$
the algorithm got its name RMSProp, root mean square propagation.
4.9 Adadelta
Adadelta is a continuation of Adagrad [20] that aims to avoid a monotonic decrease of the training speed. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to a fixed size.
Adadelta differs from RMSProp in that the numerator of (17) is multiplied by a stabilizing term proportional to the RMS of $\Delta\theta(k)$. At step $k+1$ the value of $RMS\{\Delta\theta(k+1)\}$ is not yet known, so the update of the parameters is implemented in three stages instead of two: at the first stage the squared gradient is accumulated, then $\theta$ is updated, and finally $RMS\{\Delta\theta(k)\}$ is updated:
$$\theta(k+1) = \theta(k) - \frac{RMS\{\Delta\theta(k)\}}{RMS\{g(k)\}}\, g(k); \qquad (21)$$
$$M\{\Delta\theta^{2}(k+1)\} = \gamma\, M\{\Delta\theta^{2}(k)\} + (1-\gamma)\,\Delta\theta^{2}(k); \qquad (22)$$
$$RMS\{\Delta\theta(k)\} = \sqrt{M\{\Delta\theta^{2}(k)\} + \epsilon}. \qquad (23)$$
For RMSProp, Adadelta and Adagrad there is no need to choose the learning rate very accurately; only its approximate value is needed. It is usually advised to start with a learning rate of 0.1-1 and to leave $\gamma$ at 0.9. The closer $\gamma$ is to 1, the longer RMSProp and Adadelta with a large $RMS\{\Delta\theta(k)\}$ will keep updating rarely used weights. If $\gamma \approx 1$ and $RMS\{\Delta\theta(k)\} \approx 0$, then Adadelta treats rarely used weights "with a grain of salt" for a longer time, which can lead to paralysis of the algorithm, or it can intentionally cause "greedy" behavior, when the algorithm first updates the neurons that encode the best features.
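A one-step sketch of Adadelta combining (18) and (21)-(23) follows; placing $\epsilon$ inside both roots and the value of $\gamma$ are assumptions consistent with the discussion above.

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad, gamma=0.9, eps=1e-6):
    """One Adadelta update following eqs. (18) and (21)-(23)."""
    Eg2 = gamma * Eg2 + (1.0 - gamma) * grad ** 2           # moving average of g^2, eq. (18)
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad   # RMS{dtheta}/RMS{g} * g, eq. (21)
    Edx2 = gamma * Edx2 + (1.0 - gamma) * dx ** 2           # moving average of dtheta^2, eq. (22)
    theta = theta + dx
    return theta, Eg2, Edx2
```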
4.10 Adam
Adam (Adaptive moment estimation) is another optimization algorithm. It combines the idea of momentum accumulation and the idea of weaker weight updates for typical features [21]. By analogy with (8) we can obtain:
$$m(k+1) = \beta_1\, m(k) + (1-\beta_1)\, g(k). \qquad (24)$$
Unlike Nesterov momentum, Adam accumulates not only the moving average of the gradient but also information about its variability. To obtain information about the gradient's change, in [21] it is proposed to additionally estimate the average dispersion:
$$v(k+1) = \beta_2\, v(k) + (1-\beta_2)\, g^{2}(k); \qquad (25)$$
$$\theta(k+1) = \theta(k) - \eta\,\frac{\sqrt{1-\beta_2^{\,k+1}}}{1-\beta_1^{\,k+1}}\cdot\frac{m(k+1)}{\sqrt{v(k+1)} + \epsilon}, \qquad (26)$$
where $\beta_1, \beta_2 \in [0,1)$, $g(k)$ is the stochastic gradient of $J$, $\eta$ is the step size, and $\epsilon$ is a small constant.
The authors of Adam offered $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ as defaults and argue that, due to the initial bias correction, the algorithm performs better than or about the same as all previous algorithms on a broad set of datasets.
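A one-step sketch of Adam is shown below; it uses the explicitly bias-corrected form, which is equivalent to (26) up to the placement of $\epsilon$, and the default values match those quoted above.

```python
import numpy as np

def adam_step(theta, m, v, grad, k, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following eqs. (24)-(26); k is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment, eq. (24)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment, eq. (25)
    m_hat = m / (1.0 - beta1 ** k)              # bias correction of the first moment
    v_hat = v / (1.0 - beta2 ** k)              # bias correction of the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # eq. (26) up to the placement of eps
    return theta, m, v
```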
4.11 AdaMax
The AdaMax algorithm [22] is a modification of Adam in which, instead of the second moment, a moment of arbitrary order $p$ of the gradient distribution is used. While high-order moments generally lead to numerical instability, in practice the limiting case $p \to \infty$ works surprisingly well:
$$v(k+1) = \beta_2^{\,p}\, v(k) + (1-\beta_2^{\,p})\, |g(k)|^{p}. \qquad (27)$$
4.12 Adapg
Adapg combines Adadelta and Adam [6]:
$$M\{\Delta\theta^{2}(k+1)\} = \gamma\, M\{\Delta\theta^{2}(k)\} + (1-\gamma)\,\Delta\theta^{2}(k). \qquad (28)$$
4.13 HAdam (High-order Adam)
Using the core update law of Adam [23], it can be rewritten by induction as follows:
$$m(k+1) = (1-\beta_1)\sum_{i=0}^{k}\beta_1^{\,i}\, g(k-i); \qquad (29)$$
$$v(k+1) = (1-\beta_2)\sum_{i=0}^{k}\beta_2^{\,i}\, g^{2}(k-i). \qquad (30)$$
It is easy to see that $m(k+1)$ and $v(k+1)$ differ because of the use of the exponential moving average. In [23] the second moment was extended to higher-order moments.
4.14 Nadam
The Nadam algorithm (Nesterov-accelerated Adaptive Moment Estimation) [22] is a modification of Adam that incorporates Nesterov momentum into the adaptive moment estimates:
$$\theta(k+1) = \theta(k) - \frac{\eta}{\sqrt{\hat G(k)} + \epsilon}\left(\beta_1\,\hat g(k) + \frac{1-\beta_1}{1-\beta_1^{\,k}}\,\nabla J(\theta(k))\right); \qquad (31)$$
$$\hat g(k) = \frac{g(k)}{1-\beta_1^{\,k}}; \qquad (32)$$
$$\hat G(k) = \frac{G(k)}{1-\beta_2^{\,k}}; \qquad (33)$$
$$g(k+1) = \beta_1\, g(k) + (1-\beta_1)\,\nabla J(\theta(k)); \qquad (34)$$
$$G(k+1) = \beta_2\, G(k) + (1-\beta_2)\,\left(\nabla J(\theta(k))\right)^{2}; \qquad (35)$$
$$\eta = 0.002; \quad \beta_1 = 0.9; \quad \beta_2 = 0.999; \quad \epsilon = 10^{-7}.$$
4.15 AMSGrad
Another version of the Adam algorithm is AMSGrad [24]. This version revises the adaptive learning rate component of Adam so that the current $\hat G$ is never smaller than its value at the previous time step:
$$\theta(k+1) = \theta(k) - \frac{\eta}{\sqrt{\hat G(k)} + \epsilon}\, g(k); \qquad (36)$$
$$\hat G(k+1) = \max\left(\hat G(k),\, G(k+1)\right); \qquad (37)$$
$$g(k+1) = \beta_1\, g(k) + (1-\beta_1)\,\nabla J(\theta(k)); \qquad (38)$$
$$G(k+1) = \beta_2\, G(k) + (1-\beta_2)\,\left(\nabla J(\theta(k))\right)^{2}; \qquad (39)$$
$$\eta = 0.001; \quad \beta_1 = 0.9; \quad \beta_2 = 0.999; \quad \epsilon = 10^{-7}.$$
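A one-step sketch of (36)-(39) follows; here the raw gradient of $J$ is passed in explicitly, and the paper's $g$ and $G$ are carried as state.

```python
import numpy as np

def amsgrad_step(theta, g, G, G_hat, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One AMSGrad update following eqs. (36)-(39); `grad` is the raw gradient of J."""
    g = beta1 * g + (1.0 - beta1) * grad        # first-moment EMA, eq. (38)
    G = beta2 * G + (1.0 - beta2) * grad ** 2   # second-moment EMA, eq. (39)
    G_hat = np.maximum(G_hat, G)                # eq. (37): the denominator never shrinks
    theta = theta - lr * g / (np.sqrt(G_hat) + eps)  # eq. (36)
    return theta, g, G, G_hat
```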
4.16 WNGrad
In the WNGrad algorithm (weight normalization Grad) [25], the learning rate is dynamically updated in accordance with the obtained gradients:
$$\theta(k+1) = \theta(k) - \frac{1}{b(k)}\,\nabla J(\theta(k)); \qquad (41)$$
$$b(k+1) = b(k) + \frac{1}{b(k)}\,\left\|\nabla J(\theta(k))\right\|^{2}. \qquad (42)$$
According to the results of numerical experiments, WNGrad is a rival to the simple stochastic gradient descent algorithm in terms of stability and generalization error when training neural networks. In [25] WNGrad modifications that use momentum (WN-Adam and WNGrad-Momentum) are proposed.
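A one-step sketch of (41)-(42) is shown below; the only state besides the parameters is the scalar $b$, which must be initialized to a positive value.

```python
import numpy as np

def wngrad_step(theta, b, grad):
    """One WNGrad update following eqs. (41)-(42); b must be initialized with b(0) > 0."""
    theta = theta - grad / b              # eq. (41): step size 1/b(k)
    b = b + np.dot(grad, grad) / b        # eq. (42): grow b with the squared gradient norm
    return theta, b
```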
4.17 Padam
Padam (Partially adaptive momentum estimation method) [26] unifies Adam/AMSGrad and SGD with momentum by a partially adaptive parameter:
$$\theta(k+1) = \theta(k) - \frac{\eta}{\hat G^{\,p}(k)}\, g(k); \qquad (43)$$
$$\hat G(k+1) = \max\left(\hat G(k),\, G(k+1)\right), \qquad (44)$$
where $p \in (0, 1/2]$ is the partially adaptive parameter (1/2 is the largest possible value of $p$; a larger $p$ would result in non-convergence in the proof). When $p \to 0$, Padam reduces to SGD with momentum, and when $p = 1/2$ it is exactly AMSGrad.
It is empirically shown in [26] that Padam achieves the highest training speed while generalizing as well as SGD. These results suggest that practitioners could adopt partially adaptive gradient methods for faster training of CNN weights.
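A one-step sketch of (43)-(44) is given below; $g(k)$ in (43) is taken to be the first-moment estimate as in AMSGrad, the small $\epsilon$ in the denominator is added for numerical safety, and the defaults (including $p = 1/8$) are only illustrative.

```python
import numpy as np

def padam_step(theta, m, G, G_hat, grad, lr=0.1, beta1=0.9, beta2=0.999, p=0.125, eps=1e-8):
    """One Padam update following eqs. (43)-(44); here m plays the role of g(k) in (43)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment EMA, as in AMSGrad
    G = beta2 * G + (1.0 - beta2) * grad ** 2   # second-moment EMA
    G_hat = np.maximum(G_hat, G)                # eq. (44)
    theta = theta - lr * m / (G_hat ** p + eps) # eq. (43): partially adaptive denominator
    return theta, m, G, G_hat
```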
4.18 AdaShift
The key difference between Adam and AdaShift (ADAptive learning rate method with temporal SHIFTing) [27] is that the latter temporally shifts the gradient $g(k)$ by $n$ steps, i.e. it uses $g(k-n)$ for calculating $v(k)$ and uses the kept-out $n$ gradients elsewhere, which makes $v(k)$ and $g(k)$ temporally shifted and hence decorrelated:
$$v(k+1) = \beta_2\, v(k) + (1-\beta_2)\,\phi\!\left(g^{2}(k-n)\right), \qquad (45)$$
where $\phi$ is a function (spatial operation). There is no restriction on the choice of $\phi$ (in [27] $\phi(x) = \max_i x(i)$).
4.19 SWATS
SWATS (Switching from Adam to SGD) [28] is a method that switches from Adam to SGD when a triggering condition is satisfied.
While the focus of [28] is on Adam, the proposed strategy is generally applicable and can be analogously employed with other adaptive methods such as Adagrad and RMSProp. A viable research direction is exploring the possibility of switching back and forth between Adam and SGD as needed.
4.20 Parallelizing SGD
A strength of SGD is that it is simple to implement and fast for problems that have many training examples. However, SGD methods also have a number of disadvantages [29]. Recently, several approaches towards an effective parallelization of SGD optimization have been proposed [30-34].
A theoretical framework for the analysis of SGD parallelization performance was presented in [30]. In [31] an update scheme called Hogwild was introduced that allows performing SGD updates in parallel on CPUs. In [32] an algorithm called weighted parallel SGD (WP-SGD) was proposed. Other methods of parallelizing SGD were introduced in [33, 34].
5 Modeling
We train and evaluate our CNN model on the Oxford 17 Category Flower dataset [35]. It contains 17 categories of common flowers in the UK with 80 images for each class. Some of the pictures from the dataset are presented in Figure 1.
The proposed CNN-based application is implemented using TensorFlow [36] and trained on an Nvidia GeForce 2080 GPU. The performance and accuracy of different first-order optimization algorithms are shown in Figure 2 and Figure 3. From the presented results it can be seen that the considered training methods perform differently. AdaMax, Adagrad and SGD converge faster than the other methods. The performance of Adadelta is also acceptable.
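The paper does not publish its experiment code, so the fragment below is only a sketch of how such an optimizer comparison can be organized in TensorFlow/Keras. The helpers load_oxford17() and build_cnn() are hypothetical placeholders for the data pipeline of [35] and the network of Section 3, and the training settings are assumptions.

```python
import tensorflow as tf

# Hypothetical helpers, not part of the paper: load_oxford17() is assumed to
# return image arrays with integer class labels; build_cnn() builds the model.
train_images, train_labels = load_oxford17()

optimizers = {
    'SGD': tf.keras.optimizers.SGD(learning_rate=0.01),
    'Adagrad': tf.keras.optimizers.Adagrad(),
    'Adadelta': tf.keras.optimizers.Adadelta(),
    'Adamax': tf.keras.optimizers.Adamax(),
    'Adam': tf.keras.optimizers.Adam(),
}

histories = {}
for name, opt in optimizers.items():
    model = build_cnn()                                      # fresh model per optimizer
    model.compile(optimizer=opt,
                  loss='sparse_categorical_crossentropy',    # assumes integer labels
                  metrics=['accuracy'])
    histories[name] = model.fit(train_images, train_labels,
                                epochs=20, batch_size=32, verbose=0)
```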
6 Conclusion
A comparative analysis of gradient-based learning algorithms for convolutional neural networks in solving a visual recognition problem was carried out in this paper. Studies show that a simple gradient descent algorithm is quite effective for this task. The use of momentum in the considered modifications led to some improvement in the recognition process, but it also increased the computational cost. For the considered problems the most effective algorithm is AdaMax. In [24] it is recommended to always start with the Adam optimizer, regardless of the architecture of the neural network and the problem area in which it is used. However, in our opinion, the AdaMax algorithm should be used for recognition problems. At the same time, the problem of selecting the optimal values of the algorithms' parameters that provide the highest learning speed still remains open.
7 Acknowledgement
This project has been funded with support from the European Commission. This publication reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.
Fig. 1. Example images from the Oxford flower dataset
Fig. 2. Performance of different first-order optimization algorithms
Fig. 3. Accuracy of different first-order optimization algorithms
References
1. Goodfellow, I., Bengio, Y. and Courville, A.: Deep learning. MIT Press (2016).
2. Krizhevsky, A., Sutskever, I. and Hinton, G.: ImageNet classification with deep convolutional neural networks. In NIPS (2012).
3. Long, J., Shelhamer, E. and Darrell, T.: Fully convolutional networks for semantic seg-
mentation. In CVPR (2015).
4. Girshick, R.: Fast R-CNN. In ICCV (2015).
5. Ren, S., He, K., Girshick, R. and Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. PAMI, 39:1137-1149 (2016).
6. Li, P.: Optimization Algorithms for Deep Learning, http://lipiji.com/docs/li2017optdl.pdf
7. Tan, H.H. and Lim, K.H.: Review of second-order optimization techniques in artificial
neural networks backpropagation. IOP Conf. Series: Materials Science and Engineering
495 (2019).
8. Ruder, S.: An overview of gradient descent optimization Algorithms,
https://arxiv.org/abs/1609.04747
9. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat., 22(3), pp.
400-407 (1951).
10. Wasan, M.: Stochastic Approximation. Cambridge University Press (1969).
11. Oppermann, A.A.: Optimization Algorithms in Deep Learning, https://www.deeplearning-
academy.com/p/ai-wiki-optimization-algorithms
12. Schmidt, M., Le Roux, N. and Bach, F.: Minimizing finite sums with the stochastic aver-
age gradient. Technical report, INRIA, hal-0086005 (2013).
13. Johnson, R. and Zhang, T.: Accelerating stochastic gradient descent using predictive vari-
ance reduction. In Advances in Neural Information Processing Systems, pp. 315–323
(2013).
14. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives, https://arxiv.org/abs/1407.0202v3
15. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1-17 (1964).
16. Nesterov, Y.: A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27, pp. 372-376 (1983).
17. Bengio, Y., Boulanger-Lewandowski, N., Pascanu, R.: Advances in optimizing recurrent
networks. Proc. ICASSP, pp. 8624-8626, (2013).
18. Duchi, J., Hazan, E., Singer, Y.: Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. of Machine Learning Research, 12, pp. 2121-2159 (2011).
19. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks Mach. Learn., 4, 26-31 (2012).
20. Zeiler, M.D.: ADADELTA: An Adaptive Learning Rate Method, http://arxiv.org/abs/1212.5701 (2012).
21. Kingma, D.P., Ba, J.L.: Adam: a Method for Stochastic Optimization. 3rd Int. Conf. Learning Representations, ICLR 2015, San Diego, CA, USA, pp. 1-13, May 7-9, 2015.
22. Dozat, T.: Incorporating Nesterov Momentum into Adam. Technical report, Stanford University (2015).
23. Jiang, Z., Balu, A., Tan, S.Y., Lee, Y.M., Sarkar, C.S.: On Higher-order Moments in Adam, https://arxiv.org/abs/1910.06878v1
24. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. Int. Conf. Learn-
ing Representations (ICLR), Vancouver, Canada, 23 p., Apr -May 2018.
25. Wu, X., Ward, R., Bottou, L.: WNGrad: Learn the Learning Rate in Gradient Descent, https://arxiv.org/abs/1803.02865v1
26. Chen, J., Gu, Q.: Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, https://arxiv.org/abs/1806.06763v1
27. Zhou, Z., Zhang, Q., Lu, G., Wang, H., Zhang, W., Yu, Y.: AdaShift: decorrelation and convergence of adaptive learning rate methods, https://arxiv.org/abs/1810.00143v4
28. Keskar, N.S., Socher, R.: Improving Generalization Performance by Switching from Adam to SGD, https://arxiv.org/abs/1712.07628v1
29. Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Le, Q.V. and Ng, A.Y.: On optimization
methods for deep learning. In Proc. of the 28th Int. Conf. on Machine Learning (ICML-
11), pp. 265–272 (2011).
30. Zinkevich, M., Weimer, M., Smola, A.J., Li, L.: Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems 23, pp. 2595-2603 (2010).
31. Recht, B., Re, C., Wright, S. and Niu, F.: Hogwild: A lock-free approach to parallelizing
stochastic gradient descent. In Advances in Neural Information Processing Systems, pages
693–701 (2011).
32. Cheng, D., Li, S., Zhang, Y.: Weighted parallel SGD for distributed unbalanced-workload training system, https://arxiv.org/abs/1708.04801v1
33. Keuper, J. and Pfreundt, F.-J.: Asynchronous Parallel Stochastic Gradient Descent: A Numeric Core for Scalable Distributed Machine Learning Algorithms, https://arxiv.org/abs/1802.09941v2
34. Chu, C.T., Kim, S.K., Lin, Y.A., Yu, Y., Bradski, G.R., Ng, A.Y. and Olukotun, K.: Map-reduce for machine learning on multicore. In Proc. of Neural Information Processing Systems Conf. NIPS '06, MIT Press, pp. 281-288 (2006).
35. Flower Datasets, http://www.robots.ox.ac.uk/~vgg/data/flowers/
36. An end-to-end open source machine learning platform, https://www.tensorflow.org/