A block coordinate descent optimizer for classification problems exploiting convexity

Ravi G. Patel, Nathaniel A. Trask, Mamikon A. Gulian, Eric C. Cyr
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico, 87123
{rgpatel, natrask, mgulian, eccyr}@sandia.gov

Abstract

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architectures. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric positive semi-definite and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting gains of second-order methods over gradient descent. A TensorFlow implementation of the algorithm is available at github.com/rgp62/.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

A Newton/gradient coordinate descent optimizer for classification

Denote by D = {(x_i, y_i)}_{i=1}^{N_data} data/label pairs, and consider the following class of deep learning architectures:

    L(W, \xi, D) = \sum_{(x_i, y_i) \in D} L_{CE}(\,\cdot\,; y_i) \circ F_{SM} \circ F_{LL}(\,\cdot\,; W) \circ F_{HL}(x_i; \xi),    (1)

where L_CE, F_SM, F_LL, and F_HL denote a cross-entropy loss, softmax layer, linear layer, and hidden layer, respectively, and \circ is composition. We denote linear layer weights by W, and consider a general class of hidden layers (e.g. dense networks, convolutional networks, etc.), denoting associated weights and biases by the parameter ξ. The final three layers are expressed as

    L_{CE}(x; y) = -\sum_{i=1}^{N_{classes}} y_i \log x_i,
    F_{SM}^i(x) = \frac{\exp(x_i)}{\sum_{j=1}^{N_{classes}} \exp(x_j)},    (2)
    F_{LL}(x) = W x,

and map F_HL : R^{N_in} → R^{N_basis}; F_LL : R^{N_basis} → R^{N_classes}; F_SM : R^{N_classes} → R^{N_classes}; and L_CE : R^{N_classes} → R. Here, N_basis is the dimension of the output of the hidden layer; this notation is explained in the next paragraph. The standard classification problem is to solve

    (W^*, \xi^*) = \operatorname*{argmin}_{W, \xi} L(W, \xi, D).    (3)
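To make the decomposition in Eqs. (1)-(3) concrete, the following is a minimal TensorFlow sketch of the hidden network F_HL, the bias-free linear layer F_LL whose kernel plays the role of W, and the composed softmax/cross-entropy loss. It is illustrative only; the sizes and layer choices are assumptions and are not taken from the released implementation at github.com/rgp62/.

```python
import tensorflow as tf

# Minimal sketch of the decomposition in Eqs. (1)-(3): a hidden network F_HL
# with parameters xi, a bias-free linear layer F_LL whose kernel plays the role
# of W, and the composed softmax/cross-entropy loss. Sizes are illustrative.
N_in, N_basis, N_classes = 8, 16, 3

F_HL = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(N_in,)),
    tf.keras.layers.Dense(N_basis, activation="tanh"),
])
F_LL = tf.keras.layers.Dense(N_classes, use_bias=False)

def loss(x, y_onehot):
    """L(W, xi, B) over a batch: cross-entropy of the softmax of F_LL(F_HL(x))."""
    logits = F_LL(F_HL(x))              # F_LL composed with F_HL, as in Eq. (1)
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_onehot, logits=logits))
```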
The recent work by Cyr et al. (2019) performed a similar partition of weights into linear layer weights W and hidden layer weights ξ for regression problems. Two important observations were made using this decomposition. First, the output of the hidden layers can be treated as an adaptive basis, with the learned weights W corresponding to the coefficients producing the prediction. Second, holding ξ fixed leads to a linear least squares problem for the basis coefficients W that can be solved for a global minimum. This work builds on these two observations for classification problems. The output of the hidden layers F_HL defines a basis

    \Phi_\alpha(\,\cdot\,, \xi) : \mathbb{R}^{N_{in}} \to \mathbb{R} \quad \text{for } \alpha = 1, \ldots, N_{basis},    (4)

where Φ_α(x, ξ) is row α of F_HL(x, ξ). Thus the input to the softmax classification layer consists of N_classes functions, each defined using the adaptive basis Φ_α and a single row of the weight matrix W. The crux of this approach to classification is the observation that for all ξ, the function

    S(W, D) = L(W, \xi, D)    (5)

is convex with respect to W, so the global minimizer

    W^* = \operatorname*{argmin}_W S(W, D)    (6)

may be obtained via Newton iteration with line search. In the Algorithm section, we introduce a coordinate descent optimizer that alternates between a globally optimal solution of (6) and a gradient descent step minimizing (3). Combined with the interpretation of the hidden layer as providing a data-driven adaptive basis, this ensures that during training the parameters evolve along a manifold providing an optimal fit of the adaptive basis to data (Cyr et al. 2019). We summarize this perspective in the section on Relation to previous works, and in the Results section we investigate how this approach differs from stochastic gradient descent (GD), both in accuracy and in the qualitative properties of the hidden layer basis.

Convexity analysis and invertibility of the Hessian

In what follows, we use basic properties of convex functions (Boyd, Boyd, and Vandenberghe 2004) and Hölder's inequality (Folland 1999) to prove that S in (5) is convex. Recall that convexity is preserved under affine transformations. We first note that F_LL(·; W) is linear in W. By (1), it therefore remains only to show that the composition L_CE ∘ F_SM is convex. We write, for any data vector y,

    L_{CE} \circ F_{SM}(x; y) = -\sum_{i=1}^{N_{classes}} y_i \log\left( \frac{\exp(x_i)}{\sum_{j=1}^{N_{classes}} \exp(x_j)} \right)
    = -\sum_{i=1}^{N_{classes}} y_i x_i + \sum_{i=1}^{N_{classes}} y_i \log\left( \sum_{j=1}^{N_{classes}} \exp(x_j) \right)
    = -\sum_{i=1}^{N_{classes}} y_i x_i + \log\left( \sum_{j=1}^{N_{classes}} \exp(x_j) \right),

where the last equality uses the fact that the label vector y sums to one. The first term above is affine and thus convex. We prove the convexity of the second term f(x) := \log \sum_{i=1}^{N_{classes}} \exp(x_i) by writing

    f(\theta x + (1 - \theta) y) = \log \sum_{i=1}^{N_{classes}} \left( \exp(x_i) \right)^{\theta} \left( \exp(y_i) \right)^{1-\theta}.

Applying Hölder's inequality with 1/p = θ and 1/q = 1 − θ,

    f(\theta x + (1 - \theta) y) \le \log\left[ \left( \sum_{i=1}^{N_{classes}} \exp(x_i) \right)^{\theta} \left( \sum_{i=1}^{N_{classes}} \exp(y_i) \right)^{1-\theta} \right] = \theta f(x) + (1 - \theta) f(y).

Thus f, and therefore L_CE ∘ F_SM and S, are convex. As a consequence, the Hessian H of S with respect to W is symmetric positive semi-definite for every W, allowing application of a convex optimizer in the following section to realize a global minimum.
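The positive semi-definiteness of H can also be checked numerically. The self-contained NumPy sketch below assembles the Hessian of S with respect to the flattened linear-layer weights from the standard expression for the softmax cross-entropy Hessian, a sum of Kronecker products of positive semi-definite factors; the sizes and random data are arbitrary and serve only to illustrate the check.

```python
import numpy as np

# Numerical check that S(W) is convex in W for a fixed basis: the Hessian is
#   H = sum_i kron(diag(p_i) - p_i p_i^T, phi_i phi_i^T),
# a sum of Kronecker products of positive semi-definite factors.
rng = np.random.default_rng(0)
n_basis, n_classes, n_data = 6, 5, 50
Phi = rng.normal(size=(n_data, n_basis))       # rows phi_i = F_HL(x_i; xi), held fixed
W = rng.normal(size=(n_classes, n_basis))      # linear-layer weights

logits = Phi @ W.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)              # softmax probabilities p_i

H = np.zeros((n_classes * n_basis, n_classes * n_basis))
for p, phi in zip(P, Phi):
    H += np.kron(np.diag(p) - np.outer(p, p), np.outer(phi, phi))

# Symmetric, and the smallest eigenvalue is zero up to round-off, never negative.
print(np.allclose(H, H.T), np.linalg.eigvalsh(H).min())
```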
Algorithm

Application of the traditional Newton method to the problem (3) would require the solution of a dense matrix problem of size equal to the total number of parameters in the network. In contrast, we alternate between applying Newton's method to solve only for W in (6) and a single step of a gradient-based optimizer for the remaining parameters ξ; the Newton step therefore scales with the number of weights (N_basis × N_classes) in the linear layer. Since S is convex, Newton's method with appropriate backtracking or trust region may be expected to achieve a global minimizer. We pursue a simple backtracking approach, taking the step direction and size from standard Newton and repeatedly reducing the step length until the Armijo condition is satisfied, ensuring an adequate reduction of the loss (Armijo 1966; Dennis Jr and Schnabel 1996). For the gradient descent step we apply Adam (Kingma and Ba 2014), although one may apply any gradient-based optimizer; we denote such an update to the hidden layers for fixed W by the function GD(ξ, B, W). To handle large data sets, stochastic gradient descent (GD) updates parameters using gradients computed over disjoint subsets B ⊂ D (Bottou 2010). To expose the same parallelism, we apply our coordinate descent update over the same batches by solving (6) restricted to B. Note that this implies an optimal choice of W over B only. We summarize the approach in Alg. 1.

Algorithm 1: Application of the coordinate descent algorithm for classification to a single batch B ⊂ D. For the purposes of this work, we use ρ = 0.5 and α = 10^-4.

    Data: batch B ⊂ D, ξ_old, W_old, α, ρ
    Result: ξ_new, W_new
    for j ∈ {1, ..., newton_steps} do
        compute the gradient G = ∇_W S(W_old, B) and the Hessian H = ∇_W ∇_W S(W_old, B)
        solve H s = −G
        W† ← W_old + s
        λ ← 1
        while S(W†, B) > S(W_old, B) + α λ G · s do
            λ ← λ ρ
            W† ← W_old + λ s
        end
        W_old ← W†
    end
    W_new ← W_old
    ξ_new ← GD(ξ_old, B, W_new)

While H and G may be computed analytically from (2), we used automatic differentiation for ease of implementation. The system Hs = −G can be solved using either a dense or an iterative method. Having proven convexity of S in (6), and thus positive semi-definiteness of the Hessian, we may apply a conjugate gradient (CG) method. We observed that solving to a relatively tight residual resulted in overfitting during training, while running a fixed number N_cg of iterations improved validation accuracy. Thus, we treat N_cg as a hyperparameter in our studies below. We also experimented with dense solvers; due to rank deficiency we considered a regularized pseudo-inverse of the form H† = (H + εI)^{-1}, where taking a finite ε > 0 provided similar accuracy gains. We speculate that these approaches may be implicitly regularizing the training. For brevity we only present results using the iterative approach; the resulting accuracy was comparable to the dense solver. In the following section we typically use only a handful of Newton and CG iterations, so the additional cost is relatively small.
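The Newton/CG update of W in Alg. 1 can be sketched in a few lines of NumPy. The sketch below uses the analytic gradient and Hessian-vector products of the softmax cross-entropy (rather than automatic differentiation), runs a fixed number of CG iterations, and applies the Armijo backtracking line search with ρ = 0.5 and α = 10^-4; the subsequent Adam update of ξ is omitted, and all function names are illustrative rather than taken from the released code.

```python
import numpy as np

# Sketch of the Newton/CG update for W on one batch (Alg. 1). Phi holds the
# hidden-layer outputs F_HL(x_i; xi) for the batch; Y holds one-hot labels.

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss_S(W, Phi, Y):                               # S(W, B) of Eq. (5), batch mean
    P = softmax(Phi @ W.T)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def cg(hvp, g, n_iters):
    """Fixed number of CG iterations for H s = -g, starting from s = 0."""
    s, r = np.zeros_like(g), -g.copy()
    p, rs = r.copy(), r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        a = rs / (p @ Hp + 1e-12)
        s, r = s + a * p, r - a * Hp
        rs_new = r @ r
        p, rs = r + (rs_new / (rs + 1e-12)) * p, rs_new
    return s

def newton_step(W, Phi, Y, n_cg=3, alpha=1e-4, rho=0.5):
    n, (n_classes, n_basis) = Phi.shape[0], W.shape
    P = softmax(Phi @ W.T)
    G = ((P - Y).T @ Phi) / n                        # gradient of S with respect to W

    def hvp(v):                                      # Hessian-vector product H v
        V = v.reshape(n_classes, n_basis)
        U = Phi @ V.T
        A = P * U - P * np.sum(P * U, axis=1, keepdims=True)
        return ((A.T @ Phi) / n).ravel()

    s = cg(hvp, G.ravel(), n_cg).reshape(W.shape)    # approximate Newton direction
    lam, S0, Gs = 1.0, loss_S(W, Phi, Y), np.sum(G * s)
    while lam > 1e-8 and loss_S(W + lam * s, Phi, Y) > S0 + alpha * lam * Gs:
        lam *= rho                                   # Armijo backtracking
    return W + lam * s
```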
We later provide convergence studies comparing our technique to GD using the Adam optimizer and identical batching. We note that a lack of optimized software prevents a straightforward comparison of the performance of our approach against standard GD; while optimized GPU implementations are already available for GD, it is an open question how to most efficiently parallelize the current approach. For this reason we compare performance in terms of iterations, deferring wall-clock benchmarking to future work when a fair comparison is possible.

Relation to previous works

We seek an extension of Cyr et al. (2019), which used an adaptive basis perspective to motivate a block coordinate descent approach utilizing a linear least squares solver. The training strategy developed there can be found under the name of variable projection, and was used to train small networks (McLoone et al. 1998; Pereyra, Scherer, and Wong 2006). In addition to the work in Cyr et al. (2019), the perspective of neural networks producing an adaptive basis has been considered by several approximation theorists to study the accuracy of deep networks (Yarotsky 2017; Opschoor, Petersen, and Schwab 2019; Daubechies et al. 2019). The combination of the adaptive basis perspective with block coordinate descent optimization demonstrated dramatic increases in accuracy and performance in Cyr et al. (2019), but was limited to an ℓ2 loss. None of the previous works have considered the generalization of this approach to training deep neural networks with the cross-entropy loss typically used in classification, as we develop here.

Bottou, Curtis, and Nocedal (2018) provides a mathematical summary of the breadth of work on numerical optimizers used in machine learning. Several recent works have sought different means to incorporate second-order optimizers to accelerate training and avoid issues with selecting hyperparameters and training schedules (Osawa et al. 2019, 2020; Botev, Ritter, and Barber 2017; Martens 2010). Some pursue a quasi-Newton approach, defining approximate Hessians, or apply factorization to reduce the effective bandwidth of the Hessian (Botev, Ritter, and Barber 2017; Xu, Roosta, and Mahoney 2019). Our work pursues a (block) coordinate descent strategy, partitioning degrees of freedom into sub-problems amenable to more sophisticated optimization (Nesterov 2012; Wright 2015; Blondel, Seki, and Uehara 2013). Many works have successfully employed such schemes in ML contexts (e.g. Blondel, Seki, and Uehara 2013; Fu 1998; Shevade and Keerthi 2003; Clarkson, Hazan, and Woodruff 2012), but they typically rely on stochastic partitioning of variables rather than the partition of the weights of deep neural networks into hidden layer variables and their complement pursued here. The strategy of extracting convex approximations to nonlinear loss functions is classical (Bubeck 2014), and some works have attempted to minimize general loss functions by minimizing surrogate ℓ2 problems (Barratt and Boyd 2020).

Results

We study the performance and properties of the NGD algorithm as compared to standard stochastic gradient descent (GD) on several benchmark problems with various architectures. We start by applying dense network architectures to classification in the peaks problem. This allows us to plot and compare the qualitative properties of the basis functions Φ_α(·, ξ) encoded in the hidden layer (4) when trained with the two methods. We then compare the performance of NGD and GD for the standard image classification benchmarks CIFAR-10, MNIST, and Fashion MNIST using both dense and convolutional (ConvNet) architectures. Throughout this section, we compare performance in terms of iterations of Alg. 1 for NGD and iterations of stochastic gradient descent, each of which achieves a single update of the parameters (W, ξ) in the respective algorithm based on a batch B; this is the number of epochs multiplied by the number of batches.

Peaks problem

The peaks benchmark is a synthetic dataset for understanding the qualitative performance of classification algorithms (Haber and Ruthotto 2017). Here, a scattered point cloud in the two-dimensional unit square [0, 1]^2 is partitioned into disjoint sets. The classification problem is to determine which of those sets a given 2D point belongs to. The two-dimensional nature allows visualization of how NGD and GD classify data. In particular, plots of both how the nonlinear basis encoded by the hidden layer maps onto classification space and how the linear layer combines the basis functions to assign a probability map over the input space are readily obtained. We train a depth-4 dense network of the form (1) with N_in = 2, three hidden layers of width 12 contracting to a final hidden layer of width N_basis = 6, with tanh activation and batch normalization, and N_classes = 5 classes. As specified by Haber and Ruthotto (2017), 5000 training points are sampled from [0, 1]^2. The upper-left image in Figure 2 shows the sampled data points with their observed classes.

For the peaks benchmark we use a single batch containing all training points, i.e. B = D. The NGD algorithm uses 5 Newton iterations per training step with 3 CG iterations approximating the linear solve. The learning rate for Adam for both NGD and GD is 10^-4.
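In Keras form, the peaks network described above might be assembled as in the sketch below. The placement of batch normalization relative to the tanh activations, the absence of a bias in the linear layer, and the layer names are assumptions made for illustration; they are not specified in the text.

```python
import tensorflow as tf

# Sketch of the dense peaks network described above: three width-12 tanh hidden
# layers with batch normalization, a width-6 basis layer, and a bias-free linear
# layer for 5 classes. Batch-norm placement and bias choices are assumptions.
def make_peaks_model(n_in=2, n_basis=6, n_classes=5):
    layers = [tf.keras.layers.InputLayer(input_shape=(n_in,))]
    for _ in range(3):
        layers.append(tf.keras.layers.Dense(12, activation="tanh"))
        layers.append(tf.keras.layers.BatchNormalization())
    layers.append(tf.keras.layers.Dense(n_basis, activation="tanh"))        # basis Phi_alpha
    hidden = tf.keras.Sequential(layers, name="F_HL")                       # parameters xi
    linear = tf.keras.layers.Dense(n_classes, use_bias=False, name="F_LL")  # weights W
    return tf.keras.Sequential([hidden, linear])
```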
Figure 1 demonstrates that for an identical architecture, NGD provides a rapid increase in both training and validation accuracy over GD after a few iterations. For a large number of iterations both approaches achieve comparable training accuracy, although NGD generalizes better to the validation set. The improvement in validation accuracy is borne out in Figure 2, which compares representative instances of training using GD and NGD. While a single instance is shown, the character of these results is consistent with other neural networks trained for the peaks problem in the same way. The top row illustrates the predicted classes argmax[F_SM(x)] ∈ {0, 1, 2, 3, 4} for x ∈ [0, 1]^2 and the training data, demonstrating that the NGD-trained network predicts the class i = 2 of lowest training point density more accurately than the GD-trained network. The remaining sets of images visualize both the classification probability map [F_SM(x)]_i for i ∈ {0, 1, 2, 3, 4} (middle row) and the six basis functions Φ_α(x, ξ) (bottom row) learned by each optimizer. The difference in the learned bases is striking. GD learns a basis that is nearly discontinuous, in which the support of each basis function appears fit to the class boundaries. On the other hand, NGD learns a far smoother basis that can be combined to give sharper class boundary predictions. This manifests in the resulting probability map assigned to each class: linear combinations of the rougher GD basis result in imprinting and assignment of probability far from the relevant class. This serves as an explanation of the improved validation accuracy of NGD as compared to GD despite similar final training accuracy. The NGD algorithm separates refinement of the basis from the determination of the coefficients. This provides an effective regularization of the final basis, leading to improved generalization.

Figure 1: The training (left) and validation (right) accuracy for the peaks problem for both gradient descent (GD) and the Newton/gradient descent (NGD) algorithm. The solid lines represent the mean of 16 independent runs, and the shaded areas represent the mean ± one standard deviation.

Figure 2: Results for the peaks benchmark, with comparison between NGD and GD on an identical architecture. In this example, GD obtained a training accuracy of 99.3% and validation accuracy of 96.2%, while NGD obtained a training accuracy of 99.6% and validation accuracy of 98.0%. Top: Training data (left), classification by GD (center), and classification by NGD (right). GD misclassifies large portions of the input space. Middle: The linear and softmax layers combine basis functions to assign classification probabilities to each class. The sharp basis learned by GD leads to artifacts and attribution of probability far from the sets (left), while the diffuse basis in NGD provides a sharp characterization of class boundaries (right). Bottom: Adaptive basis encoded by the hidden layer, as learnt by GD (left) and NGD (right). For GD the basis is sharp, and individual basis functions conform to classification boundaries, while NGD learns a more regular basis.
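The basis and probability maps in Figure 2 amount to evaluating F_HL and the full network on a grid over [0, 1]^2. The sketch below shows one way to do this with matplotlib, assuming a trained TF2/Keras model structured as in the earlier peaks sketch (so that model.layers unpacks into the hidden network and the linear layer); it is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Render the basis functions Phi_alpha and the class probability maps of
# Figure 2 by evaluating F_HL and the full model on a grid over [0, 1]^2.
# `model` is assumed to be built as in the peaks sketch above.
def plot_basis_and_probabilities(model, n_grid=200):
    xs = np.linspace(0.0, 1.0, n_grid)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2).astype("float32")

    F_HL, F_LL = model.layers
    basis = F_HL(grid, training=False).numpy()        # Phi_alpha(x, xi), one column per alpha
    logits = F_LL(basis).numpy()
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)          # softmax probability maps

    for name, fields in (("basis", basis), ("class probability", probs)):
        _, axes = plt.subplots(1, fields.shape[1], figsize=(3 * fields.shape[1], 3))
        for k, ax in enumerate(np.atleast_1d(axes)):
            ax.imshow(fields[:, k].reshape(n_grid, n_grid),
                      origin="lower", extent=(0, 1, 0, 1))
            ax.set_title(f"{name} {k}")
    plt.show()
```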
Image recognition benchmarks

We consider in this section a collection of image classification benchmarks: MNIST (Deng 2012; Grother 1995), Fashion MNIST (Xiao, Rasul, and Vollgraf 2017), and CIFAR-10 (Krizhevsky, Hinton et al. 2009). We focus primarily on CIFAR-10 due to its increased difficulty; it is well known that one may obtain near-perfect accuracy on the MNIST benchmark without a sophisticated choice of architecture. For all problems, we consider a simple dense network architecture to highlight the role of the optimizer, and for CIFAR-10 we also utilize a convolutional architecture (ConvNet). This highlights that our approach applies to general hidden layer architectures. Our objective is to demonstrate improvements in accuracy due to optimization with all other aspects held equal. For CIFAR-10, for example, the state of the art requires application of techniques such as data augmentation and complicated architectures to realize good accuracy; for simplicity we do not consider such complications, to allow a simple comparison. The code for this study is provided at github.com/rgp62/.

For all results reported in this section, we first optimize the hyperparameters listed in Table 1 by maximizing the validation accuracy over the training run. We perform this search using the Gaussian process optimization tool in the scikit-optimize package with default options (Head et al. 2018). This process is performed for both GD and NGD to allow a fair comparison; a sketch of this search loop is given below Table 1. The ranges for the search are shown in Table 1, along with the optimal hyperparameters for each dataset examined in this study. For all problems we partition the data into training, validation, and test sets to ensure that hyperparameter optimization is not overfitting. For MNIST and Fashion MNIST we consider a 50K/10K/10K partition, while for CIFAR-10 we consider a 40K/10K/10K partition. All training is performed with a batch size of 1000 over 100 epochs. For all results the test accuracy falls within the one-standard-deviation error bars included in Figure 3.

Table 1: Hyperparameters varied in the study, the ranges considered, and the optimal values found for MNIST, Fashion MNIST, CIFAR-10, and CIFAR-10 with the ConvNet architecture. For the learning rate and the Adam decay parameters, the optimal values are shown as NGD / GD. The optimal CG and Newton iterations are only applicable to NGD.

    Hyperparameter            Range             MNIST                  Fashion MNIST          CIFAR-10               CIFAR-10 ConvNet
    Learning rate             [10^-8, 10^-2]    10^-2.81 / 10^-2.26    10^-3.33 / 10^-2.30    10^-3.57 / 10^-2.50    10^-2.66 / 10^-2.30
    Adam decay parameter 1    [0.5, 1]          0.537 / 0.630          0.756 / 0.657          0.629 / 0.891          0.755 / 0.657
    Adam decay parameter 2    [0.5, 1]          0.830 / 0.616          0.808 / 0.976          0.782 / 0.808          0.858 / 0.976
    CG iterations             [1, 10]           3                      1                      2                      2
    Newton iterations         [1, 10]           6                      5                      4                      7
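The hyperparameter search described above can be driven by scikit-optimize's gp_minimize, as in the sketch below. Here `train_and_validate` is a hypothetical placeholder for a routine that trains with the given hyperparameters and returns the best validation accuracy over the run; the mapping of the "Adam decay parameters" to Adam's beta_1/beta_2 and the number of search calls are assumptions, while the search space mirrors the ranges in Table 1.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Sketch of the Gaussian process hyperparameter search over the ranges of Table 1.
space = [
    Real(1e-8, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(0.5, 1.0, name="adam_beta_1"),
    Real(0.5, 1.0, name="adam_beta_2"),
    Integer(1, 10, name="cg_iterations"),
    Integer(1, 10, name="newton_iterations"),
]

def objective(params):
    lr, beta1, beta2, n_cg, n_newton = params
    val_acc = train_and_validate(lr, beta1, beta2, n_cg, n_newton)  # user-supplied routine
    return -val_acc                     # gp_minimize minimizes, so negate the accuracy

result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best hyperparameters found:", result.x)
```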
Figure 3 shows the training and validation accuracies using the optimal hyperparameters for a dense architecture with two hidden layers of width 128 and 10 and ReLU activation functions. We find that for all datasets NGD reaches its maximum validation accuracy more quickly than GD, while both optimizers achieve a similar final validation accuracy. For the more difficult CIFAR-10 benchmark, NGD attains the maximum validation accuracy of GD in roughly one-quarter of the number of iterations. In Figure 3, we also use the CIFAR-10 dataset to compare the dense architecture to the following ConvNet architecture:

    Convolution (8 channels, 3x3 kernel) → Max Pooling (2x2 window) → Convolution (16 channels, 3x3 kernel) → Max Pooling (2x2 window) → Convolution (16 channels, 3x3 kernel) → Dense (width 64) → Dense (width 10),

where the convolution and dense layers use the ReLU activation function (a code sketch of this architecture is given below). Again, NGD attains the maximum validation accuracy of GD in one-quarter the number of iterations, and also leads to an improvement of 1.76% in final test accuracy. This illustrates that NGD accelerates training and can improve accuracy for a variety of architectures.

Figure 3: Training accuracy (top row) and validation accuracy (bottom row) for the CIFAR-10, Fashion MNIST, and MNIST datasets using the dense architecture, and for CIFAR-10 using the ConvNet architecture. Mean and standard deviation over 10 training runs are shown.
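For concreteness, the ConvNet listed above could be written in Keras as in the following sketch. Reading the final width-10 dense layer as the linear layer F_LL (and therefore leaving it without an activation), as well as the padding, bias, and flattening choices, are assumptions not stated in the text.

```python
import tensorflow as tf

# Keras sketch of the ConvNet listed above for CIFAR-10 inputs.
convnet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),   # last hidden layer of F_HL
    tf.keras.layers.Dense(10, use_bias=False),      # linear layer F_LL, weights W
])
```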
Conclusion

The NGD method, motivated by the adaptive basis interpretation of deep neural networks, is a block coordinate descent method for classification problems. This method separates the weights of the linear layer from the weights of the preceding nonlinear layers. NGD uses this decomposition to exploit the convexity of the cross-entropy loss with respect to the linear layer variables. It utilizes a Newton method to approximate the global minimum for a given batch of data before performing a step of gradient descent for the remaining variables. As such, it is a hybrid first/second-order optimizer which extracts significant performance from a second-order substep that scales only with the number of weights in the linear layer, making it an appealing and feasible application of second-order methods for training very deep neural networks. Applying this optimizer to dense and convolutional networks, we have demonstrated acceleration with respect to the number of epochs in the validation loss for the peaks, MNIST, Fashion MNIST, and CIFAR-10 benchmarks, with improvements in accuracy for the peaks benchmark and for the CIFAR-10 benchmark using a convolutional network.

Examining the basis functions encoded in the hidden layer of the network in the peaks benchmark revealed a significant qualitative difference between NGD and stochastic gradient descent in the exploration of the parameter space corresponding to the hidden layer variables. This, and the role of the tolerance in the Newton step as an implicit regularizer, merit further study.

The difference in the regularity of the learned basis and probability classes suggests that one may obtain a qualitatively different model by varying only the optimization scheme used. We hypothesize that this more regular basis may have desirable robustness properties which may affect the resulting model sensitivity. This could have applications toward training networks to be more robust against adversarial attacks.

Acknowledgements

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND Number: SAND2021-3013 C.

The work of R. Patel, N. Trask, and M. Gulian is supported by the U.S. Department of Energy, Office of Advanced Scientific Computing Research under the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs) project. E. C. Cyr is supported by the Department of Energy early career program. M. Gulian is supported by the John von Neumann fellowship at Sandia National Laboratories.
References

Armijo, L. 1966. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics 16(1): 1–3.

Barratt, S. T.; and Boyd, S. P. 2020. Least squares auto-tuning. Engineering Optimization 1–22.

Blondel, M.; Seki, K.; and Uehara, K. 2013. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning 93(1): 31–52.

Botev, A.; Ritter, H.; and Barber, D. 2017. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 557–565. JMLR.org.

Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer.

Bottou, L.; Curtis, F. E.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2): 223–311.

Boyd, S.; Boyd, S. P.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Bubeck, S. 2014. Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980.

Clarkson, K. L.; Hazan, E.; and Woodruff, D. P. 2012. Sublinear optimization for machine learning. Journal of the ACM (JACM) 59(5): 1–49.

Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2019. Robust training and initialization of deep neural networks: An adaptive basis viewpoint. arXiv preprint arXiv:1912.04862.

Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; and Petrova, G. 2019. Nonlinear approximation and (deep) ReLU networks. arXiv preprint arXiv:1905.02199.

Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6): 141–142.

Dennis Jr, J. E.; and Schnabel, R. B. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations, volume 16. SIAM.

Folland, G. B. 1999. Real Analysis: Modern Techniques and Their Applications, volume 40. John Wiley & Sons.

Fu, W. J. 1998. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics 7(3): 397–416.

Grother, P. J. 1995. NIST special database 19 handprinted forms and characters database. National Institute of Standards and Technology.

Haber, E.; and Ruthotto, L. 2017. Stable architectures for deep neural networks. Inverse Problems 34(1): 014004.

Head, T.; MechCoder; Louppe, G.; Shcherbatyi, I.; fcharras; Vinícius, Z.; cmmalone; Schröder, C.; nel215; Campos, N.; et al. 2018. scikit-optimize/scikit-optimize: v0.5.1 - re-release. doi:10.5281/zenodo.1170575.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.

Martens, J. 2010. Deep learning via Hessian-free optimization. In ICML, volume 27, 735–742.

McLoone, S.; Brown, M. D.; Irwin, G.; and Lightbody, A. 1998. A hybrid linear/nonlinear training algorithm for feedforward neural networks. IEEE Transactions on Neural Networks 9(4): 669–684.

Nesterov, Y. 2012. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2): 341–362.

Opschoor, J. A.; Petersen, P.; and Schwab, C. 2019. Deep ReLU networks and high-order finite element methods. SAM, ETH Zürich.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Yokota, R.; and Matsuoka, S. 2019. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12359–12367.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Foo, C.-S.; and Yokota, R. 2020. Scalable and practical natural gradient for large-scale deep learning. arXiv preprint arXiv:2002.06015.

Pereyra, V.; Scherer, G.; and Wong, F. 2006. Variable projections neural network training. Mathematics and Computers in Simulation 73(1-4): 231–243.

Shevade, S. K.; and Keerthi, S. S. 2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17): 2246–2253.

Wright, S. J. 2015. Coordinate descent algorithms. Mathematical Programming 151(1): 3–34.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Xu, P.; Roosta, F.; and Mahoney, M. W. 2019. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming 1–36.

Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks 94: 103–114.