A block coordinate descent optimizer for classification problems exploiting convexity

Ravi G. Patel, Nathaniel A. Trask, Mamikon A. Gulian, Eric C. Cyr
Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico, 87123
{rgpatel, natrask, mgulian, eccyr}@sandia.gov

Abstract

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architectures. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric positive semi-definite and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting gains of second-order methods over gradient descent. A TensorFlow implementation of the algorithm is available at github.com/rgp62/.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

A Newton/gradient coordinate descent optimizer for classification

Denote by D = {(x_i, y_i)}_{i=1}^{N_data} data/label pairs, and consider the following class of deep learning architectures:

    L(W, \xi, D) = \sum_{(x_i, y_i) \in D} L_{CE}(\,\cdot\,; y_i) \circ F_{SM} \circ F_{LL}(\,\cdot\,; W) \circ F_{HL}(x_i; \xi),    (1)

where L_CE, F_SM, F_LL, and F_HL denote a cross-entropy loss, softmax layer, linear layer, and hidden layer, respectively, and \circ is composition. We denote linear layer weights by W, and consider a general class of hidden layers (e.g. dense networks, convolutional networks, etc.), denoting associated weights and biases by the parameter ξ. The final three layers are expressed as

    L_{CE}(x; y) = -\sum_{i=1}^{N_{classes}} y_i \log x_i,
    F_{SM}^i(x) = \frac{\exp(x_i)}{\sum_{j=1}^{N_{classes}} \exp(x_j)},    (2)
    F_{LL}(x) = W x,

and map F_HL : R^{N_in} → R^{N_basis}; F_LL : R^{N_basis} → R^{N_classes}; F_SM : R^{N_classes} → R^{N_classes}; and L_CE : R^{N_classes} → R. Here, N_basis is the dimension of the output of the hidden layer; this notation is explained in the next paragraph. The standard classification problem is to solve

    (W^*, \xi^*) = \operatorname*{argmin}_{W, \xi} L(W, \xi, D).    (3)
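To make the decomposition in Eqs. (1)-(3) concrete, the following is a minimal TensorFlow sketch of the hidden network F_HL, the bias-free linear layer F_LL whose kernel plays the role of W, and the composed softmax/cross-entropy loss. It is illustrative only; the sizes and layer choices are assumptions and are not taken from the released implementation at github.com/rgp62/.

```python
import tensorflow as tf

# Minimal sketch of the decomposition in Eqs. (1)-(3): a hidden network F_HL
# with parameters xi, a bias-free linear layer F_LL whose kernel plays the role
# of W, and the composed softmax/cross-entropy loss. Sizes are illustrative.
N_in, N_basis, N_classes = 8, 16, 3

F_HL = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="tanh", input_shape=(N_in,)),
    tf.keras.layers.Dense(N_basis, activation="tanh"),
])
F_LL = tf.keras.layers.Dense(N_classes, use_bias=False)

def loss(x, y_onehot):
    """L(W, xi, B) over a batch: cross-entropy of the softmax of F_LL(F_HL(x))."""
    logits = F_LL(F_HL(x))              # F_LL composed with F_HL, as in Eq. (1)
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_onehot, logits=logits))
```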
The recent work by Cyr et al. (2019) performed a similar partition of weights into linear layer weights W and hidden layer weights ξ for regression problems. Two important observations were made using this decomposition. First, the output of the hidden layers can be treated as an adaptive basis, with the learned weights W corresponding to the coefficients producing the prediction. Second, holding ξ fixed leads to a linear least squares problem for the basis coefficients W that can be solved for a global minimum. This work builds on these two observations for classification problems. The output of the hidden layers F_HL defines a basis

    \Phi_\alpha(\,\cdot\,, \xi) : \mathbb{R}^{N_{in}} \to \mathbb{R} \quad \text{for } \alpha = 1, \ldots, N_{basis},    (4)

where Φ_α(x, ξ) is row α of F_HL(x, ξ). Thus the input to the softmax classification layer consists of N_classes functions, each defined using the adaptive basis Φ_α and a single row of the weight matrix W. The crux of this approach to classification is the observation that for all ξ, the function

    S(W, D) = L(W, \xi, D)    (5)

is convex with respect to W, so the global minimizer

    W^* = \operatorname*{argmin}_W S(W, D)    (6)

may be obtained via Newton iteration with line search. In the Algorithm section, we introduce a coordinate descent optimizer that alternates between a globally optimal solution of (6) and a gradient descent step minimizing (3). Combined with the interpretation of the hidden layer as providing a data-driven adaptive basis, this ensures that during training the parameters evolve along a manifold providing an optimal fit of the adaptive basis to data (Cyr et al. 2019). We summarize this perspective in the section on Relation to previous works, and in the Results section we investigate how this approach differs from stochastic gradient descent (GD), both in accuracy and in the qualitative properties of the hidden layer basis.

Convexity analysis and invertibility of the Hessian

In what follows, we use basic properties of convex functions (Boyd, Boyd, and Vandenberghe 2004) and Hölder's inequality (Folland 1999) to prove that S in (5) is convex. Recall that convexity is preserved under affine transformations. We first note that F_LL(·; W) is linear in W. By (1), it therefore remains only to show that the composition L_CE ∘ F_SM is convex. We write, for any data vector y,

    L_{CE} \circ F_{SM}(x; y) = -\sum_{i=1}^{N_{classes}} y_i \log\left( \frac{\exp(x_i)}{\sum_{j=1}^{N_{classes}} \exp(x_j)} \right)
    = -\sum_{i=1}^{N_{classes}} y_i x_i + \sum_{i=1}^{N_{classes}} y_i \log\left( \sum_{j=1}^{N_{classes}} \exp(x_j) \right)
    = -\sum_{i=1}^{N_{classes}} y_i x_i + \log\left( \sum_{j=1}^{N_{classes}} \exp(x_j) \right),

where the last equality uses the fact that the label vector y sums to one. The first term above is affine and thus convex. We prove the convexity of the second term f(x) := \log \sum_{i=1}^{N_{classes}} \exp(x_i) by writing

    f(\theta x + (1 - \theta) y) = \log \sum_{i=1}^{N_{classes}} \left( \exp(x_i) \right)^{\theta} \left( \exp(y_i) \right)^{1-\theta}.

Applying Hölder's inequality with 1/p = θ and 1/q = 1 − θ,

    f(\theta x + (1 - \theta) y) \le \log\left[ \left( \sum_{i=1}^{N_{classes}} \exp(x_i) \right)^{\theta} \left( \sum_{i=1}^{N_{classes}} \exp(y_i) \right)^{1-\theta} \right] = \theta f(x) + (1 - \theta) f(y).

Thus f, and therefore L_CE ∘ F_SM and S, are convex. As a consequence, the Hessian H of S with respect to W is symmetric positive semi-definite for every W, allowing application of a convex optimizer in the following section to realize a global minimum.
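The positive semi-definiteness of H can also be checked numerically. The self-contained NumPy sketch below assembles the Hessian of S with respect to the flattened linear-layer weights from the standard expression for the softmax cross-entropy Hessian, a sum of Kronecker products of positive semi-definite factors; the sizes and random data are arbitrary and serve only to illustrate the check.

```python
import numpy as np

# Numerical check that S(W) is convex in W for a fixed basis: the Hessian is
#   H = sum_i kron(diag(p_i) - p_i p_i^T, phi_i phi_i^T),
# a sum of Kronecker products of positive semi-definite factors.
rng = np.random.default_rng(0)
n_basis, n_classes, n_data = 6, 5, 50
Phi = rng.normal(size=(n_data, n_basis))       # rows phi_i = F_HL(x_i; xi), held fixed
W = rng.normal(size=(n_classes, n_basis))      # linear-layer weights

logits = Phi @ W.T
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)              # softmax probabilities p_i

H = np.zeros((n_classes * n_basis, n_classes * n_basis))
for p, phi in zip(P, Phi):
    H += np.kron(np.diag(p) - np.outer(p, p), np.outer(phi, phi))

# Symmetric, and the smallest eigenvalue is zero up to round-off, never negative.
print(np.allclose(H, H.T), np.linalg.eigvalsh(H).min())
```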
Algorithm

Application of the traditional Newton method to the problem (3) would require the solution of a dense matrix problem of size equal to the total number of parameters in the network. In contrast, we alternate between applying Newton's method to solve only for W in (6) and a single step of a gradient-based optimizer for the remaining parameters ξ; the Newton step therefore scales with the number of weights (N_basis × N_classes) in the linear layer. Since S is convex, Newton's method with appropriate backtracking or trust region may be expected to achieve a global minimizer. We pursue a simple backtracking approach, taking the step direction and size from standard Newton and repeatedly reducing the step length until the Armijo condition is satisfied, ensuring an adequate reduction of the loss (Armijo 1966; Dennis Jr and Schnabel 1996). For the gradient descent step we apply Adam (Kingma and Ba 2014), although one may apply any gradient-based optimizer; we denote such an update to the hidden layers for fixed W by the function GD(ξ, B, W). To handle large data sets, stochastic gradient descent (GD) updates parameters using gradients computed over disjoint subsets B ⊂ D (Bottou 2010). To expose the same parallelism, we apply our coordinate descent update over the same batches by solving (6) restricted to B. Note that this implies an optimal choice of W over B only. We summarize the approach in Alg. 1.

Algorithm 1: Application of the coordinate descent algorithm for classification to a single batch B ⊂ D. For the purposes of this work, we use ρ = 0.5 and α = 10^-4.

    Data: batch B ⊂ D, ξ_old, W_old, α, ρ
    Result: ξ_new, W_new
    for j ∈ {1, ..., newton_steps} do
        compute the gradient G = ∇_W S(W_old, B) and the Hessian H = ∇_W ∇_W S(W_old, B)
        solve H s = −G
        W† ← W_old + s
        λ ← 1
        while S(W†, B) > S(W_old, B) + α λ G · s do
            λ ← λ ρ
            W† ← W_old + λ s
        end
        W_old ← W†
    end
    W_new ← W_old
    ξ_new ← GD(ξ_old, B, W_new)

While H and G may be computed analytically from (2), we used automatic differentiation for ease of implementation. The system Hs = −G can be solved using either a dense or an iterative method. Having proven convexity of S in (6), and thus positive semi-definiteness of the Hessian, we may apply a conjugate gradient (CG) method. We observed that solving to a relatively tight residual resulted in overfitting during training, while running a fixed number N_cg of iterations improved validation accuracy. Thus, we treat N_cg as a hyperparameter in our studies below. We also experimented with dense solvers; due to rank deficiency we considered a regularized pseudo-inverse of the form H† = (H + εI)^{-1}, where taking a finite ε > 0 provided similar accuracy gains. We speculate that these approaches may be implicitly regularizing the training. For brevity we only present results using the iterative approach; the resulting accuracy was comparable to the dense solver. In the following section we typically use only a handful of Newton and CG iterations, so the additional cost is relatively small.
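The Newton/CG update of W in Alg. 1 can be sketched in a few lines of NumPy. The sketch below uses the analytic gradient and Hessian-vector products of the softmax cross-entropy (rather than automatic differentiation), runs a fixed number of CG iterations, and applies the Armijo backtracking line search with ρ = 0.5 and α = 10^-4; the subsequent Adam update of ξ is omitted, and all function names are illustrative rather than taken from the released code.

```python
import numpy as np

# Sketch of the Newton/CG update for W on one batch (Alg. 1). Phi holds the
# hidden-layer outputs F_HL(x_i; xi) for the batch; Y holds one-hot labels.

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def loss_S(W, Phi, Y):                               # S(W, B) of Eq. (5), batch mean
    P = softmax(Phi @ W.T)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def cg(hvp, g, n_iters):
    """Fixed number of CG iterations for H s = -g, starting from s = 0."""
    s, r = np.zeros_like(g), -g.copy()
    p, rs = r.copy(), r @ r
    for _ in range(n_iters):
        Hp = hvp(p)
        a = rs / (p @ Hp + 1e-12)
        s, r = s + a * p, r - a * Hp
        rs_new = r @ r
        p, rs = r + (rs_new / (rs + 1e-12)) * p, rs_new
    return s

def newton_step(W, Phi, Y, n_cg=3, alpha=1e-4, rho=0.5):
    n, (n_classes, n_basis) = Phi.shape[0], W.shape
    P = softmax(Phi @ W.T)
    G = ((P - Y).T @ Phi) / n                        # gradient of S with respect to W

    def hvp(v):                                      # Hessian-vector product H v
        V = v.reshape(n_classes, n_basis)
        U = Phi @ V.T
        A = P * U - P * np.sum(P * U, axis=1, keepdims=True)
        return ((A.T @ Phi) / n).ravel()

    s = cg(hvp, G.ravel(), n_cg).reshape(W.shape)    # approximate Newton direction
    lam, S0, Gs = 1.0, loss_S(W, Phi, Y), np.sum(G * s)
    while lam > 1e-8 and loss_S(W + lam * s, Phi, Y) > S0 + alpha * lam * Gs:
        lam *= rho                                   # Armijo backtracking
    return W + lam * s
```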
We later provide convergence studies comparing our technique to GD using the Adam optimizer and identical batching. We note that a lack of optimized software prevents a straightforward comparison of the performance of our approach against standard GD; while optimized GPU implementations are already available for GD, it is an open question how to most efficiently parallelize the current approach. For this reason we compare performance in terms of iterations, deferring wall-clock benchmarking to future work when a fair comparison is possible.

Relation to previous works

We seek an extension of Cyr et al. (2019), which used an adaptive basis perspective to motivate a block coordinate descent approach utilizing a linear least squares solver. The training strategy developed there can be found under the name of variable projection, and was used to train small networks (McLoone et al. 1998; Pereyra, Scherer, and Wong 2006). In addition to the work in Cyr et al. (2019), the perspective of neural networks producing an adaptive basis has been considered by several approximation theorists to study the accuracy of deep networks (Yarotsky 2017; Opschoor, Petersen, and Schwab 2019; Daubechies et al. 2019). The combination of the adaptive basis perspective with block coordinate descent optimization demonstrated dramatic increases in accuracy and performance in Cyr et al. (2019), but was limited to an ℓ2 loss. None of the previous works have considered the generalization of this approach to training deep neural networks with the cross-entropy loss typically used in classification, as we develop here.

Bottou, Curtis, and Nocedal (2018) provides a mathematical summary of the breadth of work on numerical optimizers used in machine learning. Several recent works have sought different means to incorporate second-order optimizers to accelerate training and avoid issues with selecting hyperparameters and training schedules (Osawa et al. 2019, 2020; Botev, Ritter, and Barber 2017; Martens 2010). Some pursue a quasi-Newton approach, defining approximate Hessians, or apply factorization to reduce the effective bandwidth of the Hessian (Botev, Ritter, and Barber 2017; Xu, Roosta, and Mahoney 2019). Our work pursues a (block) coordinate descent strategy, partitioning degrees of freedom into sub-problems amenable to more sophisticated optimization (Nesterov 2012; Wright 2015; Blondel, Seki, and Uehara 2013). Many works have successfully employed such schemes in ML contexts (e.g. Blondel, Seki, and Uehara 2013; Fu 1998; Shevade and Keerthi 2003; Clarkson, Hazan, and Woodruff 2012), but they typically rely on stochastic partitioning of variables rather than the partition of the weights of deep neural networks into hidden layer variables and their complement pursued here. The strategy of extracting convex approximations to nonlinear loss functions is classical (Bubeck 2014), and some works have attempted to minimize general loss functions by minimizing surrogate ℓ2 problems (Barratt and Boyd 2020).

Results

We study the performance and properties of the NGD algorithm as compared to standard stochastic gradient descent (GD) on several benchmark problems with various architectures. We start by applying dense network architectures to classification in the peaks problem. This allows us to plot and compare the qualitative properties of the basis functions Φ_α(·, ξ) encoded in the hidden layer (4) when trained with the two methods. We then compare the performance of NGD and GD for the standard image classification benchmarks CIFAR-10, MNIST, and Fashion MNIST using both dense and convolutional (ConvNet) architectures. Throughout this section, we compare performance in terms of iterations of Alg. 1 for NGD and iterations of stochastic gradient descent, each of which achieves a single update of the parameters (W, ξ) in the respective algorithm based on a batch B; this is the number of epochs multiplied by the number of batches.

Peaks problem

The peaks benchmark is a synthetic dataset for understanding the qualitative performance of classification algorithms (Haber and Ruthotto 2017). Here, a scattered point cloud in the two-dimensional unit square [0, 1]^2 is partitioned into disjoint sets. The classification problem is to determine which of those sets a given 2D point belongs to. The two-dimensional nature allows visualization of how NGD and GD classify data. In particular, plots of both how the nonlinear basis encoded by the hidden layer maps onto classification space and how the linear layer combines the basis functions to assign a probability map over the input space are readily obtained. We train a depth-4 dense network of the form (1) with N_in = 2, three hidden layers of width 12 contracting to a final hidden layer of width N_basis = 6, with tanh activation and batch normalization, and N_classes = 5 classes. As specified by Haber and Ruthotto (2017), 5000 training points are sampled from [0, 1]^2. The upper-left image in Figure 2 shows the sampled data points with their observed classes.

For the peaks benchmark we use a single batch containing all training points, i.e. B = D. The NGD algorithm uses 5 Newton iterations per training step with 3 CG iterations approximating the linear solve. The learning rate for Adam for both NGD and GD is 10^-4.
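In Keras form, the peaks network described above might be assembled as in the sketch below. The placement of batch normalization relative to the tanh activations, the absence of a bias in the linear layer, and the layer names are assumptions made for illustration; they are not specified in the text.

```python
import tensorflow as tf

# Sketch of the dense peaks network described above: three width-12 tanh hidden
# layers with batch normalization, a width-6 basis layer, and a bias-free linear
# layer for 5 classes. Batch-norm placement and bias choices are assumptions.
def make_peaks_model(n_in=2, n_basis=6, n_classes=5):
    layers = [tf.keras.layers.InputLayer(input_shape=(n_in,))]
    for _ in range(3):
        layers.append(tf.keras.layers.Dense(12, activation="tanh"))
        layers.append(tf.keras.layers.BatchNormalization())
    layers.append(tf.keras.layers.Dense(n_basis, activation="tanh"))        # basis Phi_alpha
    hidden = tf.keras.Sequential(layers, name="F_HL")                       # parameters xi
    linear = tf.keras.layers.Dense(n_classes, use_bias=False, name="F_LL")  # weights W
    return tf.keras.Sequential([hidden, linear])
```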
Figure 1 demonstrates that for an identical architecture, NGD provides a rapid increase in both training and validation accuracy over GD after a few iterations. For a large number of iterations both approaches achieve comparable training accuracy, although NGD generalizes better to the validation set. The improvement in validation accuracy is borne out in Figure 2, which compares representative instances of training using GD and NGD. While a single instance is shown, the character of these results is consistent with other neural networks trained for the peaks problem in the same way. The top row illustrates the predicted classes argmax[F_SM(x)] ∈ {0, 1, 2, 3, 4} for x ∈ [0, 1]^2 and the training data, demonstrating that the NGD-trained network predicts the class i = 2 of lowest training point density more accurately than the GD-trained network. The remaining sets of images visualize both the classification probability map [F_SM(x)]_i for i ∈ {0, 1, 2, 3, 4} (middle row) and the six basis functions Φ_α(x, ξ) (bottom row) learned by each optimizer. The difference in the learned bases is striking. GD learns a basis that is nearly discontinuous, in which the support of each basis function appears fit to the class boundaries. On the other hand, NGD learns a far smoother basis that can be combined to give sharper class boundary predictions. This manifests in the resulting probability map assigned to each class: linear combinations of the rougher GD basis result in imprinting and assignment of probability far from the relevant class. This serves as an explanation of the improved validation accuracy of NGD as compared to GD despite similar final training accuracy. The NGD algorithm separates refinement of the basis from the determination of the coefficients. This provides an effective regularization of the final basis, leading to improved generalization.

Figure 1: The training (left) and validation (right) accuracy for the peaks problem for both gradient descent (GD) and the Newton/gradient descent (NGD) algorithm. The solid lines represent the mean of 16 independent runs, and the shaded areas represent the mean ± one standard deviation.

Figure 2: Results for the peaks benchmark, with comparison between NGD and GD on an identical architecture. In this example, GD obtained a training accuracy of 99.3% and validation accuracy of 96.2%, while NGD obtained a training accuracy of 99.6% and validation accuracy of 98.0%. Top: Training data (left), classification by GD (center), and classification by NGD (right). GD misclassifies large portions of the input space. Middle: The linear and softmax layers combine basis functions to assign classification probabilities to each class. The sharp basis learned by GD leads to artifacts and attribution of probability far from the sets (left), while the diffuse basis in NGD provides a sharp characterization of class boundaries (right). Bottom: Adaptive basis encoded by the hidden layer, as learnt by GD (left) and NGD (right). For GD the basis is sharp, and individual basis functions conform to classification boundaries, while NGD learns a more regular basis.
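The basis and probability maps in Figure 2 amount to evaluating F_HL and the full network on a grid over [0, 1]^2. The sketch below shows one way to do this with matplotlib, assuming a trained TF2/Keras model structured as in the earlier peaks sketch (so that model.layers unpacks into the hidden network and the linear layer); it is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Render the basis functions Phi_alpha and the class probability maps of
# Figure 2 by evaluating F_HL and the full model on a grid over [0, 1]^2.
# `model` is assumed to be built as in the peaks sketch above.
def plot_basis_and_probabilities(model, n_grid=200):
    xs = np.linspace(0.0, 1.0, n_grid)
    grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2).astype("float32")

    F_HL, F_LL = model.layers
    basis = F_HL(grid, training=False).numpy()        # Phi_alpha(x, xi), one column per alpha
    logits = F_LL(basis).numpy()
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)          # softmax probability maps

    for name, fields in (("basis", basis), ("class probability", probs)):
        _, axes = plt.subplots(1, fields.shape[1], figsize=(3 * fields.shape[1], 3))
        for k, ax in enumerate(np.atleast_1d(axes)):
            ax.imshow(fields[:, k].reshape(n_grid, n_grid),
                      origin="lower", extent=(0, 1, 0, 1))
            ax.set_title(f"{name} {k}")
    plt.show()
```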
Image recognition benchmarks

We consider in this section a collection of image classification benchmarks: MNIST (Deng 2012; Grother 1995), Fashion MNIST (Xiao, Rasul, and Vollgraf 2017), and CIFAR-10 (Krizhevsky, Hinton et al. 2009). We focus primarily on CIFAR-10 due to its increased difficulty; it is well known that one may obtain near-perfect accuracy on the MNIST benchmark without a sophisticated choice of architecture. For all problems, we consider a simple dense network architecture to highlight the role of the optimizer, and for CIFAR-10 we also utilize a convolutional architecture (ConvNet). This highlights that our approach applies to general hidden layer architectures. Our objective is to demonstrate improvements in accuracy due to optimization with all other aspects held equal. For CIFAR-10, for example, the state of the art requires application of techniques such as data augmentation and complicated architectures to realize good accuracy; for simplicity we do not consider such complications, to allow a simple comparison. The code for this study is provided at github.com/rgp62/.

For all results reported in this section, we first optimize the hyperparameters listed in Table 1 by maximizing the validation accuracy over the training run. We perform this search using the Gaussian process optimization tool in the scikit-optimize package with default options (Head et al. 2018). This process is performed for both GD and NGD to allow a fair comparison; a sketch of this search loop is given below Table 1. The ranges for the search are shown in Table 1, along with the optimal hyperparameters for each dataset examined in this study. For all problems we partition the data into training, validation, and test sets to ensure that hyperparameter optimization is not overfitting. For MNIST and Fashion MNIST we consider a 50K/10K/10K partition, while for CIFAR-10 we consider a 40K/10K/10K partition. All training is performed with a batch size of 1000 over 100 epochs. For all results the test accuracy falls within the one-standard-deviation error bars included in Figure 3.

Table 1: Hyperparameters varied in the study, the ranges considered, and the optimal values found for MNIST, Fashion MNIST, CIFAR-10, and CIFAR-10 with the ConvNet architecture. For the learning rate and the Adam decay parameters, the optimal values are shown as NGD / GD. The optimal CG and Newton iterations are only applicable to NGD.

    Hyperparameter            Range             MNIST                  Fashion MNIST          CIFAR-10               CIFAR-10 ConvNet
    Learning rate             [10^-8, 10^-2]    10^-2.81 / 10^-2.26    10^-3.33 / 10^-2.30    10^-3.57 / 10^-2.50    10^-2.66 / 10^-2.30
    Adam decay parameter 1    [0.5, 1]          0.537 / 0.630          0.756 / 0.657          0.629 / 0.891          0.755 / 0.657
    Adam decay parameter 2    [0.5, 1]          0.830 / 0.616          0.808 / 0.976          0.782 / 0.808          0.858 / 0.976
    CG iterations             [1, 10]           3                      1                      2                      2
    Newton iterations         [1, 10]           6                      5                      4                      7
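The hyperparameter search described above can be driven by scikit-optimize's gp_minimize, as in the sketch below. Here `train_and_validate` is a hypothetical placeholder for a routine that trains with the given hyperparameters and returns the best validation accuracy over the run; the mapping of the "Adam decay parameters" to Adam's beta_1/beta_2 and the number of search calls are assumptions, while the search space mirrors the ranges in Table 1.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Sketch of the Gaussian process hyperparameter search over the ranges of Table 1.
space = [
    Real(1e-8, 1e-2, prior="log-uniform", name="learning_rate"),
    Real(0.5, 1.0, name="adam_beta_1"),
    Real(0.5, 1.0, name="adam_beta_2"),
    Integer(1, 10, name="cg_iterations"),
    Integer(1, 10, name="newton_iterations"),
]

def objective(params):
    lr, beta1, beta2, n_cg, n_newton = params
    val_acc = train_and_validate(lr, beta1, beta2, n_cg, n_newton)  # user-supplied routine
    return -val_acc                     # gp_minimize minimizes, so negate the accuracy

result = gp_minimize(objective, space, n_calls=50, random_state=0)
print("best hyperparameters found:", result.x)
```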
Figure 3 shows the training and validation accuracies using the optimal hyperparameters for a dense architecture with two hidden layers of width 128 and 10 and ReLU activation functions. We find that for all datasets NGD reaches its maximum validation accuracy more quickly than GD, while both optimizers achieve a similar final validation accuracy. For the more difficult CIFAR-10 benchmark, NGD attains the maximum validation accuracy of GD in roughly one-quarter of the number of iterations. In Figure 3, we also use the CIFAR-10 dataset to compare the dense architecture to the following ConvNet architecture:

    Convolution (8 channels, 3x3 kernel) → Max Pooling (2x2 window) → Convolution (16 channels, 3x3 kernel) → Max Pooling (2x2 window) → Convolution (16 channels, 3x3 kernel) → Dense (width 64) → Dense (width 10),

where the convolution and dense layers use the ReLU activation function (a code sketch of this architecture is given below). Again, NGD attains the maximum validation accuracy of GD in one-quarter the number of iterations, and also leads to an improvement of 1.76% in final test accuracy. This illustrates that NGD accelerates training and can improve accuracy for a variety of architectures.

Figure 3: Training accuracy (top row) and validation accuracy (bottom row) for the CIFAR-10, Fashion MNIST, and MNIST datasets using the dense architecture, and for CIFAR-10 using the ConvNet architecture. Mean and standard deviation over 10 training runs are shown.
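For concreteness, the ConvNet listed above could be written in Keras as in the following sketch. Reading the final width-10 dense layer as the linear layer F_LL (and therefore leaving it without an activation), as well as the padding, bias, and flattening choices, are assumptions not stated in the text.

```python
import tensorflow as tf

# Keras sketch of the ConvNet listed above for CIFAR-10 inputs.
convnet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),   # last hidden layer of F_HL
    tf.keras.layers.Dense(10, use_bias=False),      # linear layer F_LL, weights W
])
```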
Conclusion

The NGD method, motivated by the adaptive basis interpretation of deep neural networks, is a block coordinate descent method for classification problems. This method separates the weights of the linear layer from the weights of the preceding nonlinear layers. NGD uses this decomposition to exploit the convexity of the cross-entropy loss with respect to the linear layer variables. It utilizes a Newton method to approximate the global minimum for a given batch of data before performing a step of gradient descent for the remaining variables. As such, it is a hybrid first/second-order optimizer which extracts significant performance from a second-order substep that scales only with the number of weights in the linear layer, making it an appealing and feasible application of second-order methods for training very deep neural networks. Applying this optimizer to dense and convolutional networks, we have demonstrated acceleration with respect to the number of epochs in the validation loss for the peaks, MNIST, Fashion MNIST, and CIFAR-10 benchmarks, with improvements in accuracy for the peaks benchmark and for the CIFAR-10 benchmark using a convolutional network.

Examining the basis functions encoded in the hidden layer of the network in the peaks benchmark revealed a significant qualitative difference between NGD and stochastic gradient descent in the exploration of the parameter space corresponding to the hidden layer variables. This, and the role of the tolerance in the Newton step as an implicit regularizer, merit further study.

The difference in the regularity of the learned basis and probability classes suggests that one may obtain a qualitatively different model by varying only the optimization scheme used. We hypothesize that this more regular basis may have desirable robustness properties which may affect the resulting model sensitivity. This could have applications toward training networks to be more robust against adversarial attacks.

Acknowledgements

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND Number: SAND2021-3013 C.

The work of R. Patel, N. Trask, and M. Gulian is supported by the U.S. Department of Energy, Office of Advanced Scientific Computing Research under the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs) project. E. C. Cyr is supported by the Department of Energy early career program. M. Gulian is supported by the John von Neumann fellowship at Sandia National Laboratories.
References

Armijo, L. 1966. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics 16(1): 1–3.

Barratt, S. T.; and Boyd, S. P. 2020. Least squares auto-tuning. Engineering Optimization 1–22.

Blondel, M.; Seki, K.; and Uehara, K. 2013. Block coordinate descent algorithms for large-scale sparse multiclass classification. Machine Learning 93(1): 31–52.

Botev, A.; Ritter, H.; and Barber, D. 2017. Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 557–565. JMLR.org.

Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer.

Bottou, L.; Curtis, F. E.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2): 223–311.

Boyd, S.; Boyd, S. P.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Bubeck, S. 2014. Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980.

Clarkson, K. L.; Hazan, E.; and Woodruff, D. P. 2012. Sublinear optimization for machine learning. Journal of the ACM (JACM) 59(5): 1–49.

Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2019. Robust training and initialization of deep neural networks: An adaptive basis viewpoint. arXiv preprint arXiv:1912.04862.

Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; and Petrova, G. 2019. Nonlinear approximation and (deep) ReLU networks. arXiv preprint arXiv:1905.02199.

Deng, L. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29(6): 141–142.

Dennis Jr, J. E.; and Schnabel, R. B. 1996. Numerical Methods for Unconstrained Optimization and Nonlinear Equations, volume 16. SIAM.

Folland, G. B. 1999. Real Analysis: Modern Techniques and Their Applications, volume 40. John Wiley & Sons.

Fu, W. J. 1998. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics 7(3): 397–416.

Grother, P. J. 1995. NIST special database 19 handprinted forms and characters database. National Institute of Standards and Technology.

Haber, E.; and Ruthotto, L. 2017. Stable architectures for deep neural networks. Inverse Problems 34(1): 014004.

Head, T.; MechCoder; Louppe, G.; Shcherbatyi, I.; fcharras; Vinícius, Z.; cmmalone; Schröder, C.; nel215; Campos, N.; et al. 2018. scikit-optimize/scikit-optimize: v0.5.1 - re-release. doi:10.5281/zenodo.1170575.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.

Martens, J. 2010. Deep learning via Hessian-free optimization. In ICML, volume 27, 735–742.

McLoone, S.; Brown, M. D.; Irwin, G.; and Lightbody, A. 1998. A hybrid linear/nonlinear training algorithm for feedforward neural networks. IEEE Transactions on Neural Networks 9(4): 669–684.

Nesterov, Y. 2012. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization 22(2): 341–362.

Opschoor, J. A.; Petersen, P.; and Schwab, C. 2019. Deep ReLU networks and high-order finite element methods. SAM, ETH Zürich.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Yokota, R.; and Matsuoka, S. 2019. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 12359–12367.

Osawa, K.; Tsuji, Y.; Ueno, Y.; Naruse, A.; Foo, C.-S.; and Yokota, R. 2020. Scalable and practical natural gradient for large-scale deep learning. arXiv preprint arXiv:2002.06015.

Pereyra, V.; Scherer, G.; and Wong, F. 2006. Variable projections neural network training. Mathematics and Computers in Simulation 73(1-4): 231–243.

Shevade, S. K.; and Keerthi, S. S. 2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19(17): 2246–2253.

Wright, S. J. 2015. Coordinate descent algorithms. Mathematical Programming 151(1): 3–34.

Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Xu, P.; Roosta, F.; and Mahoney, M. W. 2019. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming 1–36.

Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks 94: 103–114.