Partition of unity networks: deep hp-approximation

Kookjin Lee¹, Nathaniel A. Trask², Ravi G. Patel², Mamikon A. Gulian², Eric C. Cyr²
¹ Extreme Scale Data Science & Analytics Department, Sandia National Laboratories
² Center for Computing Research, Sandia National Laboratories
Albuquerque, New Mexico, 87123
{koolee, natrask, rgpatel, mgulian, eccyr}@sandia.gov

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Approximation theorists have established best-in-class optimal approximation rates of deep neural networks by utilizing their ability to simultaneously emulate partitions of unity and monomials. Motivated by this, we propose partition of unity networks (POUnets), which incorporate these elements directly into the architecture. Classification architectures of the type used to learn probability measures are used to build a meshfree partition of space, while polynomial spaces with learnable coefficients are associated to each partition. The resulting hp-element-like approximation allows use of a fast least-squares optimizer, and the resulting architecture size need not scale exponentially with spatial dimension, breaking the curse of dimensionality. An abstract approximation result establishes desirable properties to guide network design. Numerical results for two choices of architecture demonstrate that POUnets yield hp-convergence for smooth functions and consistently outperform MLPs for piecewise polynomial functions with large numbers of discontinuities.

Overview

We consider regression over the set D = {(x_i, y_i)}_{i=1}^{N_data}, where x_i ∈ R^d and y_i = y(x_i) are point samples of a piecewise smooth function. Following successes in classification problems in high-dimensional spaces (Chollet 2017), deep neural networks (DNNs) have garnered tremendous interest as tools for regression problems and numerical analysis, partially due to their apparent ability to alleviate the curse of dimensionality in the presence of latent low-dimensional structure. This is in contrast to classical methods, for which the computational expense grows exponentially with d, a major challenge for the solution of high-dimensional PDEs (Bach 2017; Bengio and Bengio 2000; Han, Jentzen, and Weinan 2018).

Understanding the performance of DNNs requires accounting for both optimal approximation error and optimization error. While one may prove existence of DNN parameters providing exponential convergence with respect to architecture size, in practice a number of issues conspire to prevent realizing such convergence. Several approximation-theoretic works seek to understand the role of width/depth in the absence of optimization error (He et al. 2018; Daubechies et al. 2019; Yarotsky 2017, 2018; Opschoor, Petersen, and Schwab 2019). In particular, Yarotsky and Opschoor et al. prove the existence of parameters for a deep neural network architecture that approximate algebraic operations, partitions of unity (POUs), and polynomials to exponential accuracy in the depth of the network. This shows that sufficiently deep DNNs may in theory learn a spectrally convergent hp-element space without a hand-tailored mesh by constructing a POU to localize polynomial approximation. In practice, however, such convergent approximations are not realized when training DNNs using gradient descent optimizers, even for smooth target functions (Fokina and Oseledets 2019; Adcock and Dexter 2020). The thesis of this work is to incorporate the POU and polynomial elements directly into a deep learning architecture. Rather than attempting to force a DNN to simultaneously perform localization and high-order approximation, introducing localized polynomial spaces frees the DNN to play to its strengths by focusing exclusively on partitioning space, as in classification problems.
An attractive property of the proposed architecture is its amenability to a fast training strategy. In previous work, we developed an optimizer which alternates between a gradient descent update of hidden layer parameters and a globally optimal least-squares solve for a final linear layer (Cyr et al. 2019); this was applied as well to classification problems (Patel et al. 2020). A similar strategy is applied in the current work: updating the POU with gradient descent before finding a globally optimal polynomial fit at each iteration ensures an optimal representation of the data over the course of training.

While DNNs have been explored as a means of solving high-dimensional PDEs (Geist et al. 2020; Han, Jentzen, and Weinan 2018), optimization error prevents a practical demonstration of convergence with respect to the size of either data or model parameters (Beck, Jentzen, and Kuckuck 2019; Wang, Teng, and Perdikaris 2020). The relatively simple regression problem considered here provides a critical first example of how the optimization error barrier may be circumvented to provide accuracy competitive with finite element methods.

An abstract POU network

Consider a partition of unity Φ = {φ_α(x)}_{α=1}^{N_part} satisfying Σ_α φ_α(x) = 1 and φ_α(x) ≥ 0 for all x. We work with the approximant

    y_POU(x) = Σ_{α=1}^{N_part} φ_α(x) Σ_{β=1}^{dim(V)} c_{α,β} P_β(x),    (1)

where V = span{P_β}. For this work, we take V to be the space π_m(R^d) of polynomials of order m, while Φ is parametrized as a neural network with weights and biases ξ and output dimension N_part:

    φ_α(x; ξ) = [NN(x; ξ)]_α,    1 ≤ α ≤ N_part.    (2)

We consider two architectures for NN(x; ξ), to be specified later. Approximants of the form (1) allow a "soft" localization of the basis elements P_β to an implicit partition of space parametrized by the φ_α. While approximation with broken polynomial spaces corresponds to taking Φ to consist of characteristic functions on the cells of a computational mesh, the parametrization of Φ by a DNN generalizes more broadly to differentiable partitions of space. For applications of partitions of unity in numerical analysis, see Strouboulis, Copps, and Babuška (2001); Fries and Belytschko (2010); Wendland (2002); for their use in differential geometry, see Spivak (2018); Hebey (2000).

In a traditional numerical procedure, Φ is constructed prior to fitting the c_{α,β} to data through a geometric "meshing" process. We instead work with a POU Φ^ξ in the form of a DNN (2) in which the weights and biases ξ are trained to fit the data. We therefore fit both the localized basis coefficients c = [c_{α,β}] and the localization itself simultaneously by solving the optimization problem

    argmin_{ξ,c} Σ_{i∈D} ( Σ_{α=1}^{N_part} φ_α(x_i; ξ) Σ_{β=1}^{dim(V)} c_{α,β} P_β(x_i) − y_i )².    (3)
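To make (1)–(2) concrete, the following is a minimal Python/NumPy sketch of evaluating a POUnet approximant. The function and argument names (pou_approximant, partition_net, poly_basis) are illustrative assumptions rather than the authors' implementation; any network whose outputs are nonnegative and sum to one row-wise can play the role of Φ.

    import numpy as np

    def pou_approximant(x, partition_net, poly_basis, c):
        """Evaluate y_POU(x) = sum_a phi_a(x) * sum_b c[a, b] * P_b(x)  (Eq. (1)).

        x             : (N, d) array of evaluation points
        partition_net : callable, (N, d) -> (N, N_part) rows nonnegative, summing to 1
        poly_basis    : list of callables P_b, each mapping (N, d) -> (N,)
        c             : (N_part, dim_V) array of learnable polynomial coefficients
        """
        phi = partition_net(x)                                  # (N, N_part) partition of unity
        P = np.stack([Pb(x) for Pb in poly_basis], axis=1)      # (N, dim_V) polynomial features
        local_fits = P @ c.T                                    # (N, N_part) local polynomial values
        return np.sum(phi * local_fits, axis=1)                 # blend local fits with partition weights

Because (1) is linear in c once the partition values are fixed, fitting the coefficients reduces to a least-squares problem; this is the structure exploited by the fast optimizer described below.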
Error analysis and architecture design

Before specifying a choice of architecture for Φ^ξ in (2), we present a basic estimate of the optimal training error using the POUnet architecture to highlight desirable properties for the partition to have. We denote by diam(A) the diameter of a set A ⊂ R^d.

Theorem 1. Consider an approximant y_POU of the form (1) with V = π_m(R^d). If y(·) ∈ C^{m+1}(Ω) and ξ*, c* solve (3) to yield the approximant y*_POU, then

    ‖y*_POU − y‖_{ℓ2(D)} ≤ C_{m,y} (max_α diam supp(φ_α^ξ))^{m+1},    (4)

where ‖y*_POU − y‖_{ℓ2(D)} denotes the root-mean-square norm over the training data pairs in D,

    ‖y*_POU − y‖_{ℓ2(D)} = sqrt( (1/N_data) Σ_{(x,y)∈D} (y*_POU(x) − y(x))² ),

and C_{m,y} = ‖y‖_{C^{m+1}(Ω)}.

Proof. For each α, take q_α ∈ π_m(R^d) to be the mth-order Taylor polynomial of y(·) centered at any point of supp(φ_α^ξ). Then for all x ∈ supp(φ_α^ξ),

    |q_α(x) − y(x)| ≤ C_{m,y} (diam supp(φ_α^ξ))^{m+1}.    (5)

Define the approximant ỹ_POU(x) = Σ_{α=1}^{N_part} φ_α^ξ(x) q_α(x), which is of the form (1) and represented by a feasible (ξ, c). Then by the definition of y*_POU and (3), we have

    ‖y*_POU − y‖²_{ℓ2(D)} ≤ ‖ỹ_POU − y‖²_{ℓ2(D)}
                          = ‖ Σ_α φ_α^ξ(x) q_α(x) − y(x) Σ_α φ_α^ξ(x) ‖²_{ℓ2(D)}
                          = ‖ Σ_α φ_α^ξ(x) (q_α(x) − y(x)) ‖²_{ℓ2(D)}.

For each x = x_i ∈ D, if x ∈ supp(φ_α^ξ) we apply (5); otherwise the summand φ_α^ξ(x)(q_α(x) − y(x)) vanishes. So

    ‖y*_POU − y‖²_{ℓ2(D)} ≤ ‖ Σ_α C_{m,y} (diam supp(φ_α^ξ))^{m+1} φ_α^ξ(x) ‖²_{ℓ2(D)}
                          ≤ [ C_{m,y} (max_α diam supp(φ_α^ξ))^{m+1} ]² ‖ Σ_α φ_α^ξ(x) ‖²_{ℓ2(D)}
                          = [ C_{m,y} (max_α diam supp(φ_α^ξ))^{m+1} ]²,

since Σ_α φ_α^ξ = 1; taking square roots yields (4).

Theorem 1 indicates that for smooth y, the resulting convergence rate of the training error is independent of the specific choice of parameterization of the φ_α^ξ, and depends only upon their support. Further, the error does not explicitly depend upon the dimension of the problem; if the trained Φ^ξ encodes a covering by supp(φ_α^ξ) of a low-dimensional manifold containing the data locations x with latent dimension d_M ≪ d, then the approximation cost scales only with dim(V) and thus may break the curse of dimensionality; e.g., for linear polynomial approximation dim(V) = d + 1. If the parameterization is able to find compactly supported quasi-uniform partitions such that the maximum diameter in (4) scales as N_part^{−1/d_M}, the training loss will exhibit a scaling of N_part^{−(m+1)/d_M}.

The above analysis indicates the advantage of enforcing highly localized, compact support of the POU functions. However, in practice we found that for a POU parametrized by a shallow RBF-Net (described below), rapidly decaying but globally supported POU functions exhibited more consistent training results. This is likely due to a higher tolerance to initializations poorly placed with respect to the data locations. Similarly, when the POU is parametrized by a deep ResNet, we obtained good results using ReLU activation functions while training with a regularizer to promote localization (Algorithm 2). Thus, the properties playing a role in the above analysis, localization of the partition functions and distribution of their support over a latent manifold containing the data, are the motivation for the architectures and training strategies we consider below.

POU #1 - RBF-Net: A shallow RBF-network (Broomhead and Lowe 1988; Billings and Zheng 1995) implementation of Φ^ξ is given by (1) and

    φ_α(x) = exp(−|x − ξ_{1,α}|² / ξ_{2,α}²) / Σ_β exp(−|x − ξ_{1,β}|² / ξ_{2,β}²).

Here, ξ_1 denotes the RBF centers and ξ_2 denotes the RBF shape parameters, both of which evolve during training. A measure of the localization of these functions can be taken to be the magnitude of ξ_2. Such an architecture works well for approximation of smooth functions, but the C^∞ continuity of Φ^ξ causes difficulty in the approximation of piecewise smooth functions.
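As an illustration, a minimal sketch of this normalized-RBF partition follows, assuming plain NumPy arrays for the trainable centers ξ_1 and shape parameters ξ_2; the helper name and array layout are assumptions, not the authors' code.

    import numpy as np

    def rbf_partition(x, centers, shapes):
        """Normalized Gaussian RBF partition of unity (POU #1).

        x       : (N, d) evaluation points
        centers : (N_part, d) RBF centers (xi_1 in the text)
        shapes  : (N_part,)   RBF shape parameters (xi_2 in the text)
        returns : (N, N_part) rows that are nonnegative and sum to 1
        """
        sq_dist = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # (N, N_part)
        logits = -sq_dist / shapes[None, :] ** 2
        # Subtracting the row-wise max before exponentiating improves numerical stability
        # without changing the normalized result.
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        return w / w.sum(axis=1, keepdims=True)

Written this way, the partition is a softmax over negative scaled squared distances, so it is everywhere positive and sums to one by construction, with decay rates controlled by the shape parameters.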
POU #2 - ResNet: We compose a residual network architecture (He et al. 2016) with a softmax layer S to define (2). For the experiments considered we use a ReLU activation, allowing representation of functions with discontinuities in their first derivative.

Initialization: All numerical experiments take data supported within the unit hypercube Ω ⊂ [0, 1]^d. To initialize the POU #1 architecture we use unit shape parameters and uniformly distribute the centers, ξ_1 ∼ U([0, 1]^d). We initialize POU #2 with the box initialization (Cyr et al. 2019). We found that these initializations provide an initial set of partitions sufficiently "well-distributed" throughout Ω for successful training.

Fast optimizer

The least-squares structure of (3) allows application of the least-squares gradient descent (LSGD) block coordinate descent strategy (Cyr et al. 2019). At each epoch, one may hold the hidden parameters ξ fixed to obtain a least-squares problem for the optimal polynomial coefficients c. The step may be concluded by then taking a step of gradient descent for the POU parameters, therefore evolving the partitions along the manifold corresponding to optimal representation of the data; for details see (Cyr et al. 2019).

Algorithm 1 with λ = 0 specifies the application of LSGD to Eqn. (3). We will demonstrate that, while effective, several of the learned partition functions φ_α may "collapse" to near-zero values everywhere. To remedy this, we will also consider a pre-training step in Algorithm 2, which adds an ℓ2 regularizer to the polynomial coefficients. The intuition behind this is that a given partition regresses data using an element of the form c_{α,β} φ_α P_β. If φ_α is scaled by a small δ > 0, the LSGD solver may pick up a scaling 1/δ for c_{α,β} and achieve the same approximation. Limiting the coefficients thus indirectly penalizes this mode of partition function collapse, promoting more quasi-uniform partitions of space.

Algorithm 1: The regularized least-squares gradient descent (LSGD) method. Setting λ = 0 recovers the original LSGD method (Cyr et al. 2019).

    Data: ξ_old, c_old
    Result: ξ_new, c_new
    Function LSGD(ξ_old, c_old, n_epoch, λ, ρ, n_stag):
        ξ, c ← ξ_old, c_old
        for i ∈ {1, ..., n_epoch} do
            c ← LS(ξ, λ)    // solve Eqn. (3) for c with a regularizer λ‖c‖²_F
            ξ ← GD(c)       // gradient descent step on the POU parameters
            if LSGD stagnates for more than n_stag epochs then
                λ ← ρλ
            end
        end
        c_new ← LS(ξ, 0)
        ξ_new ← ξ

Algorithm 2: The two-phase LSGD method with ℓ2-regularized least-squares solves.

    Data: ξ_old, c_old
    Result: ξ_new, c_new
    Function Two-phase-LSGD(ξ_old, c_old, n_epoch, n_pre_epoch, λ, ρ, n_stag):
        ξ, c ← LSGD(ξ_old, c_old, n_pre_epoch, λ, ρ, n_stag)    // Phase 1: LSGD with a regularizer
        ξ_new, c_new ← LSGD(ξ, c, n_epoch, 0, 0, n_epoch)       // Phase 2: LSGD without a regularizer
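A minimal NumPy sketch of one LSGD epoch under these definitions is given below. It assumes generic callables partition_net and grad_loss_xi for the POU network and its loss gradient (illustrative names, not the authors' TensorFlow implementation), and realizes the LS(ξ, λ) step of Algorithm 1 by solving regularized normal equations.

    import numpy as np

    def lsgd_step(x, y, xi, partition_net, poly_basis, lam, lr, grad_loss_xi):
        """One epoch of (regularized) LSGD: exact LS solve for c, then a GD step on xi.

        x, y          : (N, d), (N,) training data
        xi            : current POU parameters (any structure grad_loss_xi understands)
        partition_net : callable (x, xi) -> (N, N_part) partition values
        poly_basis    : list of callables P_b, each (N, d) -> (N,)
        lam           : Frobenius-norm regularizer weight on c
        grad_loss_xi  : callable (x, y, xi, c) -> gradient of the loss w.r.t. xi
        """
        phi = partition_net(x, xi)                                 # (N, N_part)
        P = np.stack([Pb(x) for Pb in poly_basis], axis=1)         # (N, dim_V)
        # Design matrix of the problem that is linear in c; columns indexed by (alpha, beta).
        A = (phi[:, :, None] * P[:, None, :]).reshape(len(x), -1)  # (N, N_part * dim_V)
        # Regularized normal equations: (A^T A + lam I) vec(c) = A^T y.
        n_cols = A.shape[1]
        c_vec = np.linalg.solve(A.T @ A + lam * np.eye(n_cols), A.T @ y)
        c = c_vec.reshape(phi.shape[1], P.shape[1])                # (N_part, dim_V)
        xi = xi - lr * grad_loss_xi(x, y, xi, c)                   # gradient descent on the POU
        return xi, c

Forming the normal equations is only the simplest choice here; a QR- or SVD-based least-squares solve would be preferable when the design matrix is ill-conditioned, which is consistent with the conditioning issues noted in the experiments below.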
Numerical experiments

In this section, we assess the performance of POUnets with the two POU architectures. We implement the proposed algorithms in Python, construct POUnets using TensorFlow 2.3.0 (Abadi et al. 2016), and employ the NumPy and SciPy packages for generating training data and solving the least-squares problem. For all considered neural networks, training is performed by batch gradient descent using the Adam optimizer (Kingma and Ba 2014). For a set of d-variate polynomials {P_β}, we choose truncated Taylor polynomials of maximal degree m_max, providing (m_max + d)! / (m_max! d!) polynomial coefficients per partition. The coefficients are initialized via sampling from the unit normal distribution.

Smooth functions

We consider an analytic function as our first benchmark: the sine function defined on a cross-shaped one-dimensional manifold embedded in the two-dimensional domain [−1, 1]²,

    y(x) = sin(2πx_1), if x_2 = 0,
           sin(2πx_2), if x_1 = 0.

We test RBF-Nets for a varying number of partitions, N_part = {1, 2, 4, 8, 16}, and maximal polynomial degrees {0, 1, 2, 3, 4}. For training, we collect data by uniformly sampling 501 values of each coordinate x_i, i = 1, 2, from [−1, 1], i.e., 1001 ((x_1, x_2), y(x))-pairs in total after removing the duplicate point at the origin. We initialize the centers of the RBF basis functions by sampling uniformly from the domain [−1, 1]² and initialize the shape parameters as ones. We then train the RBF-Nets using the LSGD method with λ = 0 (Algorithm 1), with the initial learning rate for Adam set to 10^−3. The maximum number of epochs n_epoch is set to 100, and we choose the centers and shape parameters that yield the best result on the training loss during the specified epochs.

Figure 1(a) reports the relative ℓ2-errors of approximants produced by POUnets for varying N_part and varying polynomial degree. The results are obtained from 10 independent runs for each value of N_part and degree, with a single log-normal standard deviation. Algebraic convergence is observed, of order increasing with polynomial degree, before saturating at 10^−10. We conjecture this is due to eventual loss of precision caused by the ill-conditioning of the least-squares matrix; however, we leave a formal study and treatment for future work.

For comparison with the performance achieved by POUnets, we assess the performance of standard MLPs with varying depth {4, 8, 12, 16, 20} and width {8, 16, 32, 64, 128}. The standard MLPs are trained using Adam with an initial learning rate of 10^−3 and a maximum number of epochs of 1000. As opposed to the convergence behavior of the RBF-Net-based POUnets, the standard MLP briefly exhibits roughly first-order convergence before stagnating around an error of 10^−2.

Figure 1: Relative ℓ2-errors (log-log scale) of approximants produced by POUnets with RBF-Net partition functions for varying N_part and varying m_max (left) and standard MLPs for varying width and depth (right).
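For concreteness, the cross-shaped training set described in this subsection can be generated as in the following sketch; equispaced sampling of each branch and the helper name are assumptions (the text's "uniformly sampling" could equally mean uniform random draws).

    import numpy as np

    def make_cross_data(n_per_branch=501):
        """Sample the sine benchmark on the cross {x2 = 0} U {x1 = 0} inside [-1, 1]^2."""
        t = np.linspace(-1.0, 1.0, n_per_branch)
        branch1 = np.stack([t, np.zeros_like(t)], axis=1)   # x2 = 0, y = sin(2*pi*x1)
        branch2 = np.stack([np.zeros_like(t), t], axis=1)   # x1 = 0, y = sin(2*pi*x2)
        x = np.concatenate([branch1, branch2], axis=0)
        y = np.where(x[:, 1] == 0.0,
                     np.sin(2 * np.pi * x[:, 0]),
                     np.sin(2 * np.pi * x[:, 1]))
        # Remove the duplicated origin so 2 * 501 samples become 1001 unique pairs.
        x, idx = np.unique(x, axis=0, return_index=True)
        return x, y[idx]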
Piecewise Smooth Functions

We next consider piecewise linear and piecewise quadratic functions: triangle waves with varying frequencies, i.e., y(x) = TRI(x; p), and their quadratic variants, y(x) = TRI²(x; p), where

    TRI(x; p) = 2 |px − ⌊px + 1/2⌋| − 1.

We study the introduction of increasingly many discontinuities by increasing the frequency parameter p. Reproduction of such sawtooth functions by ReLU networks via both wide networks (He et al. 2018) and very deep networks (Telgarsky 2016) can be achieved theoretically via construction of carefully chosen architectures and weights/biases, but to our knowledge has not been achieved via standard training.

The smoothness possessed by the RBF-Net partition functions (POU #1) precludes rapidly convergent approximation of piecewise smooth functions, and we instead employ ResNets (POU #2) for Φ^ξ in this case, as the piecewise linear nature of such partition functions is better suited. As a baseline for the performance comparison, we apply regression of the same data over a mean-square error using standard ResNets, y_MLP(x). The standard ResNets share the same architectural design and hyperparameters as the ResNets used to parametrize the POU functions in the POUnets. The only exception is the output layer: the standard ResNets produce a scalar-valued output, whereas the POU parametrization produces a vector-valued output, which is followed by the softmax activation.

We consider five frequency parameters for both target functions, p = {1, 2, 3, 4, 5}, which results in piecewise linear and quadratic functions with 2^p pieces. Based on the number of pieces in the target function, we scale the width of the baseline neural networks and POUnets as 4 × 2^p while fixing the depth as 8, and for POUnets the number of partitions is set as N_part = 2^p. For POUnets, we choose the maximal degree of polynomials to be m_max = 1 and m_max = 2 for the piecewise linear and quadratic target functions, respectively. Both neural networks are trained on the same data, {x_i, y(x_i; p)}_{i=1}^{n_data}, where the x_i are uniformly sampled from [0, 1] and n_data = 2000. The standard ResNets are trained using the gradient descent method and the POUnets are trained using the LSGD method with λ = 0 (Algorithm 1). The initial learning rate for Adam is set as 10^−3. The maximum number of epochs n_epoch is set as 2000, and we choose the neural network weights and biases that yield the best result on the training loss during the specified epochs.

Figure 2 reports the relative ℓ2-errors of approximants produced by the standard ResNets and POUnets for varying N_part and width for the piecewise linear functions (left) and the piecewise quadratic functions (right). The statistics are obtained from five independent runs. Figure 2 essentially shows that the POUnets outperform the standard ResNets in terms of approximation accuracy; specifically, the standard ResNets with the ReLU activation function largely fail to produce accurate approximations, while POUnets obtain < 1% error for large numbers of discontinuities.

Figure 2: Relative ℓ2-errors (log-log scale) of approximants produced by the standard ResNets and POUnets with ResNet partition functions for varying N_part for approximating the target functions with varying p.

Figure 3 illustrates snapshots of the target functions and approximants produced by the two neural networks for target functions with the frequency parameters p = {4, 5}. Figure 3 confirms the trends shown in Figure 2; the benefits of using POUnets are more pronounced in approximating functions with larger p and in approximating quadratic functions (and, potentially, in approximating higher-order polynomials). Figures 3(c)–3(f) clearly show that the POUnets accurately approximate the target functions with sub-1% errors, while the standard ResNets fail to produce accurate approximations.

Figure 3: Snapshots of target functions y(x) and approximants produced by ResNet and POUnet (i.e., y_POU(x)), depicted in black, light green, and orange, respectively, for N_part = 16 and N_part = 32. The target functions correspond to triangular waves (left) and their quadratic variants (right). The bottom row depicts the snapshots in the domain [0.5, 0.625].
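For reference, a short sketch of the piecewise target functions follows. It implements the TRI definition as written above, so the exact amplitude and frequency convention should be treated as an assumption rather than the authors' verbatim formula.

    import numpy as np

    def tri(x, p):
        """Triangle wave TRI(x; p) = 2*|p*x - floor(p*x + 1/2)| - 1, as reconstructed above."""
        return 2.0 * np.abs(p * x - np.floor(p * x + 0.5)) - 1.0

    def tri2(x, p):
        """Quadratic variant TRI^2(x; p): squaring keeps the breakpoints but makes each piece quadratic."""
        return tri(x, p) ** 2

    # Training data on [0, 1], mirroring the setup described in the text.
    rng = np.random.default_rng(0)          # seed chosen arbitrarily for reproducibility
    x_train = rng.uniform(0.0, 1.0, 2000)
    y_train = tri(x_train, p=3)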
Results of two-phase LSGD

We now demonstrate the effectiveness of the two-phase LSGD method (Algorithm 2) in constructing partitions which are nearly disjoint and localized according to the features in the data, i.e., the breakpoints in the target function, which we found leads to better approximation accuracy. The first phase of the algorithm aims to construct such partitions. To this end, we limit the expressibility of the polynomial regression by regularizing the Frobenius norm of the coefficients, ‖c‖²_F, in the least-squares solves. This prevents a small number of partitions from dominating and other partitions from collapsing, per our discussion of the fast optimizer above. These "quasi-uniform" partitions are then used as the initial guess for the unregularized second phase. Finally, we study the qualitative differences in the resulting POU, particularly the ability of the network to learn a disjoint partition of space. We stress, however, that there is no particular "correct" form for the resulting POU; this is primarily of interest because it shows the network may recover a traditional discontinuous polynomial approximation.

In the first phase, we employ a relatively larger learning rate in order to prevent the weights and biases from getting stuck in local minima. In the second phase, we employ a relatively smaller learning rate to ensure that the training error decreases.

Triangular waves. We demonstrate how the two-phase LSGD works in two example cases: the triangular waves with frequency parameters p = 1 and p = 3. For p = 1, we use the same POU network architecture described in the previous section (i.e., ResNet with width 8 and depth 8) and the same initialization scheme (i.e., the box initialization). We set the initial learning rates for phase 1 and phase 2 to 0.1 and 0.05, respectively, and set the other LSGD parameters as λ = 0.1, ρ = 0.9, and n_stag = 1000. Figure 4 (top) depicts both how the partitions are constructed during phase 1 (Figures 4(a)–4(d)) and snapshots of the partitions and the approximant at the 1000th epoch of phase 2 (Figures 4(e) and 4(f)). The approximation accuracy measured in the relative ℓ2-norm of the error reaches 6.2042 × 10^−8 after 12000 epochs in phase 2.

Next we consider the triangular wave with frequency parameter p = 3. We use a POU network architecture constructed as a ResNet with width 8 and depth 10, and use the box initialization. We set the initial learning rates for phase 1 and phase 2 to 0.05 and 0.01, respectively, and set the other LSGD parameters as λ = 0.1, ρ = 0.999, and n_stag = 1000. Figure 4 (bottom) again depicts how the partitions evolve during phase 1 and the resulting approximant. Figures 4(g)–4(j) show that the two-phase LSGD constructs partitions that are disjoint and localized according to the features during the first phase.

Figure 4: Triangular wave with two pieces (top) and triangular wave with eight pieces (bottom): Phase 1 LSGD constructs localized disjoint partitions (4(a)–4(d), snapshots at epochs 0 through 60, and 4(g)–4(j), snapshots at epochs 0 through 90000) and Phase 2 LSGD produces an accurate approximation.
Quadratic waves. Finally, we approximate the piecewise quadratic waves with frequency parameters p = 1 and p = 3 while employing the same network architectures used for approximating the triangular waves. For the p = 1 case the learning rates are set as 0.5 and 0.25 for phase 1 and phase 2, and we use the same remaining parameter settings (i.e., λ = 0.1, ρ = 0.9, and n_stag = 1000) as in the previous experiment. Again, disjoint partitions are constructed in phase 1, and an accurate approximation is produced in phase 2 (Figures 5(a)–5(f)). Moreover, we observe that the partitions are further refined during phase 2. For the p = 3 case we again employ the architecture used for the triangular wave with p = 3 (i.e., ResNet with width 8 and depth 10) and the same hyperparameters (i.e., 0.05 and 0.01 for the learning rates, and n_stag = 1000), with two exceptions: the regularization weight λ = 1 and ρ = 0.999. Figures 5(g)–5(k) illustrate that the phase 1 LSGD constructs disjoint supports for the partitions, but takes many more epochs. Again, the two-phase LSGD produces an accurate approximation (Figure 5(l)).

Figure 5: Quadratic wave with two pieces (top) and quadratic wave with eight pieces (bottom): Phase 1 LSGD constructs localized disjoint partitions (5(a)–5(d), snapshots at epochs 0 through 9000, and 5(g)–5(k), snapshots at epochs 0 through 500000) and Phase 2 LSGD produces an accurate approximation.

Discussion. The results of this section demonstrate that it is important to apply a good regularizer during a pre-training stage to obtain approximately disjoint, piecewise constant partitions. For the piecewise polynomial functions considered here, such partitions allowed in some cases recovery to near-machine precision. Given the abstract error analysis, it is clear that such localized partitions, which adapt to the features of the target function, act similarly to traditional hp-approximation. There are several challenges regarding this approach, however: the strong regularization during phase 1 requires a large number of training epochs, and we were unable to obtain a set of hyperparameters which provides such clean partitions across all cases. This suggests two potential areas of future work. Firstly, an improved means of regularizing during pretraining that does not compromise accuracy may be able to train faster; deep learning strategies for parametrizing charts as in (Schonsheck, Chen, and Lai 2019) may be useful for this. Secondly, it may be fruitful to understand the approximation error under assumptions less restrictive than that the partitions are compactly supported or highly localized. We have been guided by the idea that polynomials defined on indicator-function partitions reproduce the constructions used by the finite element community, but a less restrictive paradigm combined with an effective learning strategy may prove more flexible for general datasets.

Conclusions

Borrowing ideas from numerical analysis, we have demonstrated a novel architecture and optimization strategy whose computational complexity need not scale exponentially with the ambient dimension and which provides high-order convergence for smooth functions and errors consistently under 1% for piecewise smooth problems. Such an architecture has the potential to provide DNN methods for solving high-dimensional PDEs that converge in a manner competitive with traditional finite element spaces.
Acknowledgements

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003530. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND Number: SAND2020-6022 J.

The work of M. Gulian, R. Patel, and N. Trask is supported by the U.S. Department of Energy, Office of Advanced Scientific Computing Research, under the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs) project. E. C. Cyr and N. Trask are supported by the Department of Energy early career program. M. Gulian is supported by the John von Neumann fellowship at Sandia National Laboratories.
References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Adcock, B.; and Dexter, N. 2020. The gap between theory and practice in function approximation with deep neural networks. arXiv preprint arXiv:2001.07523.

Bach, F. 2017. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research 18(1): 629–681.

Beck, C.; Jentzen, A.; and Kuckuck, B. 2019. Full error analysis for the training of deep neural networks. arXiv preprint arXiv:1910.00121.

Bengio, S.; and Bengio, Y. 2000. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks 11(3): 550–557.

Billings, S. A.; and Zheng, G. L. 1995. Radial basis function network configuration using genetic algorithms. Neural Networks 8(6): 877–890.

Broomhead, D. S.; and Lowe, D. 1988. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom).

Chollet, F. 2017. Deep Learning with Python. Manning Publications Company.

Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2019. Robust training and initialization of deep neural networks: An adaptive basis viewpoint. arXiv preprint arXiv:1912.04862.

Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; and Petrova, G. 2019. Nonlinear approximation and (deep) ReLU networks. arXiv preprint arXiv:1905.02199.

Fokina, D.; and Oseledets, I. 2019. Growing axons: greedy learning of neural networks with application to function approximation. arXiv preprint arXiv:1910.12686.

Fries, T.-P.; and Belytschko, T. 2010. The extended/generalized finite element method: an overview of the method and its applications. International Journal for Numerical Methods in Engineering 84(3): 253–304.

Geist, M.; Petersen, P.; Raslan, M.; Schneider, R.; and Kutyniok, G. 2020. Numerical solution of the parametric diffusion equation by deep neural networks. arXiv preprint arXiv:2004.12131.

Han, J.; Jentzen, A.; and Weinan, E. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34): 8505–8510.

He, J.; Li, L.; Xu, J.; and Zheng, C. 2018. ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hebey, E. 2000. Nonlinear Analysis on Manifolds: Sobolev Spaces and Inequalities, volume 5. American Mathematical Society.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Opschoor, J. A.; Petersen, P.; and Schwab, C. 2019. Deep ReLU networks and high-order finite element methods. SAM, ETH Zürich.

Patel, R. G.; Trask, N. A.; Gulian, M. A.; and Cyr, E. C. 2020. A block coordinate descent optimizer for classification problems exploiting convexity. arXiv preprint arXiv:2006.10123.

Schonsheck, S.; Chen, J.; and Lai, R. 2019. Chart Auto-Encoders for Manifold Structured Data. arXiv preprint arXiv:1912.10094.

Spivak, M. 2018. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. CRC Press.

Strouboulis, T.; Copps, K.; and Babuška, I. 2001. The generalized finite element method. Computer Methods in Applied Mechanics and Engineering 190(32-33): 4081–4193.

Telgarsky, M. 2016. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.

Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.

Wendland, H. 2002. Fast evaluation of radial basis functions: Methods based on partition of unity. In Approximation Theory X: Wavelets, Splines, and Applications. Citeseer.

Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks 94: 103–114.

Yarotsky, D. 2018. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620.