Partition of unity networks: deep hp-approximation

Kookjin Lee¹, Nathaniel A. Trask², Ravi G. Patel², Mamikon A. Gulian², Eric C. Cyr²
¹ Extreme Scale Data Science & Analytics Department, Sandia National Laboratories
² Center for Computing Research, Sandia National Laboratories
Albuquerque, New Mexico, 87123
{koolee, natrask, rgpatel, mgulian, eccyr}@sandia.gov

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Approximation theorists have established best-in-class optimal approximation rates of deep neural networks by utilizing their ability to simultaneously emulate partitions of unity and monomials. Motivated by this, we propose partition of unity networks (POUnets), which incorporate these elements directly into the architecture. Classification architectures of the type used to learn probability measures are used to build a meshfree partition of space, while polynomial spaces with learnable coefficients are associated to each partition. The resulting hp-element-like approximation allows use of a fast least-squares optimizer, and the resulting architecture size need not scale exponentially with spatial dimension, breaking the curse of dimensionality. An abstract approximation result establishes desirable properties to guide network design. Numerical results for two choices of architecture demonstrate that POUnets yield hp-convergence for smooth functions and consistently outperform MLPs for piecewise polynomial functions with large numbers of discontinuities.

Overview

We consider regression over the set D = {(x_i, y_i)}_{i=1}^{N_data}, where x_i ∈ R^d and y_i = y(x_i) are point samples of a piecewise smooth function. Following successes in classification problems in high-dimensional spaces (Chollet 2017), deep neural networks (DNNs) have garnered tremendous interest as tools for regression problems and numerical analysis, partially due to their apparent ability to alleviate the curse of dimensionality in the presence of latent low-dimensional structure. This is in contrast to classical methods, for which the computational expense grows exponentially with d, a major challenge for the solution of high-dimensional PDEs (Bach 2017; Bengio and Bengio 2000; Han, Jentzen, and Weinan 2018).

Understanding the performance of DNNs requires accounting for both optimal approximation error and optimization error. While one may prove existence of DNN parameters providing exponential convergence with respect to architecture size, in practice a number of issues conspire to prevent realizing such convergence. Several approximation-theoretic works seek to understand the role of width/depth in the absence of optimization error (He et al. 2018; Daubechies et al. 2019; Yarotsky 2017, 2018; Opschoor, Petersen, and Schwab 2019). In particular, Yarotsky and Opschoor et al. prove the existence of parameters for a deep neural network architecture that approximate algebraic operations, partitions of unity (POUs), and polynomials to exponential accuracy in the depth of the network. This shows that sufficiently deep DNNs may in theory learn a spectrally convergent hp-element space without a hand-tailored mesh by constructing a POU to localize polynomial approximation. In practice, however, such convergent approximations are not realized when training DNNs using gradient descent optimizers, even for smooth target functions (Fokina and Oseledets 2019; Adcock and Dexter 2020). The thesis of this work is to incorporate the POU and polynomial elements directly into a deep learning architecture. Rather than attempting to force a DNN to simultaneously perform localization and high-order approximation, introducing localized polynomial spaces frees the DNN to play to its strengths by focusing exclusively on partitioning space, as in classification problems.
An attractive property of the proposed architecture is its amenability to a fast training strategy. In previous work, we developed an optimizer which alternates between a gradient descent update of hidden layer parameters and a globally optimal least-squares solve for a final linear layer (Cyr et al. 2019); this was applied as well to classification problems (Patel et al. 2020). A similar strategy is applied in the current work: updating the POU with gradient descent before finding a globally optimal polynomial fit at each iteration ensures an optimal representation of the data over the course of training.

While DNNs have been explored as a means of solving high-dimensional PDEs (Geist et al. 2020; Han, Jentzen, and Weinan 2018), optimization error prevents a practical demonstration of convergence with respect to the size of either data or model parameters (Beck, Jentzen, and Kuckuck 2019; Wang, Teng, and Perdikaris 2020). The relatively simple regression problem considered here provides a critical first example of how the optimization error barrier may be circumvented to provide accuracy competitive with finite element methods.

An abstract POU network

Consider a partition of unity Φ = {φ_α(x)}_{α=1}^{N_part} satisfying Σ_α φ_α(x) = 1 and φ_α(x) ≥ 0 for all x. We work with the approximant

    y_POU(x) = Σ_{α=1}^{N_part} φ_α(x) Σ_{β=1}^{dim(V)} c_{α,β} P_β(x),    (1)

where V = span{P_β}. For this work, we take V to be the space π_m(R^d) of polynomials of order m, while Φ is parametrized as a neural network with weights and biases ξ and output dimension N_part:

    φ_α(x; ξ) = [NN(x; ξ)]_α,    1 ≤ α ≤ N_part.    (2)

We consider two architectures for NN(x; ξ), to be specified later. Approximants of the form (1) allow a "soft" localization of the basis elements P_β to an implicit partition of space parametrized by the φ_α. While approximation with broken polynomial spaces corresponds to taking Φ to consist of characteristic functions on the cells of a computational mesh, the parametrization of Φ by a DNN generalizes more broadly to differentiable partitions of space. For applications of partitions of unity in numerical analysis, see Strouboulis, Copps, and Babuška (2001); Fries and Belytschko (2010); Wendland (2002); for their use in differential geometry, see Spivak (2018); Hebey (2000).

In a traditional numerical procedure, Φ is constructed prior to fitting the c_{α,β} to data through a geometric "meshing" process. We instead work with a POU Φ^ξ in the form of a DNN (2) in which the weights and biases ξ are trained to fit the data. We therefore fit both the localized basis coefficients c = [c_{α,β}] and the localization itself simultaneously by solving the optimization problem

    argmin_{ξ,c} Σ_{i∈D} ( Σ_{α=1}^{N_part} φ_α(x_i; ξ) Σ_{β=1}^{dim(V)} c_{α,β} P_β(x_i) − y_i )².    (3)
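To make (1)–(2) concrete, the following is a minimal Python/NumPy sketch of evaluating a POUnet approximant. The function and argument names (pou_approximant, partition_net, poly_basis) are illustrative assumptions rather than the authors' implementation; any network whose outputs are nonnegative and sum to one row-wise can play the role of Φ.

    import numpy as np

    def pou_approximant(x, partition_net, poly_basis, c):
        """Evaluate y_POU(x) = sum_a phi_a(x) * sum_b c[a, b] * P_b(x)  (Eq. (1)).

        x             : (N, d) array of evaluation points
        partition_net : callable, (N, d) -> (N, N_part) rows nonnegative, summing to 1
        poly_basis    : list of callables P_b, each mapping (N, d) -> (N,)
        c             : (N_part, dim_V) array of learnable polynomial coefficients
        """
        phi = partition_net(x)                                  # (N, N_part) partition of unity
        P = np.stack([Pb(x) for Pb in poly_basis], axis=1)      # (N, dim_V) polynomial features
        local_fits = P @ c.T                                    # (N, N_part) local polynomial values
        return np.sum(phi * local_fits, axis=1)                 # blend local fits with partition weights

Because (1) is linear in c once the partition values are fixed, fitting the coefficients reduces to a least-squares problem; this is the structure exploited by the fast optimizer described below.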
Error analysis and architecture design

Before specifying a choice of architecture for Φ^ξ in (2), we present a basic estimate of the optimal training error using the POUnet architecture to highlight desirable properties for the partition to have. We denote by diam(A) the diameter of a set A ⊂ R^d.

Theorem 1. Consider an approximant y_POU of the form (1) with V = π_m(R^d). If y(·) ∈ C^{m+1}(Ω) and ξ*, c* solve (3) to yield the approximant y*_POU, then

    ‖y*_POU − y‖_{ℓ2(D)} ≤ C_{m,y} (max_α diam supp(φ_α^ξ))^{m+1},    (4)

where ‖y*_POU − y‖_{ℓ2(D)} denotes the root-mean-square norm over the training data pairs in D,

    ‖y*_POU − y‖_{ℓ2(D)} = sqrt( (1/N_data) Σ_{(x,y)∈D} (y*_POU(x) − y(x))² ),

and C_{m,y} = ‖y‖_{C^{m+1}(Ω)}.

Proof. For each α, take q_α ∈ π_m(R^d) to be the mth-order Taylor polynomial of y(·) centered at any point of supp(φ_α^ξ). Then for all x ∈ supp(φ_α^ξ),

    |q_α(x) − y(x)| ≤ C_{m,y} (diam supp(φ_α^ξ))^{m+1}.    (5)

Define the approximant ỹ_POU(x) = Σ_{α=1}^{N_part} φ_α^ξ(x) q_α(x), which is of the form (1) and represented by a feasible (ξ, c). Then by the definition of y*_POU and (3), we have

    ‖y*_POU − y‖²_{ℓ2(D)} ≤ ‖ỹ_POU − y‖²_{ℓ2(D)}
                          = ‖ Σ_α φ_α^ξ(x) q_α(x) − y(x) Σ_α φ_α^ξ(x) ‖²_{ℓ2(D)}
                          = ‖ Σ_α φ_α^ξ(x) (q_α(x) − y(x)) ‖²_{ℓ2(D)}.

For each x = x_i ∈ D, if x ∈ supp(φ_α^ξ) we apply (5); otherwise the summand φ_α^ξ(x)(q_α(x) − y(x)) vanishes. So

    ‖y*_POU − y‖²_{ℓ2(D)} ≤ ‖ Σ_α C_{m,y} (diam supp(φ_α^ξ))^{m+1} φ_α^ξ(x) ‖²_{ℓ2(D)}
                          ≤ [ C_{m,y} (max_α diam supp(φ_α^ξ))^{m+1} ]² ‖ Σ_α φ_α^ξ(x) ‖²_{ℓ2(D)}
                          = [ C_{m,y} (max_α diam supp(φ_α^ξ))^{m+1} ]²,

since Σ_α φ_α^ξ = 1; taking square roots yields (4).

Theorem 1 indicates that for smooth y, the resulting convergence rate of the training error is independent of the specific choice of parameterization of the φ_α^ξ, and depends only upon their support. Further, the error does not explicitly depend upon the dimension of the problem; if the trained Φ^ξ encodes a covering by supp(φ_α^ξ) of a low-dimensional manifold containing the data locations x with latent dimension d_M ≪ d, then the approximation cost scales only with dim(V) and thus may break the curse of dimensionality; e.g., for linear polynomial approximation dim(V) = d + 1. If the parameterization is able to find compactly supported quasi-uniform partitions such that the maximum diameter in (4) scales as N_part^{−1/d_M}, the training loss will exhibit a scaling of N_part^{−(m+1)/d_M}.

The above analysis indicates the advantage of enforcing highly localized, compact support of the POU functions. However, in practice we found that for a POU parametrized by a shallow RBF-Net (described below), rapidly decaying but globally supported POU functions exhibited more consistent training results. This is likely due to a higher tolerance to initializations poorly placed with respect to the data locations. Similarly, when the POU is parametrized by a deep ResNet, we obtained good results using ReLU activation functions while training with a regularizer to promote localization (Algorithm 2). Thus, the properties playing a role in the above analysis, localization of the partition functions and distribution of their support over a latent manifold containing the data, are the motivation for the architectures and training strategies we consider below.

POU #1 - RBF-Net: A shallow RBF-network (Broomhead and Lowe 1988; Billings and Zheng 1995) implementation of Φ^ξ is given by (1) and

    φ_α(x) = exp(−|x − ξ_{1,α}|² / ξ_{2,α}²) / Σ_β exp(−|x − ξ_{1,β}|² / ξ_{2,β}²).

Here, ξ_1 denotes the RBF centers and ξ_2 denotes the RBF shape parameters, both of which evolve during training. A measure of the localization of these functions can be taken to be the magnitude of ξ_2. Such an architecture works well for approximation of smooth functions, but the C^∞ continuity of Φ^ξ causes difficulty in the approximation of piecewise smooth functions.
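As an illustration, a minimal sketch of this normalized-RBF partition follows, assuming plain NumPy arrays for the trainable centers ξ_1 and shape parameters ξ_2; the helper name and array layout are assumptions, not the authors' code.

    import numpy as np

    def rbf_partition(x, centers, shapes):
        """Normalized Gaussian RBF partition of unity (POU #1).

        x       : (N, d) evaluation points
        centers : (N_part, d) RBF centers (xi_1 in the text)
        shapes  : (N_part,)   RBF shape parameters (xi_2 in the text)
        returns : (N, N_part) rows that are nonnegative and sum to 1
        """
        sq_dist = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=-1)  # (N, N_part)
        logits = -sq_dist / shapes[None, :] ** 2
        # Subtracting the row-wise max before exponentiating improves numerical stability
        # without changing the normalized result.
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        return w / w.sum(axis=1, keepdims=True)

Written this way, the partition is a softmax over negative scaled squared distances, so it is everywhere positive and sums to one by construction, with decay rates controlled by the shape parameters.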
POU #2 - ResNet: We compose a residual network architecture (He et al. 2016) with a softmax layer S to define (2). For the experiments considered we use a ReLU activation, allowing representation of functions with discontinuities in their first derivative.

Initialization: All numerical experiments take data supported within the unit hypercube Ω ⊂ [0, 1]^d. To initialize the POU #1 architecture we use unit shape parameters and uniformly distribute the centers, ξ_1 ∼ U([0, 1]^d). We initialize POU #2 with the box initialization (Cyr et al. 2019). We found that these initializations provide an initial set of partitions sufficiently "well-distributed" throughout Ω for successful training.

Fast optimizer

The least-squares structure of (3) allows application of the least-squares gradient descent (LSGD) block coordinate descent strategy (Cyr et al. 2019). At each epoch, one may hold the hidden parameters ξ fixed to obtain a least-squares problem for the optimal polynomial coefficients c. The step may be concluded by then taking a step of gradient descent for the POU parameters, therefore evolving the partitions along the manifold corresponding to optimal representation of the data; for details see (Cyr et al. 2019).

Algorithm 1 with λ = 0 specifies the application of LSGD to Eqn. (3). We will demonstrate that, while effective, several of the learned partition functions φ_α may "collapse" to near-zero values everywhere. To remedy this, we will also consider a pre-training step in Algorithm 2, which adds an ℓ2 regularizer to the polynomial coefficients. The intuition behind this is that a given partition regresses data using an element of the form c_{α,β} φ_α P_β. If φ_α is scaled by a small δ > 0, the LSGD solver may pick up a scaling 1/δ for c_{α,β} and achieve the same approximation. Limiting the coefficients thus indirectly penalizes this mode of partition function collapse, promoting more quasi-uniform partitions of space.

Algorithm 1: The regularized least-squares gradient descent (LSGD) method. Setting λ = 0 recovers the original LSGD method (Cyr et al. 2019).

    Data: ξ_old, c_old
    Result: ξ_new, c_new
    Function LSGD(ξ_old, c_old, n_epoch, λ, ρ, n_stag):
        ξ, c ← ξ_old, c_old
        for i ∈ {1, ..., n_epoch} do
            c ← LS(ξ, λ)    // solve Eqn. (3) for c with a regularizer λ‖c‖²_F
            ξ ← GD(c)       // gradient descent step on the POU parameters
            if LSGD stagnates for more than n_stag epochs then
                λ ← ρλ
            end
        end
        c_new ← LS(ξ, 0)
        ξ_new ← ξ

Algorithm 2: The two-phase LSGD method with ℓ2-regularized least-squares solves.

    Data: ξ_old, c_old
    Result: ξ_new, c_new
    Function Two-phase-LSGD(ξ_old, c_old, n_epoch, n_pre_epoch, λ, ρ, n_stag):
        ξ, c ← LSGD(ξ_old, c_old, n_pre_epoch, λ, ρ, n_stag)    // Phase 1: LSGD with a regularizer
        ξ_new, c_new ← LSGD(ξ, c, n_epoch, 0, 0, n_epoch)       // Phase 2: LSGD without a regularizer
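A minimal NumPy sketch of one LSGD epoch under these definitions is given below. It assumes generic callables partition_net and grad_loss_xi for the POU network and its loss gradient (illustrative names, not the authors' TensorFlow implementation), and realizes the LS(ξ, λ) step of Algorithm 1 by solving regularized normal equations.

    import numpy as np

    def lsgd_step(x, y, xi, partition_net, poly_basis, lam, lr, grad_loss_xi):
        """One epoch of (regularized) LSGD: exact LS solve for c, then a GD step on xi.

        x, y          : (N, d), (N,) training data
        xi            : current POU parameters (any structure grad_loss_xi understands)
        partition_net : callable (x, xi) -> (N, N_part) partition values
        poly_basis    : list of callables P_b, each (N, d) -> (N,)
        lam           : Frobenius-norm regularizer weight on c
        grad_loss_xi  : callable (x, y, xi, c) -> gradient of the loss w.r.t. xi
        """
        phi = partition_net(x, xi)                                 # (N, N_part)
        P = np.stack([Pb(x) for Pb in poly_basis], axis=1)         # (N, dim_V)
        # Design matrix of the problem that is linear in c; columns indexed by (alpha, beta).
        A = (phi[:, :, None] * P[:, None, :]).reshape(len(x), -1)  # (N, N_part * dim_V)
        # Regularized normal equations: (A^T A + lam I) vec(c) = A^T y.
        n_cols = A.shape[1]
        c_vec = np.linalg.solve(A.T @ A + lam * np.eye(n_cols), A.T @ y)
        c = c_vec.reshape(phi.shape[1], P.shape[1])                # (N_part, dim_V)
        xi = xi - lr * grad_loss_xi(x, y, xi, c)                   # gradient descent on the POU
        return xi, c

Forming the normal equations is only the simplest choice here; a QR- or SVD-based least-squares solve would be preferable when the design matrix is ill-conditioned, which is consistent with the conditioning issues noted in the experiments below.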
Numerical experiments

In this section, we assess the performance of POUnets with the two POU architectures. We implement the proposed algorithms in Python, construct POUnets using TensorFlow 2.3.0 (Abadi et al. 2016), and employ the NumPy and SciPy packages for generating training data and solving the least-squares problem. For all considered neural networks, training is performed by batch gradient descent using the Adam optimizer (Kingma and Ba 2014). For a set of d-variate polynomials {P_β}, we choose truncated Taylor polynomials of maximal degree m_max, providing (m_max + d)! / (m_max! d!) polynomial coefficients per partition. The coefficients are initialized via sampling from the unit normal distribution.

Smooth functions

We consider an analytic function as our first benchmark: the sine function defined on a cross-shaped one-dimensional manifold embedded in the two-dimensional domain [−1, 1]²,

    y(x) = sin(2πx_1), if x_2 = 0,
           sin(2πx_2), if x_1 = 0.

We test RBF-Nets for a varying number of partitions, N_part = {1, 2, 4, 8, 16}, and maximal polynomial degrees {0, 1, 2, 3, 4}. For training, we collect data by uniformly sampling 501 values of each coordinate x_i, i = 1, 2, from [−1, 1], i.e., 1001 ((x_1, x_2), y(x))-pairs in total after removing the duplicate point at the origin. We initialize the centers of the RBF basis functions by sampling uniformly from the domain [−1, 1]² and initialize the shape parameters as ones. We then train the RBF-Nets using the LSGD method with λ = 0 (Algorithm 1), with the initial learning rate for Adam set to 10^−3. The maximum number of epochs n_epoch is set to 100, and we choose the centers and shape parameters that yield the best result on the training loss during the specified epochs.

Figure 1(a) reports the relative ℓ2-errors of approximants produced by POUnets for varying N_part and varying polynomial degree. The results are obtained from 10 independent runs for each value of N_part and degree, with a single log-normal standard deviation. Algebraic convergence is observed, of order increasing with polynomial degree, before saturating at 10^−10. We conjecture this is due to eventual loss of precision caused by the ill-conditioning of the least-squares matrix; however, we leave a formal study and treatment for future work.

For comparison with the performance achieved by POUnets, we assess the performance of standard MLPs with varying depth {4, 8, 12, 16, 20} and width {8, 16, 32, 64, 128}. The standard MLPs are trained using Adam with an initial learning rate of 10^−3 and a maximum number of epochs of 1000. As opposed to the convergence behavior of the RBF-Net-based POUnets, the standard MLP briefly exhibits roughly first-order convergence before stagnating around an error of 10^−2.

Figure 1: Relative ℓ2-errors (log-log scale) of approximants produced by POUnets with RBF-Net partition functions for varying N_part and varying m_max (left) and standard MLPs for varying width and depth (right).
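For concreteness, the cross-shaped training set described in this subsection can be generated as in the following sketch; equispaced sampling of each branch and the helper name are assumptions (the text's "uniformly sampling" could equally mean uniform random draws).

    import numpy as np

    def make_cross_data(n_per_branch=501):
        """Sample the sine benchmark on the cross {x2 = 0} U {x1 = 0} inside [-1, 1]^2."""
        t = np.linspace(-1.0, 1.0, n_per_branch)
        branch1 = np.stack([t, np.zeros_like(t)], axis=1)   # x2 = 0, y = sin(2*pi*x1)
        branch2 = np.stack([np.zeros_like(t), t], axis=1)   # x1 = 0, y = sin(2*pi*x2)
        x = np.concatenate([branch1, branch2], axis=0)
        y = np.where(x[:, 1] == 0.0,
                     np.sin(2 * np.pi * x[:, 0]),
                     np.sin(2 * np.pi * x[:, 1]))
        # Remove the duplicated origin so 2 * 501 samples become 1001 unique pairs.
        x, idx = np.unique(x, axis=0, return_index=True)
        return x, y[idx]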
Piecewise Smooth Functions

We next consider piecewise linear and piecewise quadratic functions: triangle waves with varying frequencies, i.e., y(x) = TRI(x; p), and their quadratic variants, y(x) = TRI²(x; p), where

    TRI(x; p) = 2 |px − ⌊px + 1/2⌋| − 1.

We study the introduction of increasingly many discontinuities by increasing the frequency parameter p. Reproduction of such sawtooth functions by ReLU networks via both wide networks (He et al. 2018) and very deep networks (Telgarsky 2016) can be achieved theoretically via construction of carefully chosen architectures and weights/biases, but to our knowledge has not been achieved via standard training.

The smoothness possessed by the RBF-Net partition functions (POU #1) precludes rapidly convergent approximation of piecewise smooth functions, and we instead employ ResNets (POU #2) for Φ^ξ in this case, as the piecewise linear nature of such partition functions is better suited. As a baseline for the performance comparison, we apply regression of the same data over a mean-square error using standard ResNets, y_MLP(x). The standard ResNets share the same architectural design and hyperparameters as the ResNets used to parametrize the POU functions in the POUnets. The only exception is the output layer: the standard ResNets produce a scalar-valued output, whereas the POU parametrization produces a vector-valued output, which is followed by the softmax activation.

We consider five frequency parameters for both target functions, p = {1, 2, 3, 4, 5}, which results in piecewise linear and quadratic functions with 2^p pieces. Based on the number of pieces in the target function, we scale the width of the baseline neural networks and POUnets as 4 × 2^p while fixing the depth as 8, and for POUnets the number of partitions is set as N_part = 2^p. For POUnets, we choose the maximal degree of polynomials to be m_max = 1 and m_max = 2 for the piecewise linear and quadratic target functions, respectively. Both neural networks are trained on the same data, {x_i, y(x_i; p)}_{i=1}^{n_data}, where the x_i are uniformly sampled from [0, 1] and n_data = 2000. The standard ResNets are trained using the gradient descent method and the POUnets are trained using the LSGD method with λ = 0 (Algorithm 1). The initial learning rate for Adam is set as 10^−3. The maximum number of epochs n_epoch is set as 2000, and we choose the neural network weights and biases that yield the best result on the training loss during the specified epochs.

Figure 2 reports the relative ℓ2-errors of approximants produced by the standard ResNets and POUnets for varying N_part and width for the piecewise linear functions (left) and the piecewise quadratic functions (right). The statistics are obtained from five independent runs. Figure 2 essentially shows that the POUnets outperform the standard ResNets in terms of approximation accuracy; specifically, the standard ResNets with the ReLU activation function largely fail to produce accurate approximations, while POUnets obtain < 1% error for large numbers of discontinuities.

Figure 2: Relative ℓ2-errors (log-log scale) of approximants produced by the standard ResNets and POUnets with ResNet partition functions for varying N_part for approximating the target functions with varying p.

Figure 3 illustrates snapshots of the target functions and approximants produced by the two neural networks for target functions with the frequency parameters p = {4, 5}. Figure 3 confirms the trends shown in Figure 2; the benefits of using POUnets are more pronounced in approximating functions with larger p and in approximating quadratic functions (and, potentially, in approximating higher-order polynomials). Figures 3(c)–3(f) clearly show that the POUnets accurately approximate the target functions with sub-1% errors, while the standard ResNets fail to produce accurate approximations.

Figure 3: Snapshots of target functions y(x) and approximants produced by ResNet and POUnet (i.e., y_POU(x)), depicted in black, light green, and orange, respectively, for N_part = 16 and N_part = 32. The target functions correspond to triangular waves (left) and their quadratic variants (right). The bottom row depicts the snapshots in the domain [0.5, 0.625].
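For reference, a short sketch of the piecewise target functions follows. It implements the TRI definition as written above, so the exact amplitude and frequency convention should be treated as an assumption rather than the authors' verbatim formula.

    import numpy as np

    def tri(x, p):
        """Triangle wave TRI(x; p) = 2*|p*x - floor(p*x + 1/2)| - 1, as reconstructed above."""
        return 2.0 * np.abs(p * x - np.floor(p * x + 0.5)) - 1.0

    def tri2(x, p):
        """Quadratic variant TRI^2(x; p): squaring keeps the breakpoints but makes each piece quadratic."""
        return tri(x, p) ** 2

    # Training data on [0, 1], mirroring the setup described in the text.
    rng = np.random.default_rng(0)          # seed chosen arbitrarily for reproducibility
    x_train = rng.uniform(0.0, 1.0, 2000)
    y_train = tri(x_train, p=3)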
Results of two-phase LSGD

We now demonstrate the effectiveness of the two-phase LSGD method (Algorithm 2) in constructing partitions which are nearly disjoint and localized according to the features in the data, i.e., the breakpoints in the target function, which we found leads to better approximation accuracy. The first phase of the algorithm aims to construct such partitions. To this end, we limit the expressibility of the polynomial regression by regularizing the Frobenius norm of the coefficients, ‖c‖²_F, in the least-squares solves. This prevents a small number of partitions from dominating and other partitions from collapsing, per our discussion of the fast optimizer above. These "quasi-uniform" partitions are then used as the initial guess for the unregularized second phase. Finally, we study the qualitative differences in the resulting POU, particularly the ability of the network to learn a disjoint partition of space. We stress, however, that there is no particular "correct" form for the resulting POU; this is primarily of interest because it shows the network may recover a traditional discontinuous polynomial approximation.

In the first phase, we employ a relatively larger learning rate in order to prevent the weights and biases from getting stuck in local minima. In the second phase, we employ a relatively smaller learning rate to ensure that the training error decreases.

Triangular waves. We demonstrate how the two-phase LSGD works in two example cases: the triangular waves with frequency parameters p = 1 and p = 3. For p = 1, we use the same POU network architecture described in the previous section (i.e., ResNet with width 8 and depth 8) and the same initialization scheme (i.e., the box initialization). We set the initial learning rates for phase 1 and phase 2 to 0.1 and 0.05, respectively, and set the other LSGD parameters as λ = 0.1, ρ = 0.9, and n_stag = 1000. Figure 4 (top) depicts both how the partitions are constructed during phase 1 (Figures 4(a)–4(d)) and snapshots of the partitions and the approximant at the 1000th epoch of phase 2 (Figures 4(e) and 4(f)). The approximation accuracy measured in the relative ℓ2-norm of the error reaches 6.2042 × 10^−8 after 12000 epochs in phase 2.

Next we consider the triangular wave with frequency parameter p = 3. We use a POU network architecture constructed as a ResNet with width 8 and depth 10, and use the box initialization. We set the initial learning rates for phase 1 and phase 2 to 0.05 and 0.01, respectively, and set the other LSGD parameters as λ = 0.1, ρ = 0.999, and n_stag = 1000. Figure 4 (bottom) again depicts how the partitions evolve during phase 1 and the resulting approximant. Figures 4(g)–4(j) show that the two-phase LSGD constructs partitions that are disjoint and localized according to the features during the first phase.

Figure 4: Triangular wave with two pieces (top) and triangular wave with eight pieces (bottom): Phase 1 LSGD constructs localized disjoint partitions (4(a)–4(d), snapshots at epochs 0 through 60, and 4(g)–4(j), snapshots at epochs 0 through 90000) and Phase 2 LSGD produces an accurate approximation.
Quadratic waves. Finally, we approximate the piecewise quadratic waves with frequency parameters p = 1 and p = 3 while employing the same network architectures used for approximating the triangular waves. For the p = 1 case the learning rates are set as 0.5 and 0.25 for phase 1 and phase 2, and we use the same remaining parameter settings (i.e., λ = 0.1, ρ = 0.9, and n_stag = 1000) as in the previous experiment. Again, disjoint partitions are constructed in phase 1, and an accurate approximation is produced in phase 2 (Figures 5(a)–5(f)). Moreover, we observe that the partitions are further refined during phase 2. For the p = 3 case we again employ the architecture used for the triangular wave with p = 3 (i.e., ResNet with width 8 and depth 10) and the same hyperparameters (i.e., 0.05 and 0.01 for the learning rates, and n_stag = 1000), with two exceptions: the regularization weight λ = 1 and ρ = 0.999. Figures 5(g)–5(k) illustrate that the phase 1 LSGD constructs disjoint supports for the partitions, but takes many more epochs. Again, the two-phase LSGD produces an accurate approximation (Figure 5(l)).

Figure 5: Quadratic wave with two pieces (top) and quadratic wave with eight pieces (bottom): Phase 1 LSGD constructs localized disjoint partitions (5(a)–5(d), snapshots at epochs 0 through 9000, and 5(g)–5(k), snapshots at epochs 0 through 500000) and Phase 2 LSGD produces an accurate approximation.

Discussion. The results of this section demonstrate that it is important to apply a good regularizer during a pre-training stage to obtain approximately disjoint, piecewise constant partitions. For the piecewise polynomial functions considered here, such partitions allowed in some cases recovery to near-machine precision. Given the abstract error analysis, it is clear that such localized partitions, which adapt to the features of the target function, act similarly to traditional hp-approximation. There are several challenges regarding this approach, however: the strong regularization during phase 1 requires a large number of training epochs, and we were unable to obtain a set of hyperparameters which provides such clean partitions across all cases. This suggests two potential areas of future work. Firstly, an improved means of regularizing during pretraining that does not compromise accuracy may be able to train faster; deep learning strategies for parametrizing charts as in (Schonsheck, Chen, and Lai 2019) may be useful for this. Secondly, it may be fruitful to understand the approximation error under assumptions less restrictive than that the partitions are compactly supported or highly localized. We have been guided by the idea that polynomials defined on indicator-function partitions reproduce the constructions used by the finite element community, but a less restrictive paradigm combined with an effective learning strategy may prove more flexible for general datasets.

Conclusions

Borrowing ideas from numerical analysis, we have demonstrated a novel architecture and optimization strategy whose computational complexity need not scale exponentially with the ambient dimension and which provides high-order convergence for smooth functions and errors consistently under 1% for piecewise smooth problems. Such an architecture has the potential to provide DNN methods for solving high-dimensional PDEs that converge in a manner competitive with traditional finite element spaces.
Acknowledgements

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003530. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND Number: SAND2020-6022 J.

The work of M. Gulian, R. Patel, and N. Trask is supported by the U.S. Department of Energy, Office of Advanced Scientific Computing Research, under the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems (PhILMs) project. E. C. Cyr and N. Trask are supported by the Department of Energy early career program. M. Gulian is supported by the John von Neumann fellowship at Sandia National Laboratories.
References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Adcock, B.; and Dexter, N. 2020. The gap between theory and practice in function approximation with deep neural networks. arXiv preprint arXiv:2001.07523.

Bach, F. 2017. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research 18(1): 629–681.

Beck, C.; Jentzen, A.; and Kuckuck, B. 2019. Full error analysis for the training of deep neural networks. arXiv preprint arXiv:1910.00121.

Bengio, S.; and Bengio, Y. 2000. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks 11(3): 550–557.

Billings, S. A.; and Zheng, G. L. 1995. Radial basis function network configuration using genetic algorithms. Neural Networks 8(6): 877–890.

Broomhead, D. S.; and Lowe, D. 1988. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom).

Chollet, F. 2017. Deep Learning with Python. Manning Publications Company.

Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2019. Robust training and initialization of deep neural networks: An adaptive basis viewpoint. arXiv preprint arXiv:1912.04862.

Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; and Petrova, G. 2019. Nonlinear approximation and (deep) ReLU networks. arXiv preprint arXiv:1905.02199.

Fokina, D.; and Oseledets, I. 2019. Growing axons: greedy learning of neural networks with application to function approximation. arXiv preprint arXiv:1910.12686.

Fries, T.-P.; and Belytschko, T. 2010. The extended/generalized finite element method: an overview of the method and its applications. International Journal for Numerical Methods in Engineering 84(3): 253–304.

Geist, M.; Petersen, P.; Raslan, M.; Schneider, R.; and Kutyniok, G. 2020. Numerical solution of the parametric diffusion equation by deep neural networks. arXiv preprint arXiv:2004.12131.

Han, J.; Jentzen, A.; and Weinan, E. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences 115(34): 8505–8510.

He, J.; Li, L.; Xu, J.; and Zheng, C. 2018. ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hebey, E. 2000. Nonlinear Analysis on Manifolds: Sobolev Spaces and Inequalities, volume 5. American Mathematical Society.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Opschoor, J. A.; Petersen, P.; and Schwab, C. 2019. Deep ReLU networks and high-order finite element methods. SAM, ETH Zürich.

Patel, R. G.; Trask, N. A.; Gulian, M. A.; and Cyr, E. C. 2020. A block coordinate descent optimizer for classification problems exploiting convexity. arXiv preprint arXiv:2006.10123.

Schonsheck, S.; Chen, J.; and Lai, R. 2019. Chart Auto-Encoders for Manifold Structured Data. arXiv preprint arXiv:1912.10094.

Spivak, M. 2018. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. CRC Press.

Strouboulis, T.; Copps, K.; and Babuška, I. 2001. The generalized finite element method. Computer Methods in Applied Mechanics and Engineering 190(32-33): 4081–4193.

Telgarsky, M. 2016. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.

Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.

Wendland, H. 2002. Fast evaluation of radial basis functions: Methods based on partition of unity. In Approximation Theory X: Wavelets, Splines, and Applications. Citeseer.

Yarotsky, D. 2017. Error bounds for approximations with deep ReLU networks. Neural Networks 94: 103–114.

Yarotsky, D. 2018. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620.