<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Partition of unity networks: deep hp-approximation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kookjin Lee</string-name>
          <email>koolee@sandia.gov</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nathaniel A. Trask</string-name>
          <email>natrask@sandia.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ravi G. Patel</string-name>
          <email>rgpatel@sandia.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mamikon A. Gulian</string-name>
          <email>mgulian@sandia.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric C. Cyr</string-name>
          <email>eccyr@sandia.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computing Research, Sandia National Laboratories Albuquerque</institution>
          ,
          <addr-line>New Mexico, 87123</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Extreme Scale Data Science &amp; Analytics Department</institution>
          ,
          <addr-line>Sandia National Laboratories</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Approximation theorists have established best-in-class optimal approximation rates of deep neural networks by utilizing their ability to simultaneously emulate partitions of unity and monomials. Motivated by this, we propose partition of unity networks (POUnets) which incorporate these elements directly into the architecture. Classification architectures of the type used to learn probability measures are used to build a meshfree partition of space, while polynomial spaces with learnable coefficients are associated to each partition. The resulting hp-element-like approximation allows use of a fast least-squares optimizer, and the resulting architecture size need not scale exponentially with spatial dimension, breaking the curse of dimensionality. An abstract approximation result establishes desirable properties to guide network design. Numerical results for two choices of architecture demonstrate that POUnets yield hp-convergence for smooth functions and consistently outperform MLPs for piecewise polynomial functions with large numbers of discontinuities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Overview</title>
      <p>
We consider regression over the set D = {(x_i, y_i)}_{i=1}^{N_data},
where x_i ∈ R^d and y_i = y(x_i) are point samples of a
piecewise smooth function. Following successes in classification
problems in high-dimensional spaces
        <xref ref-type="bibr" rid="ref10">(Chollet 2017)</xref>
        , deep
neural networks (DNNs) have garnered tremendous
interest as tools for regression problems and numerical analysis,
partially due to their apparent ability to alleviate the curse
of dimensionality in the presence of latent low-dimensional
structure. This is in contrast to classical methods for which
the computational expense grows exponentially with d, a
major challenge for solution of high-dimensional PDEs
        <xref ref-type="bibr" rid="ref16 ref4 ref6">(Bach
2017; Bengio and Bengio 2000; Han, Jentzen, and Weinan
2018)</xref>
        .
      </p>
      <p>
        Understanding the performance of DNNs requires
accounting for both optimal approximation error and optimization
error. While one may prove existence of DNN parameters
providing exponential convergence with respect to
architecture size, in practice a number of issues conspire to prevent
realizing such convergence. Several approximation theoretic
works seek to understand the role of width/depth in the
absence of optimization error
        <xref ref-type="bibr" rid="ref12 ref13 ref18 ref22 ref32 ref33">(He et al. 2018; Daubechies et al.
2019; Yarotsky 2017, 2018; Opschoor, Petersen, and Schwab
2019)</xref>
        . In particular, Yarotsky and Opschoor et al. prove the
existence of parameters for a deep neural network
architecture that approximate algebraic operations, partitions of
unity (POUs), and polynomials to exponential accuracy in the
depth of the network. This shows that sufficiently deep DNNs
may in theory learn a spectrally convergent hp-element space
without a hand-tailored mesh by constructing a POU to
localize polynomial approximation. In practice, however, such
convergent approximations are not realized when training
DNNs using gradient descent optimizers, even for smooth
target functions
        <xref ref-type="bibr" rid="ref13 ref3">(Fokina and Oseledets 2019; Adcock and
Dexter 2020)</xref>
        . The thesis of this work is to incorporate the
POU and polynomial elements directly into a deep learning
architecture. Rather than attempting to force a DNN to
simultaneously perform localization and high-order approximation,
introducing localized polynomial spaces frees the DNN to
play to its strengths by focusing exclusively on partitioning
space, as in classification problems.
      </p>
      <p>
        An attractive property of the proposed architecture is its
amenability to a fast training strategy. In previous work, we
developed an optimizer which alternates between a gradient
descent update of hidden layer parameters and a globally
optimal least squares solve for a final linear layer
        <xref ref-type="bibr" rid="ref11">(Cyr et al.
2019)</xref>
        ; this was applied as well to classification problems
        <xref ref-type="bibr" rid="ref23">(Patel et al. 2020)</xref>
        . A similar strategy is applied in the current
work: updating the POU with gradient descent before finding
a globally optimal polynomial fit at each iteration ensures an
optimal representation of data over the course of training.
      </p>
      <p>
        While DNNs have been explored as a means of solving
high-dimensional PDEs
        <xref ref-type="bibr" rid="ref15 ref16">(Geist et al. 2020; Han, Jentzen, and
Weinan 2018)</xref>
        , optimization error prevents a practical
demonstration of convergence with respect to size of either data or
model parameters
        <xref ref-type="bibr" rid="ref13 ref3 ref30 ref5">(Beck, Jentzen, and Kuckuck 2019; Wang,
Teng, and Perdikaris 2020)</xref>
        . The relatively simple regression
problem considered here provides a critical first example of
how the optimization error barrier may be circumvented to
provide accuracy competitive with finite element methods.
      </p>
    </sec>
    <sec id="sec-2">
      <title>An abstract POU network</title>
      <p>Consider a partition of unity Φ(x) = {φ_α(x)}_{α=1}^{N_part} satisfying Σ_α φ_α(x) = 1 and φ_α(x) ≥ 0 for all x. We work with the approximant

y_POU(x) = Σ_{α=1}^{N_part} φ_α(x) Σ_{β=1}^{dim(V)} c_{α,β} P_β(x),   (1)

where V = span{P_β}. For this work, we take V to be the space π_m(R^d) of polynomials of order m, while Φ is parametrized as a neural network with weights and biases ξ and output dimension N_part:

φ_α(x; ξ) = [NN(x; ξ)]_α,   1 ≤ α ≤ N_part.   (2)</p>
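In code, the approximant (1) is a weighted combination of local polynomial fits, with the weights forming a partition of unity. The following is a minimal NumPy sketch in one dimension, assuming normalized-Gaussian partition functions of the kind used for POU #1 below; all names here are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def pou_approximant(x, centers, shapes, coeffs):
    """Evaluate y_POU(x) = sum_a phi_a(x) * sum_b c_ab * x^b  (Eqn. (1), 1-D).

    x       : (n,) sample locations
    centers : (N_part,) hypothetical RBF centers
    shapes  : (N_part,) hypothetical RBF shape parameters
    coeffs  : (N_part, m+1) polynomial coefficients c_ab
    """
    # Normalized Gaussian bumps form a partition of unity: each row sums to 1.
    logits = -shapes[None, :] * (x[:, None] - centers[None, :]) ** 2
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    phi = np.exp(logits)
    phi /= phi.sum(axis=1, keepdims=True)            # (n, N_part)

    # Local polynomial values P_b(x) = x^b, b = 0..m.
    m = coeffs.shape[1] - 1
    P = x[:, None] ** np.arange(m + 1)[None, :]      # (n, m+1)
    local = P @ coeffs.T                             # (n, N_part)
    return (phi * local).sum(axis=1)

x = np.linspace(0.0, 1.0, 5)
# With identical local polynomials 1 + x on every partition, the POU
# property forces y_POU(x) = 1 + x regardless of the partition shapes.
y = pou_approximant(x, centers=np.array([0.25, 0.75]),
                    shapes=np.array([1.0, 1.0]),
                    coeffs=np.ones((2, 2)))
```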
      <p>
        We consider two architectures for NN(x; ξ), to be specified
later. Approximants of the form (1) allow a “soft”
localization of the basis elements P_β to an implicit partition of space
parametrized by the φ_α. While approximation with broken
polynomial spaces corresponds to taking Φ to consist of
characteristic functions on the cells of a computational mesh, the
parametrization of Φ by a DNN generalizes more broadly to
differentiable partitions of space. For applications of
partitions of unity in numerical analysis, see
        <xref ref-type="bibr" rid="ref27">Strouboulis, Copps,
and Babuška (2001</xref>
        );
        <xref ref-type="bibr" rid="ref14">Fries and Belytschko (2010)</xref>
        ;
        <xref ref-type="bibr" rid="ref31">Wendland (2002)</xref>
        ; for their use in differential geometry, see
        <xref ref-type="bibr" rid="ref26">Spivak
(2018)</xref>
        ;
        <xref ref-type="bibr" rid="ref20">Hebey (2000)</xref>
        .
      </p>
      <p>In a traditional numerical procedure, Φ is constructed prior
to fitting c_{α,β} to data through a geometric “meshing” process.
We instead work with a POU in the form of a DNN (2)
whose weights and biases ξ are trained to fit the
data. We therefore fit both the localized basis coefficients c =
[c_{α,β}] and the localization itself simultaneously by solving
the optimization problem

argmin_{ξ,c} Σ_{i∈D} ( Σ_{α=1}^{N_part} φ_α(x_i; ξ) Σ_{β=1}^{dim(V)} c_{α,β} P_β(x_i) − y_i )²  =  argmin_{ξ,c} ‖y_POU − y‖²_{ℓ²(D)}.   (3)</p>
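Because (3) is linear in c once the partition parameters ξ are frozen, the coefficient fit reduces to a single linear least-squares solve. A sketch of that inner solve in one dimension follows; the design-matrix layout and the optional ridge term are our illustration, not the authors' code.

```python
import numpy as np

def fit_coeffs(x, y, phi, m, reg=0.0):
    """Globally optimal c for fixed partition values phi (n, N_part).

    Solves min_c ||A c - y||^2 + reg * ||c||^2, where the column
    indexed by (a, b) of A holds phi_a(x_i) * x_i^b  (Eqn. (3) with
    xi frozen, monomial basis P_b(x) = x^b).
    """
    n, n_part = phi.shape
    P = x[:, None] ** np.arange(m + 1)[None, :]              # (n, m+1)
    A = (phi[:, :, None] * P[:, None, :]).reshape(n, -1)     # (n, N_part*(m+1))
    # Ridge-regularized normal equations; reg=0 gives plain least squares.
    AtA = A.T @ A + reg * np.eye(A.shape[1])
    c = np.linalg.solve(AtA, A.T @ y)
    return c.reshape(n_part, m + 1)

# Sanity check: a single trivial partition recovers a linear target exactly.
x = np.linspace(0.0, 1.0, 10)
c = fit_coeffs(x, 2.0 + 3.0 * x, np.ones((10, 1)), m=1)
```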
      <p>Error analysis and architecture design. Before specifying a choice of architecture for Φ in (2), we
present a basic estimate of the optimal training error using
the POU-Net architecture to highlight desirable properties
for the partition to have. We denote by diam(A) the diameter
of a set A ⊂ R^d.</p>
      <p>Theorem 1. Consider an approximant y_POU of the form (1)
with V = π_m(R^d). If y(·) ∈ C^{m+1}(Ω) and ξ, c solve (3)
to yield the approximant y_POU, then

‖y_POU − y‖_{ℓ²(D)} ≤ C_{m,y} ( max_α diam(supp φ_α) )^{m+1},   (4)

where ‖y_POU − y‖_{ℓ²(D)} denotes the root-mean-square norm
over the training data pairs in D,

‖y_POU − y‖_{ℓ²(D)} = √( (1/N_data) Σ_{(x,y)∈D} (y_POU(x) − y(x))² ),   C_{m,y} = ‖y‖_{C^{m+1}(Ω)}.</p>
      <p>Proof. For each α, take q_α ∈ π_m(R^d) to be the mth-order
Taylor polynomial of y(·) centered at any point of supp(φ_α).
Then for all x ∈ supp(φ_α),

|q_α(x) − y(x)| ≤ C_{m,y} diam(supp φ_α)^{m+1}.   (5)

Define the approximant ỹ_POU = Σ_{α=1}^{N_part} φ_α(x) q_α(x), which
is of the form (1) and represented by feasible (ξ, c). Then by
definition of y_POU and (3), we have

‖y_POU − y‖²_{ℓ²(D)} ≤ ‖ỹ_POU − y‖²_{ℓ²(D)}.

Since Σ_α φ_α = 1,

ỹ_POU(x) − y(x) = Σ_{α=1}^{N_part} φ_α(x) q_α(x) − y(x) Σ_{α=1}^{N_part} φ_α(x) = Σ_{α=1}^{N_part} φ_α(x) (q_α(x) − y(x)).

For each x = x_i ∈ D, if x ∈ supp(φ_α), then we apply (5);
otherwise, the summand φ_α(x)(q_α(x) − y(x)) vanishes. So

‖y_POU − y‖²_{ℓ²(D)} ≤ ‖ Σ_{α=1}^{N_part} φ_α C_{m,y} (diam supp φ_α)^{m+1} ‖²_{ℓ²(D)} ≤ ( C_{m,y} (max_α diam supp φ_α)^{m+1} )² ‖ Σ_{α=1}^{N_part} φ_α ‖²_{ℓ²(D)} = ( C_{m,y} (max_α diam supp φ_α)^{m+1} )².</p>
      <sec id="sec-2-2">
        <title>Implications for network design</title>
        <p>Theorem 1 indicates that for smooth y, the resulting
convergence rate of the training error is independent of the specific
choice of parameterization of the φ_α, and depends only upon
their support. Further, the error does not explicitly depend
upon the dimension of the problem; if the trained Φ encodes
a covering by supp(φ_α) of a low-dimensional manifold
containing the data locations x with latent dimension d_M ≪ d,
then the approximation cost scales only with dim(V) and thus
may break the curse of dimensionality; e.g., for linear
polynomial approximation dim(V) = d + 1. If the parameterization
is able to find compactly supported quasi-uniform partitions
such that the maximum diameter in (4) scales as N_part^(−1/d_M),
the training loss will exhibit a scaling of N_part^(−(m+1)/d_M).</p>
        <p>
          The above analysis indicates the advantage of enforcing
highly localized, compact support of the POU functions.
However, in practice we found that for a POU parametrized
by a shallow RBF-Net (described below), rapidly
decaying but globally supported POU functions exhibited
more consistent training results. This is likely due to a higher
tolerance to initializations poorly placed with respect to the
data locations. Similarly, when the POU is parametrized
by a deep ResNet, we obtained good results using ReLU
activation functions while training with a regularizer to
promote localization (Algorithm 2). Thus, the properties
playing a role in the above analysis – localization of the
partition functions and distribution of their support over a
latent manifold containing the data – are the motivation for
the architectures and training strategies we consider below.
POU #1 - RBF-Net: A shallow RBF-network
          <xref ref-type="bibr" rid="ref8 ref9">(Broomhead
and Lowe 1988; Billings and Zheng 1995)</xref>
          implementation
of Φ is given by (1) and

φ_α(x; ξ) = exp(−ξ2_α ‖x − ξ1_α‖²) / Σ_β exp(−ξ2_β ‖x − ξ1_β‖²).
        </p>
        <p>Here, ξ1 denotes the RBF centers and ξ2 denotes the RBF shape
parameters, both of which evolve during training. A measure
of the localization of these functions can be taken to be
the magnitude of ξ2. Such an architecture works well for
approximation of smooth functions, but the C^∞ continuity
of Φ causes difficulty in the approximation of piecewise
smooth functions.</p>
        <p>
          POU #2 - ResNet: We compose a residual network
architecture
          <xref ref-type="bibr" rid="ref19">(He et al. 2016)</xref>
          with a softmax layer S to
define Φ in (2). For the experiments considered we use a ReLU
activation, allowing representation of functions with
discontinuities in their first derivative.
        </p>
        <p>
          Initialization: All numerical experiments take data
supported within the unit hypercube [0, 1]^d. To initialize
the POU #1 architecture we use unit shape parameters and
uniformly distribute centers ξ1 ∼ U([0, 1]^d). We initialize POU
#2 with the Box initialization
          <xref ref-type="bibr" rid="ref11">(Cyr et al. 2019)</xref>
          . We found
that these initializations provide an initial set of partitions
sufficiently “well-distributed” throughout for successful
training.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Fast optimizer</title>
      <p>
        The least-squares structure of (3) allows application of the
least-squares gradient descent (LSGD) block coordinate
descent strategy
        <xref ref-type="bibr" rid="ref11">(Cyr et al. 2019)</xref>
        . At each epoch, one may hold
the hidden parameters fixed to obtain a least squares
problem for the optimal polynomial coefficients c. The step may
be concluded by then taking a step of gradient descent for
the POU parameters, therefore evolving the partitions along
the manifold corresponding to optimal representation of data;
for details see
        <xref ref-type="bibr" rid="ref11">(Cyr et al. 2019)</xref>
        .
      </p>
      <p>Algorithm 1 with λ = 0 specifies the application of LSGD
to Eqn. (3). We will demonstrate that while effective, several
of the learned partition functions may “collapse” to
near-zero values everywhere. To remedy this, we will also consider
a pre-training step in Algorithm 2, which adds an ℓ2
regularizer to the polynomial coefficients. The intuition behind this
is that a given partition regresses data using an element of the
form c_{α,β} φ_α P_β. If φ_α is scaled by a small ε &gt; 0, the LSGD
solver may pick up a scaling 1/ε for c_{α,β} and achieve the
same approximation. Limiting the coefficients thus indirectly
penalizes this mode of partition function collapse, promoting
more quasi-uniform partitions of space.</p>
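This rescaling invariance can be checked numerically: multiplying one partition function by a small ε while inflating its coefficients by 1/ε leaves the approximant unchanged, which is why only a penalty on ‖c‖ rules the collapse mode out. A toy check with hypothetical values:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
phi = np.stack([1 - x, x], axis=1)          # a simple partition of unity on [0, 1]
c = np.array([[1.0, 2.0], [3.0, -1.0]])     # (N_part, m+1) coefficients
P = x[:, None] ** np.arange(2)[None, :]     # monomials 1, x

def evaluate(phi, c):
    """y_POU for given partition values and coefficients (Eqn. (1))."""
    return (phi * (P @ c.T)).sum(axis=1)

y1 = evaluate(phi, c)

# Collapse mode: scale partition 0 by eps and its coefficients by 1/eps.
# The product is unchanged even though the "partition" has degenerated.
eps = 1e-6
phi_collapsed = phi * np.array([eps, 1.0])
c_rescaled = c * np.array([[1.0 / eps], [1.0]])
y2 = evaluate(phi_collapsed, c_rescaled)
```

The regularizer λ‖c‖²_F breaks this symmetry: the rescaled coefficients have a much larger Frobenius norm, so the regularized objective prefers the non-collapsed representation.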
      <sec id="sec-3-1">
        <title>Algorithm 1</title>
        <p>Data: ξ_old, c_old. Result: ξ_new, c_new.
Function LSGD(ξ_old, c_old, n_epoch, ε, λ, n_stag):
    ξ, c ← ξ_old, c_old
    for i ∈ {1, ..., n_epoch} do
        c ← LS(ξ, λ)   // solve Eqn. (1) with a regularizer λ‖c‖²_F
        ξ ← GD(c)      // gradient descent step on the POU parameters
        if LSGD stagnates for more than n_stag epochs then
            break
        end
    end
    c_new ← LS(ξ, 0)
    ξ_new ← ξ</p>
        <p>
          Algorithm 1: The regularized least-squares gradient
descent (LSGD) method. Setting λ = 0 recovers the
original LSGD method
          <xref ref-type="bibr" rid="ref11">(Cyr et al. 2019)</xref>
          .
        </p>
      </sec>
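A self-contained sketch of the LSGD alternation for POU #1 follows, with a finite-difference gradient standing in for the Adam step on the POU parameters and no stagnation test; this illustrates the alternation only and is not the authors' implementation.

```python
import numpy as np

def softmax_rbf(x, centers, shapes):
    """POU #1-style normalized Gaussian partition functions, shape (n, N_part)."""
    logits = -shapes[None, :] * (x[:, None] - centers[None, :]) ** 2
    phi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return phi / phi.sum(axis=1, keepdims=True)

def ls_solve(x, y, phi, m, lam):
    """Globally optimal polynomial coefficients for fixed partitions (ridge lam)."""
    P = x[:, None] ** np.arange(m + 1)[None, :]
    A = (phi[:, :, None] * P[:, None, :]).reshape(len(x), -1)
    if lam > 0.0:
        c = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    else:
        c = np.linalg.lstsq(A, y, rcond=None)[0]
    return c, A

def lsgd(x, y, centers, shapes, m=1, lam=0.0, lr=0.05, n_epoch=200):
    """Alternate a global LS solve for c with a crude gradient step on the centers."""
    h = 1e-6
    for _ in range(n_epoch):
        c, A = ls_solve(x, y, softmax_rbf(x, centers, shapes), m, lam)
        loss = np.mean((A @ c - y) ** 2)
        grad = np.zeros_like(centers)
        for j in range(len(centers)):   # finite-difference gradient, centers only
            cpert = centers.copy()
            cpert[j] += h
            _, Ap = ls_solve(x, y, softmax_rbf(x, cpert, shapes), m, lam)
            grad[j] = (np.mean((Ap @ c - y) ** 2) - loss) / h
        centers = centers - lr * grad
    # Final unregularized solve, as in the last step of Algorithm 1.
    c, A = ls_solve(x, y, softmax_rbf(x, centers, shapes), m, 0.0)
    return centers, c, np.mean((A @ c - y) ** 2)

# y = x^2 is exactly representable with m = 2, so the loss should be near zero.
xs = np.linspace(0.0, 1.0, 50)
centers, c, loss = lsgd(xs, xs ** 2, np.array([0.2, 0.8]),
                        np.array([1.0, 1.0]), m=2, n_epoch=5)
```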
      <sec id="sec-3-2">
        <title>Algorithm 2</title>
        <p>Data: ξ_old, c_old. Result: ξ_new, c_new.
Function Two-phase-LSGD(ξ_old, c_old, n_epoch, n_epoch_pre, ε, λ, n_stag):
    ξ, c ← LSGD(ξ_old, c_old, n_epoch_pre, ε, λ, n_stag)   // Phase 1: LSGD with a regularizer
    ξ_new, c_new ← LSGD(ξ, c, n_epoch, 0, 0, n_epoch)      // Phase 2: LSGD without a regularizer</p>
        <p>Algorithm 2: The two-phase LSGD method with the
ℓ2-regularized least-squares solves.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Numerical experiments</title>
      <p>
        In this section, we assess the performance of POUnets with
the two POU architectures. We implement the proposed
algorithms in PYTHON, construct POUnets using TENSORFLOW
2.3.0
        <xref ref-type="bibr" rid="ref1">(Abadi et al. 2016)</xref>
        , and employ NUMPY and SCIPY
packages for generating training data and solving the least
squares problem. For all considered neural networks,
training is performed by batch gradient descent using the Adam
optimizer
        <xref ref-type="bibr" rid="ref21">(Kingma and Ba 2014)</xref>
          . For a set of d-variate
polynomials {P_β}, we choose truncated Taylor polynomials of
maximal degree m_max, providing (m_max + d)!/(m_max! d!) polynomial
coefficients per partition. The coefficients are initialized via
sampling from the unit normal distribution.
      </p>
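The coefficient count quoted above is the dimension of the space of d-variate polynomials of total degree at most m_max, i.e. the binomial coefficient C(m_max + d, d):

```python
from math import comb

def n_coeffs(m_max, d):
    """(m_max + d)! / (m_max! d!) polynomial coefficients per partition."""
    return comb(m_max + d, d)
```

For linear polynomials (m_max = 1) this gives d + 1 coefficients per partition, matching the dim(V) = d + 1 noted in the error analysis.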
      <sec id="sec-4-1">
        <title>Smooth functions</title>
        <p>We consider an analytic function as our first benchmark,
specifically the sine function defined on a cross-shaped
one-dimensional manifold embedded in the 2-dimensional domain
[−1, 1]²:

y(x) = sin(2πx_1) if x_2 = 0;   y(x) = sin(2πx_2) if x_1 = 0.</p>
        <p>We test RBF-Nets for a varying number of partitions,
N_part ∈ {1, 2, 4, 8, 16}, and maximal polynomial degrees
{0, 1, 2, 3, 4}. For training, we collect data by
uniformly sampling 501 points from [−1, 1] along each coordinate
axis, i.e., in total, 1001 {(x_1, x_2), y(x)}-pairs after removing
the duplicate point at the origin. We initialize centers of the
RBF basis functions by sampling uniformly from the domain
[−1, 1]² and initialize shape parameters as ones. We then
train RBF-Nets by using the LSGD method with λ = 0
(Algorithm 1) with the initial learning rate for Adam set to 10⁻³.
The maximum number of epochs n_epoch is set as 100, and we
choose the centers and shape parameters that yield the best
result on the training loss during the specified epochs.</p>
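The cross-manifold training set described above (501 uniform samples per arm, duplicated origin removed) might be generated as follows; the exact sampling details are our assumption:

```python
import numpy as np

t = np.arange(-250, 251) / 250.0                 # 501 uniformly spaced samples in [-1, 1]
arm1 = np.stack([t, np.zeros_like(t)], axis=1)   # the x2 = 0 arm: y = sin(2*pi*x1)
arm2 = np.stack([np.zeros_like(t), t], axis=1)   # the x1 = 0 arm: y = sin(2*pi*x2)
X = np.concatenate([arm1, arm2[t != 0.0]])       # drop the duplicated origin -> 1001 points
y = np.sin(2.0 * np.pi * np.where(X[:, 1] == 0.0, X[:, 0], X[:, 1]))
```

Although the points live in R², the data concentrate on a one-dimensional manifold (d_M = 1), the setting in which Theorem 1 suggests POUnets can sidestep the ambient dimension.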
        <p>Figure 1(a) reports the relative ℓ2-errors of approximants
produced by POUnets for varying N_part and varying p. The
results are obtained from 10 independent runs for each value
of N_part and p, with a single log-normal standard deviation.
Algebraic convergence is observed of order increasing with
polynomial degree before saturating at 10⁻¹⁰. We
conjecture this is due to eventual loss of precision due to the
ill-conditioning of the least squares matrix; however, we leave a
formal study and treatment for a future work.</p>
        <p>In comparison to the performance achieved by POUnets,
we assess the performance of standard MLPs with
varying depth {4, 8, 12, 16, 20} and width {8, 16, 32, 64, 128}. The
standard MLPs are trained by using Adam with an initial
learning rate 10⁻³ and a maximum of 1000 epochs.
As opposed to the convergence behavior of RBF-Net-based
POUnets, the standard MLP briefly exhibits roughly first-order
convergence before stagnating around an error of 10⁻².</p>
      </sec>
      <sec id="sec-4-2">
        <title>Piecewise Smooth Functions</title>
        <p>We next consider piecewise linear and piecewise quadratic
functions: triangle waves with varying frequencies, i.e.,
y(x) = TRI(x; p), and their quadratic variants y(x) =
TRI²(x; p), where</p>
        <p>
TRI(x; p) = 2 | px − ⌊px + 1/2⌋ |.
We study the introduction of increasingly many
discontinuities by increasing the frequency p. Reproduction of such
sawtooth functions by ReLU networks via both wide networks
          <xref ref-type="bibr" rid="ref18">(He et al. 2018)</xref>
          and very deep networks
          <xref ref-type="bibr" rid="ref28">(Telgarsky 2016)</xref>
can be achieved theoretically via construction of carefully
chosen architecture and weights/biases, but to our knowledge
has not been achieved via standard training.
        </p>
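The triangle wave as reconstructed above, and its quadratic variant, can be generated directly; the closed form here is our reading of the definition:

```python
import numpy as np

def tri(x, p):
    """Triangle wave with 2p linear pieces on [0, 1]: TRI(x; p) = 2|px - floor(px + 1/2)|."""
    return 2.0 * np.abs(p * x - np.floor(p * x + 0.5))

def tri2(x, p):
    """Quadratic variant TRI^2(x; p): 2p parabolic pieces with the same breakpoints."""
    return tri(x, p) ** 2
```

For p = 1 this rises from 0 at x = 0 to 1 at x = 1/2 and back to 0 at x = 1; increasing p packs 2p such pieces, and hence 2p − 1 interior derivative discontinuities, into [0, 1].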
        <p>The smoothness possessed by the RBF-Net partition
functions (POU #1) precludes rapidly convergent
approximation of piecewise smooth functions, and we instead employ
ResNets (POU #2) for Φ in this case, as the piecewise linear
nature of such partition functions is better suited. As a
baseline for the performance comparison, we apply regression
on the same data under a mean-square error loss using standard
ResNets, y_MLP(x). The standard ResNets share the same
architectural design and hyperparameters with the ResNets
used to parametrize the POU functions in the POUnets. The
only exception is the output layer; the standard ResNets
produce a scalar-valued output, whereas the POU
parametrization produces a vector-valued output, which is followed by
the softmax activation.</p>
        <p>We consider five frequency parameters for both target
functions, p ∈ {1, 2, 3, 4, 5}, which results in piecewise linear
and quadratic functions with 2p pieces. Based on the
number of pieces in the target function, we scale the width of the
baseline neural networks and POUnets as 4 × 2p, while fixing
the depth as 8; for POUnets the number of partitions is
set as N_part = 2p. For POUnets, we choose the maximal
degree of polynomials to be m_max = 1 and m_max = 2 for the
piecewise linear and quadratic target functions, respectively.</p>
        <p>Both neural networks are trained on the same data,
{x_i, y(x_i; p)}_{i=1}^{n_data}, where the x_i are uniformly sampled from
[0, 1] and n_data = 2000. The standard ResNets are trained
using the gradient descent method and the POUnets are trained
using the LSGD method with λ = 0 (Algorithm 1). The
initial learning rate for Adam is set as 10⁻³. The maximum
number of epochs n_epoch is set as 2000 and we choose the
neural network weights and biases that yield the best result
on the training loss during the specified epochs.</p>
        <p>Figure 2 reports the relative ℓ2-errors of approximants
produced by the standard ResNets and POUnets for varying N_part
and width for the piecewise linear functions (left) and the
piecewise quadratic functions (right). The statistics
are obtained from five independent runs.</p>
        <p>[Figure 2: relative ℓ2-error vs. p (N_part), ResNet vs. POUnet;
panels: (a) triangular waves, (b) quadratic waves.]</p>
        <p>Figure 2 essentially
shows that the POUnets outperform the standard ResNets in
terms of approximation accuracy; specifically, the standard
ResNets with the ReLU activation function fail
to produce accurate approximations, while POUnets obtain
&lt; 1% error for large numbers of discontinuities.</p>
        <p>Figure 3 illustrates snapshots of the target functions and
approximants produced by the two neural networks for target
functions with the frequency parameters p ∈ {4, 5}.
Figure 3 confirms the trends shown in Figure 2; the benefits
of using POUnets are more pronounced in approximating
functions with larger p and in approximating quadratic
functions (and, potentially, higher-order polynomials).
Figures 3(c)–3(f) clearly show that the POUnets accurately
approximate the target functions with sub-1% errors, while
the standard ResNets fail to produce accurate
approximations.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Results of two-phase LSGD</title>
        <p>Now we demonstrate the effectiveness of the two-phase LSGD
method (Algorithm 2) in constructing partitions which are
nearly disjoint and localized according to the features in the
data (i.e., the breakpoints in the target function), which we
found leads to better approximation accuracy.</p>
        <p>The first phase of the algorithm aims to construct such
partitions. To this end, we limit the expressibility of the
polynomial regression by regularizing the Frobenius norm of the
coefficients ‖c‖²_F in the least-squares solves. This prevents a
small number of partitions from dominating while others
collapse, per our discussion of the fast optimizer above.
These “quasi-uniform” partitions are then used as the initial
guess for the unregularized second phase. Finally, we study
the qualitative differences in the resulting POU, particularly
the ability of the network to learn a disjoint partition of space.
We do however stress that there is no particular “correct” form
for the resulting POU; it is primarily of interest because it
shows the network may recover a traditional discontinuous
polynomial approximation.</p>
        <p>In the first phase, we employ a relatively larger learning
rate in order to prevent weights and biases from getting stuck
in local minima. On the other hand, in the second phase we
employ a relatively smaller learning rate to ensure that the
training error decreases.</p>
        <p>Triangular waves. We demonstrate how the two-phase
LSGD works in two example cases. The first set of example
problems is the triangular wave with frequency parameter
p ∈ {1, 3}. We use the same POU network architecture
described in the previous section (i.e., ResNet with width 8 and
depth 8) and the same initialization scheme (i.e., the box
initialization). We set 0.1 and 0.05 for the initial learning rates
for phase 1 and phase 2, respectively; we set the other
LSGD parameters as 0.1 and 0.9, with n_stag = 1000.
Figure 4 (top) depicts both how partitions are constructed
during phase 1 (Figures 4(a)–4(d)), and snapshots of the
partitions and the approximant at the 1000th epoch of phase 2
in Figures 4(e) and 4(f). The approximation accuracy
measured in the relative ℓ2-norm of the error reaches down to
6.2042 × 10⁻⁸ after 12000 epochs in phase 2.</p>
        <p>
Next we consider the triangular wave with the frequency
parameter p = 3. We use the POU network architecture
constructed as a ResNet with width 8 and depth 10, and use the
box initialization. We set 0.05 and 0.01 for the initial learning
rates for phase 1 and phase 2, respectively; we set the other
LSGD parameters as 0.1 and 0.999, with n_stag = 1000.
Figure 4 (bottom) depicts again how partitions evolve during
phase 1 and the resulting approximants. Figures 4(g)–4(j)
show that the two-phase LSGD constructs partitions that
are disjoint and localized according to the features during the
first phase.
Quadratic waves. Finally, we approximate the piecewise
quadratic wave with frequency parameter p ∈ {1, 3} while
employing the same network architectures used for
approximating the triangular waves. For the p = 1 case the learning
rates are set as 0.5 and 0.25 for phase 1 and phase 2. We use
the same parameter settings (i.e., 0.1, 0.9, and n_stag = 1000)
as in the previous experiment. Again, in phase 1 disjoint
partitions are constructed, and an accurate approximation is
produced in phase 2 (Figures 5(a)–5(f)). Moreover, we
observe that the partitions are further refined during phase 2.
For the p = 3 case we again employ the architecture used for
the triangular wave with p = 3 (i.e., ResNet with width 8 and
depth 10). We use the same hyperparameters as for the
triangular wave with p = 3 (i.e., 0.05 and 0.01 for learning rates,
and n_stag = 1000). The exceptions are the weight for the
regularization, set to 1, and 0.999. Figures 5(g)–5(k)
illustrate that the phase-1 LSGD constructs disjoint supports for
the partitions, but takes many more epochs. Again, the two-phase
LSGD produces an accurate approximation (Figure 5(l)).
Discussion. The results of this section have demonstrated
that it is important to apply a good regularizer during a
pretraining stage to obtain approximately disjoint piecewise
constant partitions. For the piecewise polynomial functions
considered here, such partitions allowed in some cases
recovery to near-machine precision. Given the abstract error
analysis, it is clear that such localized partitions which adapt
to the features of the target function act similarly to
traditional hp-approximation. There are several challenges
regarding this approach, however: the strong regularization during
phase 1 requires a large number of training epochs, and we
were unable to obtain a set of hyperparameters which
provide such clean partitions across all cases. This suggests two
potential areas of future work. Firstly, an improved means
of regularizing during pretraining that does not compromise
accuracy may be able to train faster; deep learning strategies
for parametrizing charts as in
          <xref ref-type="bibr" rid="ref13 ref25">(Schonsheck, Chen, and Lai
2019)</xref>
          may be useful for this. Secondly, it may be fruitful
to understand the approximation error under assumptions less
restrictive than that the partitions are compactly supported or
highly localized. We have been guided by the idea that
polynomials defined on indicator function partitions reproduce
the constructions used by the finite element community, but a
less restrictive paradigm combined with an effective learning
strategy may prove more flexible for general datasets.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Borrowing ideas from numerical analysis, we have
demonstrated a novel architecture and optimization strategy whose
computational complexity need not scale
exponentially with the ambient dimension and which provides
high-order convergence for smooth functions and error
consistently under 1% for piecewise smooth problems. Such an
architecture has the potential to provide DNN methods for
solving high-dimensional PDEs that converge in a manner
competitive with traditional finite element spaces.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Sandia National Laboratories is a multimission laboratory
managed and operated by National Technology and
Engineering Solutions of Sandia, LLC, a wholly owned subsidiary
of Honeywell International, Inc., for the U.S. Department
of Energy’s National Nuclear Security Administration under
contract DE-NA0003530. This paper describes objective
technical results and analysis. Any subjective views or opinions
that might be expressed in the paper do not necessarily
represent the views of the U.S. Department of Energy or the United
States Government. SAND Number: SAND2020-6022 J.</p>
      <p>The work of M. Gulian, R. Patel, and N. Trask is
supported by the U.S. Department of Energy, Office of Advanced
Scientific Computing Research under the Collaboratory on
Mathematics and Physics-Informed Learning Machines for
Multiscale and Multiphysics Problems (PhILMs) project. E.
C. Cyr and N. Trask are supported by the Department of
Energy early career program. M. Gulian is supported by the John
von Neumann fellowship at Sandia National Laboratories.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; et al.
          <year>2016</year>
          .
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          .
          <source>In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)</source>
          ,
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Adcock</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dexter</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>The gap between theory and practice in function approximation with deep neural networks</article-title>
          .
          <source>arXiv preprint arXiv:2001.07523</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Breaking the curse of dimensionality with convex neural networks</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>629</fpage>
          -
          <lpage>681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Beck</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jentzen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kuckuck</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Full error analysis for the training of deep neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1910.00121</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Taking on the curse of dimensionality in joint distributions using neural networks</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>550</fpage>
          -
          <lpage>557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Billings</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>G. L.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>Radial basis function network configuration using genetic algorithms</article-title>
          .
          <source>Neural Networks</source>
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <fpage>877</fpage>
          -
          <lpage>890</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Broomhead</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1988</year>
          .
          <article-title>Radial basis functions, multi-variable functional interpolation and adaptive networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Deep learning with Python</article-title>
          . Manning Publications Company.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Cyr</surname>
            ,
            <given-names>E. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gulian</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Perego</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Trask</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Robust training and initialization of deep neural networks: An adaptive basis viewpoint</article-title>
          .
          <source>arXiv preprint arXiv:1912.04862</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Daubechies</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>DeVore</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Foucart</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hanin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Petrova</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Nonlinear approximation and (deep) ReLU networks</article-title>
          .
          <source>arXiv preprint arXiv:1905.02199</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Fokina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Oseledets</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Growing axons: greedy learning of neural networks with application to function approximation</article-title>
          .
          <source>arXiv preprint arXiv:1910.12686</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Fries</surname>
            ,
            <given-names>T.-P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Belytschko</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>The extended/generalized finite element method: an overview of the method and its applications</article-title>
          .
          <source>International Journal for Numerical Methods in Engineering</source>
          <volume>84</volume>
          (
          <issue>3</issue>
          ):
          <fpage>253</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Geist</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Raslan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kutyniok</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Numerical solution of the parametric diffusion equation by deep neural networks</article-title>
          .
          <source>arXiv preprint arXiv:2004.12131</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jentzen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Weinan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Solving high-dimensional partial differential equations using deep learning</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>115</volume>
          (
          <issue>34</issue>
          ):
          <fpage>8505</fpage>
          -
          <lpage>8510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>ReLU deep neural networks and linear finite elements</article-title>
          .
          <source>arXiv preprint arXiv:1807.03973</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Hebey</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Nonlinear Analysis on Manifolds: Sobolev Spaces and Inequalities</article-title>
          , volume
          <volume>5</volume>
          .
          <source>American Mathematical Soc.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Opschoor</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Deep ReLU networks and high-order finite element methods</article-title>
          . SAM, ETH Zürich .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Patel</surname>
            ,
            <given-names>R. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Trask</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gulian</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cyr</surname>
            ,
            <given-names>E. C.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>A block coordinate descent optimizer for classification problems exploiting convexity</article-title>
          .
          <source>arXiv preprint arXiv:2006.10123</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Schonsheck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Chart AutoEncoders for Manifold Structured Data</article-title>
          .
          <source>arXiv preprint arXiv:1912.10094</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Spivak</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus</article-title>
          . CRC press.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Strouboulis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Copps</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Babuška</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>The generalized finite element method</article-title>
          .
          <source>Computer methods in applied mechanics and engineering</source>
          <volume>190</volume>
          (
          <issue>32-33</issue>
          ):
          <fpage>4081</fpage>
          -
          <lpage>4193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Telgarsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Benefits of depth in neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1602.04485</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Teng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Perdikaris</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Understanding and mitigating gradient pathologies in physics-informed neural networks</article-title>
          .
          <source>arXiv preprint arXiv:2001.04536</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Wendland</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Fast evaluation of radial basis functions: Methods based on partition of unity</article-title>
          .
          <source>In Approximation Theory X: Wavelets, Splines, and Applications</source>
          . Citeseer.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Yarotsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Error bounds for approximations with deep ReLU networks</article-title>
          .
          <source>Neural Networks</source>
          <volume>94</volume>
          :
          <fpage>103</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Yarotsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Optimal approximation of continuous functions by very deep ReLU networks</article-title>
          .
          <source>arXiv preprint arXiv:1802.03620</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>