Learning High-Dimensional Hilbert-Valued Functions With Deep Neural Networks From Limited Data

Ben Adcock¹, Simone Brugiapaglia², Nick Dexter¹, Sebastian Moraga¹
¹ Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
² Department of Mathematics & Statistics, Concordia University, Montréal, QC H3G 1M8, Canada
ben_adcock@sfu.ca, simone.brugiapaglia@concordia.ca, nicholas_dexter@sfu.ca, smoragas@sfu.ca

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Deep learning (DL) techniques have been successful in many applications, with the most impressive results achieved on problems where the dimension of the underlying data or problem domain is large. In this paper, we describe recent results on both scalar- and Hilbert-valued function approximation via Deep Neural Networks (DNNs). This problem arises in many engineering problems, in particular those involving the solution of parametric Partial Differential Equations (PDEs). Such problems are challenging, since point-wise samples are expensive to acquire and the function is usually high-dimensional. First, we consider a DNN architecture and training procedure for which the resulting DNN is guaranteed to perform as well as current best-in-class schemes for holomorphic, scalar- or Hilbert-valued functions based on polynomial approximations. This result demonstrates the efficacy of DL for this problem, and makes explicit the effect of all sources of error, including the discretization error of the underlying Hilbert space and the measurement error. Second, we provide several numerical results illustrating the performance of DNNs on both real-valued functions and solutions of parametric PDEs. These results suggest that better approximations can be achieved through careful tuning of the DNN architecture and training algorithm.

1 Introduction

Modern machine learning approaches based on training DNNs on large datasets have achieved impressive results on high-dimensional problems in several important scientific computing applications, including inverse problems in imaging (Adcock and Hansen 2021; Ongie et al. 2020), molecular dynamics simulations (Faber et al. 2017), partial differential equations (PDEs) (Berg and Nyström 2018), and parameterized PDEs for uncertainty quantification (UQ) in engineering (Cyr et al. 2020; Dal Santo, Deparis, and Pegolotti 2020; Geist et al. 2020; Khoo, Lu, and Ying 2020; Laakmann and Petersen 2020). Recent theoretical results on DNN expressivity suggest that such architectures are capable of approximating a wide class of functions by emulating existing approximation schemes, e.g., polynomials, wavelets, or free-knot splines. For smooth functions, exponential rates of convergence have been shown for fully-connected ReLU DNN architectures (Opschoor, Schwab, and Zech 2019).

While these results are impressive, many important questions about the application of DL techniques to problems in scientific computing remain. The work (Adcock and Dexter 2020) raised key issues regarding current methods of training which prevent DNNs from practically achieving high-accuracy approximations on smooth function approximation tasks. It was also demonstrated in (Antun et al. 2020; Gottschling et al. 2020) that DL models trained to solve inverse problems in imaging can fail to recover small structural changes in images, and that small perturbations can lead to numerical artifacts. In other words, DNNs can lead to unstable methods unless they are carefully constructed. Hence, in order for these tools to be accepted and applied to critical tasks in scientific computing with strict error tolerances, e.g., UQ problems in mechanical engineering, it is necessary to develop theoretical foundations on the stability and convergence of DNNs, as well as an understanding of how to train them to ensure these properties.

In this work we showcase results on learning high-dimensional scalar- and Hilbert-valued functions from limited datasets with DNNs. While most previous studies have considered scalar-valued functions, Hilbert-valued functions arise in various important applications, for example UQ problems involving PDEs.
Such problems are typically expressed in terms of a differential operator D_x, with x comprising the spatial and temporal variables, and the goal is to find, for all y ∈ U ⊆ R^d, a solution u ∈ V, with V a Hilbert or Banach space, satisfying

\[ D_x(u, y) = 0, \tag{1} \]

subject to suitable boundary conditions. Existing techniques for solving such problems achieve a discretization of the infinite-dimensional parameter-to-solution map y ↦ u(y) ∈ V by combining spatial discretization, e.g., finite element or finite difference methods, with global polynomial approximation, achieved through Galerkin projection (Babuška, Tempone, and Zouraris 2004), hierarchical interpolation on sparse grids (Nobile, Tempone, and Webster 2008a), discrete least squares (Chkifa et al. 2015), or compressive sensing-based approaches (Adcock, Brugiapaglia, and Webster 2017; Dexter, Tran, and Webster 2019; Doostan and Owhadi 2011; Hampton and Doostan 2015).

Challenges

The large underlying dimension of problems of the form (1) presents a major challenge in practical applications. The curse of dimensionality states that the amount of data and computational effort needed to construct an approximation may grow exponentially with the problem dimension. In many real-world problems arising in scientific computing and UQ the parametric dimension d ≫ 1, as the parameters are used to model various material properties, forcing terms, and boundary information. In such settings, naïve application of the aforementioned techniques can lead to computationally inefficient schemes or suboptimal approximations with respect to the sample complexity, i.e., the number of samples required to construct the approximation.

Previous work

DNNs have recently been studied for these problems in (Kutyniok et al. 2020), where, under assumptions guaranteeing low-dimensionality of the solution manifold, fast, dimension-independent rates of convergence were established by connecting DNNs to reduced basis approximations and the theory of Kolmogorov N-widths. The work (Geist et al. 2020) numerically tested the accuracy of DNNs in approximating solutions to the parametric diffusion equation over a range of dimensions, finding that errors scale with the inherent complexity of the problem as the dimension increases, but that the observed scaling is not exponential. Other works such as (Cyr et al. 2020; Dal Santo, Deparis, and Pegolotti 2020; Khoo, Lu, and Ying 2020; Laakmann and Petersen 2020) have also demonstrated the potential of the DNN approach for parametric PDEs.

Further, it is hoped that DNNs may overcome some of the limitations of standard polynomial-based methods. These methods require strong smoothness assumptions on the underlying parameterized PDE problem in order to show fast rates of convergence. However, problems typically found in engineering or science applications rarely satisfy all of these assumptions (Smith 2013). On the other hand, DNNs have been shown to achieve optimal rates of approximation for piecewise smooth functions (Petersen and Voigtlaender 2018), suggesting far greater flexibility than the polynomial-based approach. Numerical studies have also demonstrated the efficiency of DNNs in approximating discontinuous (scalar-valued) functions (Adcock and Dexter 2020), whereas polynomial-based methods fail to converge.

This work

The purpose of this work is to describe recent results (Adcock and Dexter 2020; Adcock et al. 2020) on DNN approximation of scalar- and Hilbert-valued functions and their application to parametric PDEs. Although DNNs have shown some promise for discontinuous functions, in this work we focus on the smooth case, where a direct comparison with polynomial-based methods is valid. In particular, we focus on the following question: to what extent (both theoretically and empirically) can DNN approaches outperform current best-in-class polynomial-based methods?

2 Preliminaries

We first require some notation. For d ∈ N we write N_0^d := {ν = (ν_k)_{k=1}^d : ν_k ∈ N_0} for the set of nonnegative integer multi-indices. The inequality µ ≤ ν is understood componentwise. We write 0 and 1 for the multi-indices consisting of all zeros and all ones respectively. We also use the notation A ≲ B to mean that there exists a numerical constant c > 0 independent of A and B such that A ≤ cB, and likewise for A ≳ B.

Setup

Throughout, we let V be a separable Hilbert space over the field R, with inner product ⟨·,·⟩_V and corresponding norm ‖v‖_V = √⟨v, v⟩_V. We let V^N be the vector space of Hilbert-valued vectors of length N, i.e. u = (u_i)_{i=1}^N where u_i ∈ V, i = 1, ..., N. More generally, let Λ ⊆ N_0^d denote a (possibly infinite) multi-index set. We write v = (v_ν)_{ν∈Λ} for a sequence with V-valued entries, v_ν ∈ V. We define the space ℓ²(Λ; V) as the set of those sequences for which the corresponding norm

\[ \|v\|_{\mathcal{V},2} := \Bigg( \sum_{\nu \in \Lambda} \|v_\nu\|_{\mathcal{V}}^2 \Bigg)^{1/2} < \infty. \]

For the parametric domain, we consider the hypercube equipped with the uniform probability measure

\[ \mathcal{U} = [-1, 1]^d, \qquad \mathrm{d}\varrho(y) = 2^{-d}\,\mathrm{d}y. \tag{2} \]

For 1 ≤ p ≤ ∞, we write L^p_ϱ(U) for the corresponding Lebesgue spaces and ‖·‖_{L^p_ϱ(U)} for their norms. Next, we define the Bochner space L^p_ϱ(U; V) as the space consisting of (equivalence classes of) strongly ϱ-measurable functions f : U → V for which

\[ \|f\|_{L^p_\varrho(\mathcal{U};\mathcal{V})} := \begin{cases} \left( \int_{\mathcal{U}} \|f(y)\|_{\mathcal{V}}^p \, \mathrm{d}\varrho(y) \right)^{1/p}, & 1 \le p < \infty, \\ \operatorname{ess\,sup}_{y \in \mathcal{U}} \|f(y)\|_{\mathcal{V}}, & p = \infty. \end{cases} \]

Generally speaking, one cannot deal directly with an infinite-dimensional output space V. Hence, we introduce a discretization (e.g. a finite element discretization). This is a finite-dimensional subspace V_h ⊂ V of dimension K = dim(V_h). We let P_h : V → V_h be the orthogonal projection onto V_h and {φ_k}_{k=1}^K be a (not necessarily orthogonal) basis of V_h. Moreover, for f ∈ L²_ϱ(U; V) we let P_h f ∈ L²_ϱ(U; V_h) be the function defined almost everywhere as

\[ (P_h f)(y) = P_h(f(y)), \qquad \forall y \in \mathcal{U}. \]
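To make the discretized Hilbert-valued setting concrete, here is a small illustrative sketch (ours, not code from the paper): for an element v = Σ_k c_k φ_k of V_h with a not-necessarily-orthogonal basis, the norm ‖v‖_V reduces to a weighted Euclidean norm of the coefficient vector via the Gram matrix G_{jk} = ⟨φ_j, φ_k⟩_V, and the ‖·‖_{V,2} norm of a vector of such elements follows from it. The basis dimension and Gram matrix below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical Gram matrix G[j, k] = <phi_j, phi_k>_V of a non-orthogonal
# basis {phi_k} of a K-dimensional subspace V_h (placeholder values).
K = 3
G = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])

def v_norm(c, G):
    """V-norm of v = sum_k c[k] * phi_k, i.e. sqrt(c^T G c)."""
    return np.sqrt(c @ G @ c)

def v2_norm(C, G):
    """'V,2'-type norm of a vector (v_1, ..., v_N) of V_h elements,
    where row C[i] holds the coefficients of v_i."""
    return np.sqrt(sum(v_norm(c, G) ** 2 for c in C))

# Example: two elements of V_h with random coefficients.
rng = np.random.default_rng(0)
C = rng.standard_normal((2, K))
print(v_norm(C[0], G), v2_norm(C, G))
```

For an orthonormal basis G is the identity and both quantities reduce to ordinary Euclidean norms of the coefficients.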
Holomorphy and polynomial approximation

The focus of this work is the approximation of smooth functions f : U → V. Note that f may either be scalar-valued, e.g. V = R, or Hilbert-valued, in which case V is a separable, but potentially infinite-dimensional, Hilbert space. By smooth, we mean that f has a holomorphic extension to a suitable complex region O of the form U ⊆ O ⊆ C^d. Specifically, we take these regions to be Bernstein polyellipses. Given parameters ρ = (ρ_j)_{j=1}^d with ρ_j > 1, these are of the form

\[ \mathcal{E}_{\rho} = \mathcal{E}_{\rho_1} \times \cdots \times \mathcal{E}_{\rho_d}, \]

where, for ρ > 1, E_ρ = { ½(z + z⁻¹) : z ∈ C, 1 ≤ |z| ≤ ρ } ⊂ C is the classical Bernstein ellipse. Note that Bernstein polyellipses, much like classical Bernstein ellipses, are intimately connected with multivariate polynomial approximation (Chkifa, Cohen, and Schwab 2015; Cohen and DeVore 2015; Cohen, DeVore, and Schwab 2011).

To be precise, in this work we consider the classes of functions HA = HA(γ, ε, d), where γ > 0 and ε > 0 are constants. This class consists of functions f : U → V that have a holomorphic extension to a Bernstein polyellipse E_ρ and satisfy ‖f‖_{L^∞(E_ρ;V)} ≤ 1, and where the parameters ρ = (ρ_j)_{j=1}^d also satisfy

\[ \left( \frac{d!}{(d+1)^{1+\varepsilon}} \prod_{j=1}^{d} \log(\rho_j) \right)^{1/d} \ge \gamma. \tag{3} \]

The motivation for this definition is the following. For any f ∈ HA(γ, ε, d) its best s-term polynomial approximation f_s in L²_ϱ-orthogonal polynomials (normalized tensor Legendre polynomials) decays exponentially fast in s: namely,

\[ \| f - f_s \|_{L^2_\varrho(\mathcal{U};\mathcal{V})} \le \exp\big( -\gamma s^{1/d} \big), \qquad \forall s \ge \bar{s}(d, \varepsilon, \rho), \tag{4} \]

where s̄(d, ε, ρ) is a constant (Opschoor, Schwab, and Zech 2019). This is the rate we seek to obtain with a DNN approximation.
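As an illustration of the definitions above (ours, not from the paper), the following sketch traces the boundary of a Bernstein ellipse E_ρ via w = (z + z⁻¹)/2 with |z| = ρ, and evaluates the left-hand side of condition (3) as reconstructed above for a given parameter vector ρ; the values of ρ and ε below are hypothetical.

```python
import numpy as np
from math import factorial

def bernstein_ellipse_boundary(rho, n=200):
    """Points on the boundary of the Bernstein ellipse E_rho:
    w = (z + 1/z)/2 with z = rho * exp(i*theta)."""
    theta = np.linspace(0.0, 2.0 * np.pi, n)
    z = rho * np.exp(1j * theta)
    return 0.5 * (z + 1.0 / z)

def holomorphy_constant(rho, eps):
    """Left-hand side of condition (3):
    ( d!/(d+1)^(1+eps) * prod_j log(rho_j) )^(1/d)."""
    rho = np.asarray(rho, dtype=float)
    d = rho.size
    val = factorial(d) / (d + 1) ** (1 + eps) * np.prod(np.log(rho))
    return val ** (1.0 / d)

# Hypothetical parameters: d = 4, all rho_j = 3, eps = 1.
rho = [3.0, 3.0, 3.0, 3.0]
print(holomorphy_constant(rho, eps=1.0))     # compare against a target gamma
print(bernstein_ellipse_boundary(3.0)[:3])   # a few boundary points of E_3
```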
Problem statement

Let f : U → V be a holomorphic and potentially Hilbert-valued function. We assume m sample points {y_i}_{i=1}^m are drawn independently and identically according to the uniform measure ϱ. For each y_i, we suppose an approximate value of f(y_i) is computed in the space V_h. Hence, the measurements of f are

\[ d_i = f(y_i) + n_i \in \mathcal{V}_h, \qquad i = 1, \ldots, m. \tag{5} \]

Note that this encompasses the fact that, for example in parametric PDEs, f(y_i) is computed via a black-box numerical routine (e.g. a numerical PDE solve) that must, of course, return a value in a finite-dimensional space. Thus the measurement error term n_i contains the error resulting from the numerical solver.

Our goal is to approximate f from the measurements {d_i}_{i=1}^m using a DNN. Write the projection P_h f in terms of the basis {φ_k}_{k=1}^K as

\[ f \approx (P_h f)(y) = \sum_{k=1}^{K} c_k(y)\, \varphi_k. \]

We aim to approximate the vector-valued map R^d → R^K, y ↦ (c_k(y))_{k=1}^K, with a DNN Φ̂ : R^d → R^K. We consider feedforward DNN architectures, i.e.

\[ \hat{\Phi}(y) = A_{L+1}\big( \rho( A_L( \rho( \cdots \rho( A_0(y) ) \cdots ) ) ) \big), \qquad L \ge 1, \]

where ρ is the activation function and the A_l are affine maps. Given such a DNN Φ, the resulting approximation to f is

\[ f(y) \approx f_{\Phi}(y) = \sum_{k=1}^{K} (\Phi(y))_k\, \varphi_k \in \mathcal{V}_h. \]

Note that when V = R, i.e. f is scalar-valued, such considerations are unnecessary, since V_h = V = R and we simply have f(y) ≈ Φ(y).
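Since the experiments reported below use TensorFlow, the following Keras sketch (our illustration, not the authors' code) builds a fully connected feedforward network Φ : R^d → R^K and assembles f_Φ(y) = Σ_k (Φ(y))_k φ_k; the sizes d and K, the width and depth, and the tabulated basis values are all hypothetical placeholders.

```python
import numpy as np
import tensorflow as tf

d, K = 8, 16          # hypothetical parametric dimension and dim(V_h)
width, depth = 50, 5  # hypothetical architecture sizes

# Fully connected (feedforward) network Phi : R^d -> R^K.
hidden = [tf.keras.layers.Dense(width, activation="relu") for _ in range(depth)]
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(d,))] + hidden + [tf.keras.layers.Dense(K)]
)

# Hypothetical representation of the basis {phi_k}: values of each phi_k
# at n_x spatial points, one basis function per row.
n_x = 100
Phi_basis = np.random.default_rng(1).standard_normal((K, n_x)).astype("float32")

def f_Phi(y):
    """Evaluate f_Phi(y) = sum_k (Phi(y))_k phi_k at the spatial points."""
    c = model(np.atleast_2d(y).astype("float32"))   # shape (batch, K)
    return tf.matmul(c, Phi_basis)                  # shape (batch, n_x)

print(f_Phi(np.zeros(d)).shape)  # (1, n_x)
```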
3 Learning holomorphic, Hilbert-valued functions via DNNs

We now present a theoretical result demonstrating that DNNs can be learned to efficiently approximate holomorphic functions. Note that (Opschoor, Schwab, and Zech 2019) has previously shown that there is a DNN that achieves the same rate of approximation (4) as the best s-term polynomial approximation. In the following result, we show that such a DNN can also be learned efficiently from the data (5) through a standard training strategy. Due to space limitations we omit some details. The full details and proof can be found in (Adcock et al. 2020).

Theorem 3.1. Let m ≥ 2 and

\[ \tilde{m} := \frac{c\, m}{\log^2(m) \min\{ \log(m) + d,\; \log(m)\log(2d) \}} \]

be such that m̃ ≥ 2d, and let {y_i}_{i=1}^m be drawn randomly and independently from ϱ. Then, with high probability, there is

• a class of neural networks N whose size, maximum depth and number of trainable parameters are at most polynomial in m̃ (the rate of growth can be specified, along with the dependence on d),
• a regularization functional J : N → [0, ∞) based on a certain norm of the trainable parameters,

such that for every f ∈ HA(γ, ε, d) with noisy evaluations (5), any minimizer Φ̂ of the regularized training problem

\[ \min_{\Phi \in \mathcal{N}} \; \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \| f_{\Phi}(y_i) - d_i \|_{\mathcal{V}}^2 } \; + \; \lambda\, \mathcal{J}(\Phi) \tag{6} \]

satisfies, for e = (n_i)_{i=1}^m / √m and for m large enough,

\[ \| f - f_{\hat{\Phi}} \|_{L^2_\varrho(\mathcal{U};\mathcal{V})} \;\lesssim\; \underbrace{\exp\big( -\gamma\, \tilde{m}^{1/(2d)} / 2 \big)}_{(a)} \; + \; \underbrace{\| e \|_{\mathcal{V},2}}_{(b)} \; + \; \underbrace{\| f - P_h(f) \|_{L^\infty_\varrho(\mathcal{U};\mathcal{V})}}_{(c)}. \]

Many recent works have shown that DNNs can efficiently approximate different classes of functions – see, e.g., (Adcock and Dexter 2020; Petersen and Voigtlaender 2018; Opschoor, Schwab, and Zech 2019; Grohs et al. 2019) and references therein. Such results are of the form of existence theorems, i.e. they assert the existence of a DNN of a given architecture with favourable approximation properties – for example, as noted above, (Opschoor, Schwab, and Zech 2019) shows the existence of a DNN with the same approximation rates as the best s-term polynomial approximation – but provide no constructive means for learning it, nor an estimate on the number and type of samples needed to do so. We term Theorem 3.1 a practical existence theorem, since it asserts not only the existence of such a DNN, but also that it can be trained. Moreover, through the first term (a) in the error bound, which is exponentially small in m̃, it shows that the DNN can be trained in a sample-efficient way. It shows that such a DNN achieves the same error as the best s-term polynomial approximation, with a sample complexity that scales quadratically in s. This is, in fact, exactly the same sample complexity as achieved by polynomial-based compressed sensing methods for holomorphic function approximation (Adcock, Brugiapaglia, and Webster 2017; Dexter, Tran, and Webster 2019). Hence, (a) asserts that the DNN procedure achieves the same approximation rate in terms of m̃ as current best-in-class schemes.

This aside, Theorem 3.1 also makes explicit the effect of the three main errors in the training process. First, the approximation error (a), which depends on the number of samples m̃. Second, the sampling error (b), i.e. the error in the numerical PDE solve. And third, the discretization error (c). This error is due to working in the finite-dimensional space V_h rather than V. However, the key point is that it is proportional to the error of the orthogonal projection P_h(f), i.e. the best approximation to f from V_h.
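A minimal sketch (ours) of a training objective with the structure of (6): a square-root ℓ²-type data-fidelity term over the samples plus λ times a norm of the trainable parameters. Theorem 3.1 only describes the regularizer J abstractly, so the ℓ¹-norm of the weights used here is a hypothetical stand-in, as are λ, the network and the synthetic data; the V_h-norm is identified with the Euclidean norm of the coefficients, which is exact only for a V-orthonormal basis.

```python
import numpy as np
import tensorflow as tf

d, K, m = 8, 16, 32   # hypothetical sizes
lam = 1e-3            # hypothetical regularization parameter

model = tf.keras.Sequential([tf.keras.Input(shape=(d,)),
                             tf.keras.layers.Dense(50, activation="relu"),
                             tf.keras.layers.Dense(K)])

def regularized_loss(y_batch, d_batch):
    """sqrt((1/m) * sum_i ||f_Phi(y_i) - d_i||^2) + lam * J(Phi).
    The V_h-norm is taken as the Euclidean norm of the coefficient
    difference (exact if {phi_k} is V-orthonormal), and J is the l1-norm
    of the weights -- a stand-in for the abstract regularizer."""
    pred = model(y_batch)
    data_term = tf.sqrt(
        tf.reduce_mean(tf.reduce_sum((pred - d_batch) ** 2, axis=-1)))
    reg = tf.add_n([tf.reduce_sum(tf.abs(w)) for w in model.trainable_variables])
    return data_term + lam * reg

rng = np.random.default_rng(2)
y = rng.uniform(-1.0, 1.0, size=(m, d)).astype("float32")
dvals = rng.standard_normal((m, K)).astype("float32")  # stand-in for the data d_i
print(float(regularized_loss(y, dvals)))
```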
4 Numerical experiments

This result demonstrates the potential efficacy of the DNN approach for Hilbert-valued function approximation. Namely, the DNN approach can perform as well as current best-in-class schemes based on compressed sensing for holomorphic, Hilbert-valued function approximation. However, we caution that the training procedure outlined in Theorem 3.1 is not expected to lead to DNNs that perform any better. Its proof heavily leverages the relation between DNNs and polynomials, and between loss function minimization and standard optimization problems for compressed sensing. On the other hand, the generality of fully-connected DNN architectures allows them to be applied to problems where polynomial approximations are known to fail, e.g. functions with discontinuities or low parametric regularity. This motivates the need for numerical experimentation, which compares the practical performance of DNN training with standard architectures and loss functions to other schemes such as polynomial-based compressed sensing.

Scalar-valued function approximation

In this section, we consider scalar-valued function approximation. Our purpose is to compare standard DNN training to best-in-class compressed sensing schemes. For further details of the study, and for additional experiments, see (Adcock and Dexter 2020). We focus on two factors: first, the sample complexity, i.e. the behaviour of the error as m varies, and second, the effect of noise in the measurements.

We consider ReLU DNN architectures with a fixed number of nodes N per layer and depth L. We vary the ratio β := L/N to try to obtain the best approximation. We use TensorFlow and perform calculations in single precision. For training, we use the standard ℓ²-loss function and the Adam optimizer (Kingma and Ba 2014). We train for a maximum of 50,000 epochs or until the tolerance ε_tol = 5 × 10⁻⁷ is reached. The weights and biases are initialized as normal random variables with mean 0 and variance 0.01. The compressed sensing schemes follow the setup of (Adcock, Brugiapaglia, and Webster 2017), and are based on Legendre polynomials in a hyperbolic cross index set, combined with either ℓ¹-minimization or weighted ℓ¹-minimization (the latter being known to offer improved performance). These experiments are performed in double precision in Matlab using the SPGL1 solver (van den Berg and Friedlander 2019, 2009). The approximation error is taken as the relative L²(U) error, and is computed using a high-order isotropic Clenshaw–Curtis sparse grid quadrature rule consisting of roughly 1.9 × 10⁶ points.

In Fig. 1, we consider two different smooth, scalar-valued functions:

\[ f_1(y) = \exp\left( -\frac{\sum_{k=1}^{d} \cos(y_k)}{8d} \right), \qquad f_2(y) = \left( \frac{\prod_{k=1}^{d/2} \big( 1 + 4^{k} y_k^2 \big)}{\prod_{k=d/2+1}^{d} \big( 100 + 5 y_k \big)} \right)^{1/d}. \]

Figure 1: Error approximating (left) f₁(y) and (right) f₂(y) with d = 8.
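A sketch (ours, under the setup described above) of the two test functions and of the DNN training configuration: a ReLU network with N nodes per layer and L hidden layers, weights and biases initialized as N(0, 0.01) variables, trained on the ℓ²-loss with Adam. The particular width, depth, learning rate and sample size below are hypothetical choices, and the 50,000-epoch / 5 × 10⁻⁷-tolerance stopping rule is only indicated in a comment.

```python
import numpy as np
import tensorflow as tf

d = 8

def f1(y):
    return np.exp(-np.sum(np.cos(y), axis=-1) / (8 * d))

def f2(y):
    num = np.prod(1.0 + 4.0 ** np.arange(1, d // 2 + 1) * y[..., : d // 2] ** 2, axis=-1)
    den = np.prod(100.0 + 5.0 * y[..., d // 2:], axis=-1)
    return (num / den) ** (1.0 / d)

# Training data: m points drawn uniformly from U = [-1, 1]^d (f2 analogous).
m = 500                      # hypothetical sample size
rng = np.random.default_rng(0)
Y = rng.uniform(-1.0, 1.0, size=(m, d)).astype("float32")
F = f1(Y).astype("float32").reshape(-1, 1)

# ReLU network with N nodes per layer and L hidden layers; weights and
# biases initialized as N(0, 0.01) variables, as described in the text.
N, L = 50, 5                 # hypothetical width/depth (beta = L/N = 0.1)
init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=np.sqrt(0.01))
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(d,))]
    + [tf.keras.layers.Dense(N, activation="relu",
                             kernel_initializer=init, bias_initializer=init)
       for _ in range(L)]
    + [tf.keras.layers.Dense(1, kernel_initializer=init, bias_initializer=init)]
)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
# The paper trains for up to 50,000 epochs or until the loss reaches
# 5e-7; a handful of epochs is used here just to keep the sketch short.
model.fit(Y, F, epochs=5, batch_size=m, verbose=0)
```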
This experiment demonstrates a gap between the practical existence theorem, Theorem 3.1, and the practical performance of DNNs when trained with standard architectures and optimizers. The first function is extremely smooth and therefore has approximately sparse polynomial coefficients. In spite of Theorem 3.1, Fig. 1 demonstrates that compressed sensing significantly outperforms the DL approach. On the other hand, the function f₂ is less favourably approximated by polynomials, and the DL approach achieves competitive performance with compressed sensing. This points towards the efficacy of DL for broader classes of functions. Note that both experiments were also completed over 30 alternative ReLU DNN architectures of varying depth and width, with similar results.

Next, in Fig. 2 we consider the effect of noise. These experiments suggest that the trained DNNs are numerically stable, with the noise contributing linearly to the overall error. Unlike the gap identified above in the approximation error, this behaviour is in good agreement with Theorem 3.1.

Figure 2: Errors of the two best-performing ReLU DNNs in approximating (left) f₁(y) and (right) f₂(y) in d = 8 dimensions, with and without N(0, σ²) noise for various σ.

Parametric PDEs with mixed boundary conditions

We conclude by applying DL to a parametric PDE problem. Specifically, we consider a non-trivial, two-dimensional Hilbert-valued function approximation problem arising as the solution of a parametric PDE with mixed boundary conditions. See, e.g., (Adcock et al. 2020; Geist et al. 2020) for parametric PDE problems with Dirichlet boundary conditions. Here, our objective is to examine the efficacy of the DNN approach on a non-standard parametric PDE.

Let Ω ⊆ R² be a given bounded domain with polyhedral boundary ∂Ω = Γ, such that Γ = Γ_D ∪ Γ_N and Γ_D ∩ Γ_N = ∅. We focus on the following parametric diffusion example with mixed boundary conditions:

\[ \begin{aligned} -\nabla \cdot \big( K(x, y) \nabla u(x, y) \big) &= f(x, y), && x \in \Omega, \; y \in \mathcal{U}, \\ u(x, y) &= h(x, y), && x \in \Gamma_D, \; y \in \mathcal{U}, \\ \nabla u(x, y) \cdot n &= 0, && x \in \Gamma_N, \; y \in \mathcal{U}, \end{aligned} \]

where, given y ∈ U, f(y) ∈ [L²(Ω)]² and h(y) ∈ [H^{1/2}(Γ_D)]², whose components are in the space of traces of functions in H¹(Ω). We define K = diag(a), where a = [a₁, a₂]^⊤. Then, to guarantee well-posedness and parametric regularity of the solution u, we require uniform boundedness assumptions on each component of K; see (Cohen, DeVore, and Schwab 2011) for more details. That is, there exist a_{j,max} ≥ a_{j,min} > 0 such that

\[ a_{j,\min} \le a_j(x, y) \le a_{j,\max}, \qquad \forall x \in \Omega, \; \forall y \in \mathcal{U}, \; j \in \{1, 2\}. \]

In our example, we set Ω = (0, 1)² and define the Dirichlet and Neumann boundaries as

\[ \Gamma_D = \{ x \in [0,1]^2 : x_1 = 0 \text{ or } x_1 = 1 \}, \qquad \Gamma_N = \{ x \in [0,1]^2 : x_2 = 0 \text{ or } x_2 = 1 \}. \]

Next, we define the uniformly positive tensor K with components

\[ a_1(x, y) = 3 + x_1 y_1 + x_2 y_2, \qquad a_2(x, y) = \exp\Big( 1 + y_1 \big( \tfrac{\sqrt{\pi}\, L}{2} \big)^{1/2} + \zeta \sin(\pi x_1)\, y_2 \Big). \]

Here, the first is a simple affine coefficient, and the second is a modification of the example from (Nobile, Tempone, and Webster 2008b) of a diffusion coefficient with one-dimensional (layered) spatial dependence (see (Adcock et al. 2020) for more details).
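The following sketch (ours) encodes the two diffusion components a₁ and a₂ as written above and samples them to check the uniform bounds a_{j,min} ≤ a_j ≤ a_{j,max} numerically. The parameters L and ζ appearing in a₂ are not specified in this summary, so the values used below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical values for the parameters of the layered coefficient a2.
L_corr, zeta = 1.0, 0.1

def a1(x, y):
    """a1(x, y) = 3 + x1*y1 + x2*y2."""
    return 3.0 + x[..., 0] * y[..., 0] + x[..., 1] * y[..., 1]

def a2(x, y):
    """a2(x, y) = exp(1 + y1*(sqrt(pi)*L/2)^(1/2) + zeta*sin(pi*x1)*y2)."""
    return np.exp(1.0 + y[..., 0] * np.sqrt(np.sqrt(np.pi) * L_corr / 2.0)
                  + zeta * np.sin(np.pi * x[..., 0]) * y[..., 1])

# Sample-based check of the uniform bounds a_{j,min} <= a_j <= a_{j,max}
# over Omega = (0,1)^2 and the first two parameters of U = [-1,1]^d.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=(10000, 2))
y = rng.uniform(-1.0, 1.0, size=(10000, 2))
for a in (a1, a2):
    vals = a(x, y)
    print(a.__name__, float(vals.min()), float(vals.max()))
```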
We next define the Hilbert space

\[ H(\mathrm{div}; \Omega) := \{ \tau \in [L^2(\Omega)]^2 : \mathrm{div}(\tau) \in L^2(\Omega) \}, \]

and, similar to the analysis of (Gatica 2014, Section 4.2), we introduce σ(y) = ∇u(y) ∈ H(div; Ω) as an additional unknown. Due to limitations of space, we now simply state the mixed formulation and refer to (Gatica 2014, Section 4) for its well-posedness. In this way, defining H = [L²(Ω)]² and

\[ Q = \{ \tau \in [H(\mathrm{div}; \Omega)]^2 : \tau \cdot n = 0 \text{ on } \Gamma_N \}, \]

one arrives at the mixed variational formulation: given y ∈ U, find (u(y), σ(y)) ∈ H × Q such that

\[ \begin{aligned} \langle \sigma, \tau \rangle_{L^2(\Omega)} + \langle u, \nabla \cdot \tau \rangle_{L^2(\Omega)} &= \langle \tau \cdot n, h \rangle_{\Gamma_D}, \\ \langle \nabla K \cdot \sigma, v \rangle_{L^2(\Omega)} + \langle K v, \nabla \cdot \sigma \rangle_{L^2(\Omega)} &= -\langle f, v \rangle_{L^2(\Omega)}, \end{aligned} \tag{7} \]

for all (v, τ) ∈ H × Q. Note that, in this case, V = L²(Ω) × H(div; Ω).

To discretize this problem, we consider a uniform discretization, i.e. a finite-dimensional subspace in which the usual [P₁]²-Lagrange finite elements are used for H and the Raviart–Thomas elements [RT₂]² are used for Q. Furthermore, given y ∈ U, we define the Finite Element (FE) and DNN approximations of u(y) = [u₁, u₂]^⊤(y) by

\[ u_{h,j}(y) = \sum_{k=1}^{K_1} c_{j,k}(y)\, \varphi_k, \qquad u_{\Phi,h,j}(y) = \sum_{k=1}^{K_1} (\Phi(y))_{k,j}\, \varphi_k, \]

for j ∈ {1, 2} respectively, where (φ_k)_{k=1}^{K₁} is a basis for P₁. Similarly, defining basis functions for the RT₂ spaces as (φ̃_k)_{k=1}^{K₂}, we can analogously define the FE and DNN approximations of σ. We omit further details.

We consider different types of DNN architectures to approximate the coefficients of the finite element basis. In this study, we use the same finite element discretization in generating the training and testing data. We consider finite element discretizations on a mesh with h = √2/32, giving a total of 22,914 degrees of freedom.

For comparison, we consider both the Leaky ReLU and hyperbolic tangent activation functions. To train the DNN, we use the Adam optimizer along with an exponentially decaying learning rate, minimizing the standard ℓ²-mean squared error. The training data for the DNN is generated by solving (7) at a set of uniform random points {y_i}_{i=1}^m ⊆ U, with exact solution given by

\[ \begin{aligned} u_1(x, y) &= \sin(\pi(y_1 + y_2)) \big( \cos(\pi x_2) \exp(x_1^2 - 1) + \cos(\pi x_1) \big), \\ u_2(x, y) &= \cos(\pi(y_1 + y_2)) \big( \cos(\pi x_2) \exp(x_1^2 - 1) + \sin(\pi x_1) \big). \end{aligned} \]

Fig. 3 shows the results of training with different loss functions. Generally speaking, networks using Leaky ReLU achieve a smaller training loss than those using the hyperbolic tangent. Moreover, the numerical results show that deeper and wider networks are faster to train, meaning they can achieve a smaller training loss after a fixed number of epochs. On the other hand, it is interesting to see that the test errors are not so different for the different architectures, with the two best-performing Leaky ReLU and hyperbolic tangent networks giving similar results. Interestingly, even though separate DNNs are trained for each (due to the mixed formulation), the test errors for u₁ are substantially smaller (by several orders of magnitude) than those for its gradient ∇u₁.

Figure 3: Performance of different DNN architectures for approximating the mixed parametric PDE problem, using a batch size of m/2, L hidden layers and N nodes per layer with L/N = 0.1. (Left) Comparison of the training ℓ² loss function on m = 200 uniform random samples. (Right) Testing error in the Bochner norm of the function u_{Φ,h,1} and the error norm of its gradient σ_{Φ,h,1} (square and triangle markers respectively).

In addition to this, we also plot the best approximate solutions for u and σ produced by the DNN. Figure 4 shows the approximations achieved by a DNN trained on m = 200 samples to error 6.3699 × 10⁻⁶ in the loss after 2900 epochs of training, and evaluated at input parameter value y = [0, −0.707106]. It is noticeable that the approximation to ∇u₁ is slightly better than the approximation to ∇u₂. In Figure 5, we also compare the DNN solution for u₂ with the FE approximation.

Figure 4: Visualization of the DNN approximation with hyperbolic tangent activation function, L = 5 hidden layers, and N = 50 nodes per hidden layer. The colours show the approximation to u_{h,j}, and the white arrows show the approximation to σ_{h,j} (j = 1 left and j = 2 right).

Figure 5: Comparison of the output of the DNN from Fig. 4 to the approximation of u_{2,h}(y). Here the testing error is 6.7321 × 10⁻³; yellow/green colours show the FE approximation, and blue/red colours the DNN approximation. We omit a visualization of u₁ since it is indistinguishable from the FE approximation.
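A sketch (ours) of the training configuration described above for the mixed parametric PDE problem: a network mapping y ∈ U to the K₁ × 2 matrix of P₁ coefficients of (u₁, u₂), trained on the ℓ²-mean squared error with Adam, an exponentially decaying learning rate and a batch size of m/2. The architecture sizes, the decay schedule and the synthetic targets are hypothetical; in the paper the targets come from finite element solves of (7).

```python
import numpy as np
import tensorflow as tf

d, K1 = 2, 289          # hypothetical parametric dimension and P1 dimension
m = 200                 # number of training samples, as in Fig. 3/4

# Network mapping y to the K1 x 2 coefficient matrix of (u1, u2).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(d,)),
    *[tf.keras.layers.Dense(50, activation="tanh") for _ in range(5)],
    tf.keras.layers.Dense(K1 * 2),
    tf.keras.layers.Reshape((K1, 2)),
])

# Adam with an exponentially decaying learning rate (hypothetical schedule).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse")

# Synthetic stand-in for the FE coefficient data obtained by solving (7).
rng = np.random.default_rng(4)
Y = rng.uniform(-1.0, 1.0, size=(m, d)).astype("float32")
C = rng.standard_normal((m, K1, 2)).astype("float32")
model.fit(Y, C, epochs=5, batch_size=m // 2, verbose=0)  # batch size m/2
```

In the actual study a separate network is trained for the gradient variable σ, owing to the mixed formulation.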
5 Conclusion

Deep neural networks offer many advantages for efficiently learning scalar- and Hilbert-valued functions and, in particular, solutions to parametric PDE problems. Unlike virtually every polynomial-based approach, selection of a basis or dictionary is not necessary when applying DNNs. They also offer the capability to solve challenging UQ problems which possess discontinuities leading to low parametric regularity, problems for which methods such as sparse polynomial approximation via compressed sensing are poorly suited. In this work we presented a theoretical result which demonstrates the efficacy of the DNN approach for smooth function approximation. However, as observed in our first experiments and elaborated further in (Adcock and Dexter 2020), such theoretical rates may not be met in practice via standard architecture designs and training. This suggests a strong possibility to further modify DNN architectures and training procedures to achieve superior performance across a range of challenging high-dimensional problems, including those on which polynomials perform badly, e.g. nonsmooth functions. Finally, we also presented new results on DNN approximation of parametric PDEs with mixed boundary conditions. These results highlight the potential of the DNN approach, although further work is needed to tune the architectures to get optimal performance – including better approximation of the gradient – and to determine the method's efficacy, in particular, in comparison to current best-in-class schemes based on polynomial approximation.

References

Adcock, B.; Brugiapaglia, S.; Dexter, N.; and Moraga, S. 2020. Deep Neural Networks Are Effective At Learning High-Dimensional Hilbert-Valued Functions From Limited Data. Preprint.
Adcock, B.; Brugiapaglia, S.; and Webster, C. G. 2017. Compressed sensing approaches for polynomial approximation of high-dimensional functions. In Boche, H.; Caire, G.; Calderbank, R.; März, M.; Kutyniok, G.; and Mathar, R., eds., Compressed Sensing and its Applications: Second International MATHEON Conference 2015, Applied and Numerical Harmonic Analysis, 93–124. Cham: Birkhäuser.
Adcock, B.; and Dexter, N. 2020. The gap between theory and practice in function approximation with deep neural networks. arXiv:2001.07523.
Adcock, B.; and Hansen, A. C. 2021. Compressive Imaging: Structure, Sampling, Learning. Cambridge: Cambridge University Press (in press).
Antun, V.; Renna, F.; Poon, C.; Adcock, B.; and Hansen, A. C. 2020. On instabilities of deep learning in image reconstruction and the potential costs of AI. Proceedings of the National Academy of Sciences.
Babuška, I.; Tempone, R.; and Zouraris, G. E. 2004. Galerkin Finite Element Approximations of Stochastic Elliptic Partial Differential Equations. SIAM J. Numer. Anal. 42(2): 800–825.
Berg, J.; and Nyström, K. 2018. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing 317: 28–41.
Chkifa, A.; Cohen, A.; Migliorati, G.; Nobile, F.; and Tempone, R. 2015. Discrete least squares polynomial approximation with random evaluations – application to parametric and stochastic elliptic PDEs. ESAIM Math. Model. Numer. Anal. 49(3): 815–837.
Chkifa, A.; Cohen, A.; and Schwab, C. 2015. Breaking the curse of dimensionality in sparse polynomial approximation of parametric PDEs. J. Math. Pures Appl. 103(2): 400–428.
Cohen, A.; DeVore, R.; and Schwab, C. 2011. Analytic regularity and polynomial approximation of parametric and stochastic elliptic PDE's. Analysis and Applications 9(01): 11–47.
Cohen, A.; and DeVore, R. A. 2015. Approximation of high-dimensional parametric PDEs. Acta Numer.
Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2020. Robust Training and Initialization of Deep Neural Networks: An Adaptive Basis Viewpoint. In Lu, J.; and Ward, R., eds., Proceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107 of Proceedings of Machine Learning Research, 512–536. Princeton University, Princeton, NJ, USA: PMLR.
Dal Santo, N.; Deparis, S.; and Pegolotti, L. 2020. Data driven approximation of parametrized PDEs by Reduced Basis and Neural Networks. J. Comput. Phys. 416: 109550.
Dexter, N.; Tran, H.; and Webster, C. 2019. A mixed ℓ¹ regularization approach for sparse simultaneous approximation of parameterized PDEs. ESAIM Math. Model. Numer. Anal. 53: 2025–2045.
Doostan, A.; and Owhadi, H. 2011. A non-adapted sparse approximation of PDEs with stochastic inputs. J. Comput. Phys. 230(8): 3015–3034.
Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; and von Lilienfeld, O. A. 2017. Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error. Journal of Chemical Theory and Computation 13(11): 5255–5264.
Gatica, G. 2014. A Simple Introduction to the Mixed Finite Element Method: Theory and Applications. SpringerBriefs in Mathematics. Springer, Cham.
Geist, M.; Petersen, P.; Raslan, M.; Schneider, R.; and Kutyniok, G. 2020. Numerical Solution of the Parametric Diffusion Equation by Deep Neural Networks. arXiv:2004.12131.
Gottschling, N. M.; Antun, V.; Adcock, B.; and Hansen, A. C. 2020. The troublesome kernel: why deep learning for inverse problems is typically unstable. arXiv:2001.01258.
Grohs, P.; Perekrestenko, D.; Elbrächter, D.; and Bölcskei, H. 2019. Deep Neural Network Approximation Theory. arXiv:1901.02220.
Hampton, J.; and Doostan, A. 2015. Compressive Sampling of Polynomial Chaos Expansions: Convergence Analysis and Sampling Strategies. J. Comput. Phys. 280: 363–386.
Khoo, Y.; Lu, J.; and Ying, L. 2020. Solving parametric PDE problems with artificial neural networks. European J. Appl. Math. (in press).
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Kutyniok, G.; Petersen, P.; Raslan, M.; and Schneider, R. 2020. A Theoretical Analysis of Deep Neural Networks and Parametric PDEs. arXiv:1904.00377.
Laakmann, F.; and Petersen, P. 2020. Efficient approximation of solutions of parametric linear transport equations by ReLU DNNs. arXiv:2001.11441.
Nobile, F.; Tempone, R.; and Webster, C. G. 2008a. A sparse grid stochastic collocation method for partial differential equations with random input data. SIAM J. Numer. Anal. 46(5): 2309–2345.
Nobile, F.; Tempone, R.; and Webster, C. G. 2008b. A sparse grid stochastic collocation method for partial differential equations with random input data. SIAM J. Numer. Anal. 46(5): 2309–2345.
Ongie, G.; Jalal, A.; Metzler, C. A.; Baraniuk, R. G.; Dimakis, A. G.; and Willett, R. 2020. Deep Learning Techniques for Inverse Problems in Imaging.
Opschoor, J. A. A.; Schwab, C.; and Zech, J. 2019. Exponential ReLU DNN expression of holomorphic maps in high dimension. Technical Report 35, ETH Zürich.
Petersen, P.; and Voigtlaender, F. 2018. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108: 296–330.
Smith, R. 2013. Uncertainty Quantification: Theory, Implementation, and Applications. Computational Science and Engineering. SIAM.
van den Berg, E.; and Friedlander, M. P. 2009. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput. 31(2): 890–912.
van den Berg, E.; and Friedlander, M. P. 2019. SPGL1: A solver for large-scale sparse reconstruction. https://friedlander.io/spgl1.