Learning High-Dimensional Hilbert-Valued Functions With Deep Neural Networks From Limited Data

Ben Adcock¹, Simone Brugiapaglia², Nick Dexter¹, Sebastian Moraga¹
¹ Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
² Department of Mathematics & Statistics, Concordia University, Montréal, QC H3G 1M8, Canada
ben_adcock@sfu.ca, simone.brugiapaglia@concordia.ca, nicholas_dexter@sfu.ca, smoragas@sfu.ca

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Deep learning (DL) techniques have been successful in many applications, with the most impressive results achieved on problems where the dimension of the underlying data or problem domain is large. In this paper, we describe recent results on both scalar- and Hilbert-valued function approximation via Deep Neural Networks (DNNs). This problem arises in many engineering problems, in particular those involving the solution of parametric Partial Differential Equations (PDEs). Such problems are challenging, since point-wise samples are expensive to acquire and the function is usually high-dimensional. First, we consider a DNN architecture and training procedure for which the resulting DNN is guaranteed to perform as well as current best-in-class schemes for holomorphic, scalar- or Hilbert-valued functions based on polynomial approximations. This result demonstrates the efficacy of DL for this problem, and makes explicit the effect of all sources of error, including the discretization error of the underlying Hilbert space and the measurement error. Second, we provide several numerical results illustrating the performance of DNNs on both real-valued functions and solutions of parametric PDEs. These results suggest that better approximations can be achieved through careful tuning of the DNN architecture and training algorithm.

1 Introduction

Modern machine learning approaches based on training DNNs on large datasets have achieved impressive results on high-dimensional problems in several important scientific computing applications, including inverse problems in imaging (Adcock and Hansen 2021; Ongie et al. 2020), molecular dynamics simulations (Faber et al. 2017), partial differential equations (PDEs) (Berg and Nyström 2018), and parameterized PDEs for uncertainty quantification (UQ) in engineering (Cyr et al. 2020; Dal Santo, Deparis, and Pegolotti 2020; Geist et al. 2020; Khoo, Lu, and Ying 2020; Laakmann and Petersen 2020). Recent theoretical results on DNN expressivity suggest that such architectures are capable of approximating a wide class of functions by emulating existing approximation schemes, e.g., polynomials, wavelets, or free-knot splines. For smooth functions, exponential rates of convergence have been shown for fully-connected ReLU DNN architectures (Opschoor, Schwab, and Zech 2019).

While these results are impressive, many important questions about the application of DL techniques to problems in scientific computing remain. The work (Adcock and Dexter 2020) raised key issues regarding current methods of training which prevent DNNs from practically achieving high-accuracy approximations on smooth function approximation tasks. It was also demonstrated in (Antun et al. 2020; Gottschling et al. 2020) that DL models trained to solve inverse problems in imaging can fail to recover small structural changes in images, and that small perturbations can lead to numerical artifacts. In other words, DNNs can lead to unstable methods unless they are carefully constructed. Hence, in order for these tools to be accepted and applied to critical tasks in scientific computing with strict error tolerances, e.g., UQ problems in mechanical engineering, it is necessary to develop theoretical foundations on the stability and convergence of DNNs, as well as an understanding of how to train them to ensure these properties.

In this work we showcase results on learning high-dimensional scalar- and Hilbert-valued functions from limited datasets with DNNs. While most previous studies have considered scalar-valued functions, Hilbert-valued functions arise in various important applications, for example UQ problems involving PDEs.
Such problems are typically expressed in terms of a differential operator D_x, with x comprising the spatial and temporal variables, and the goal is to find, for all y ∈ U ⊆ R^d, a solution u ∈ V, with V a Hilbert or Banach space, satisfying

\[ D_x(u, y) = 0, \tag{1} \]

subject to suitable boundary conditions. Existing techniques for solving such problems achieve a discretization of the infinite-dimensional parameter-to-solution map y ↦ u(y) ∈ V by combining spatial discretization, e.g., finite element or finite difference methods, with global polynomial approximation, achieved through Galerkin projection (Babuška, Tempone, and Zouraris 2004), hierarchical interpolation on sparse grids (Nobile, Tempone, and Webster 2008a), discrete least squares (Chkifa et al. 2015), or compressive sensing-based approaches (Adcock, Brugiapaglia, and Webster 2017; Dexter, Tran, and Webster 2019; Doostan and Owhadi 2011; Hampton and Doostan 2015).

Challenges

The large underlying dimension of problems of the form (1) presents a major challenge in practical applications. The curse of dimensionality states that the amount of data and computational effort needed to construct an approximation may grow exponentially with the problem dimension. In many real-world problems arising in scientific computing and UQ the parametric dimension d ≫ 1, as the parameters are used to model various material properties, forcing terms, and boundary information. In such settings, naïve application of the aforementioned techniques can lead to computationally inefficient schemes or suboptimal approximations with respect to the sample complexity, i.e., the number of samples required to construct the approximation.

Previous work

DNNs have recently been studied for these problems in (Kutyniok et al. 2020), where, under assumptions guaranteeing low-dimensionality of the solution manifold, fast, dimension-independent rates of convergence were established by connecting DNNs to reduced basis approximations and the theory of Kolmogorov N-widths. The work (Geist et al. 2020) numerically tested the accuracy of DNNs in approximating solutions to the parametric diffusion equation over a range of dimensions, finding that errors scale with the inherent complexity of the problem as the dimension increases, but that the observed scaling is not exponential. Other works such as (Cyr et al. 2020; Dal Santo, Deparis, and Pegolotti 2020; Khoo, Lu, and Ying 2020; Laakmann and Petersen 2020) have also demonstrated the potential of the DNN approach for parametric PDEs.

Further, it is hoped that DNNs may overcome some of the limitations of standard polynomial-based methods. These methods require strong smoothness assumptions on the underlying parameterized PDE problem in order to show fast rates of convergence. However, problems typically found in engineering or science applications rarely satisfy all of these assumptions (Smith 2013). On the other hand, DNNs have been shown to achieve optimal rates of approximation for piecewise smooth functions (Petersen and Voigtlaender 2018), suggesting far greater flexibility than the polynomial-based approach. Numerical studies have also demonstrated the efficiency of DNNs in approximating discontinuous (scalar-valued) functions (Adcock and Dexter 2020), whereas polynomial-based methods fail to converge.

This work

The purpose of this work is to describe recent results (Adcock and Dexter 2020; Adcock et al. 2020) on DNN approximation of scalar- and Hilbert-valued functions and their application to parametric PDEs. Although DNNs have shown some promise for discontinuous functions, in this work we focus on the smooth case, where a direct comparison with polynomial-based methods is valid. In particular, we focus on the following question: to what extent (both theoretically and empirically) can DNN approaches outperform current best-in-class polynomial-based methods?

2 Preliminaries

We first require some notation. For d ∈ N we write N_0^d := {ν = (ν_k)_{k=1}^d : ν_k ∈ N_0} for the set of nonnegative integer multi-indices. The inequality µ ≤ ν is understood componentwise. We write 0 and 1 for the multi-indices consisting of all zeros and all ones respectively. We also use the notation A ≲ B to mean that there exists a numerical constant c > 0 independent of A and B such that A ≤ cB, and likewise for A ≳ B.

Setup

Throughout, we let V be a separable Hilbert space over the field R, with inner product ⟨·,·⟩_V and corresponding norm ‖v‖_V = √⟨v, v⟩_V. We let V^N be the vector space of Hilbert-valued vectors of length N, i.e. u = (u_i)_{i=1}^N where u_i ∈ V, i = 1, ..., N. More generally, let Λ ⊆ N_0^d denote a (possibly infinite) multi-index set. We write v = (v_ν)_{ν∈Λ} for a sequence with V-valued entries, v_ν ∈ V. We define the space ℓ²(Λ; V) as the set of those sequences for which the corresponding norm

\[ \|v\|_{\mathcal{V},2} := \Bigg( \sum_{\nu \in \Lambda} \|v_\nu\|_{\mathcal{V}}^2 \Bigg)^{1/2} < \infty. \]

For the parametric domain, we consider the hypercube equipped with the uniform probability measure

\[ \mathcal{U} = [-1, 1]^d, \qquad \mathrm{d}\varrho(y) = 2^{-d}\,\mathrm{d}y. \tag{2} \]

For 1 ≤ p ≤ ∞, we write L^p_ϱ(U) for the corresponding Lebesgue spaces and ‖·‖_{L^p_ϱ(U)} for their norms. Next, we define the Bochner space L^p_ϱ(U; V) as the space consisting of (equivalence classes of) strongly ϱ-measurable functions f : U → V for which

\[ \|f\|_{L^p_\varrho(\mathcal{U};\mathcal{V})} := \begin{cases} \left( \int_{\mathcal{U}} \|f(y)\|_{\mathcal{V}}^p \, \mathrm{d}\varrho(y) \right)^{1/p}, & 1 \le p < \infty, \\ \operatorname{ess\,sup}_{y \in \mathcal{U}} \|f(y)\|_{\mathcal{V}}, & p = \infty. \end{cases} \]

Generally speaking, one cannot deal directly with an infinite-dimensional output space V. Hence, we introduce a discretization (e.g. a finite element discretization). This is a finite-dimensional subspace V_h ⊂ V of dimension K = dim(V_h). We let P_h : V → V_h be the orthogonal projection onto V_h and {φ_k}_{k=1}^K be a (not necessarily orthogonal) basis of V_h. Moreover, for f ∈ L²_ϱ(U; V) we let P_h f ∈ L²_ϱ(U; V_h) be the function defined almost everywhere as

\[ (P_h f)(y) = P_h(f(y)), \qquad \forall y \in \mathcal{U}. \]
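To make the discretized Hilbert-valued setting concrete, here is a small illustrative sketch (ours, not code from the paper): for an element v = Σ_k c_k φ_k of V_h with a not-necessarily-orthogonal basis, the norm ‖v‖_V reduces to a weighted Euclidean norm of the coefficient vector via the Gram matrix G_{jk} = ⟨φ_j, φ_k⟩_V, and the ‖·‖_{V,2} norm of a vector of such elements follows from it. The basis dimension and Gram matrix below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical Gram matrix G[j, k] = <phi_j, phi_k>_V of a non-orthogonal
# basis {phi_k} of a K-dimensional subspace V_h (placeholder values).
K = 3
G = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])

def v_norm(c, G):
    """V-norm of v = sum_k c[k] * phi_k, i.e. sqrt(c^T G c)."""
    return np.sqrt(c @ G @ c)

def v2_norm(C, G):
    """'V,2'-type norm of a vector (v_1, ..., v_N) of V_h elements,
    where row C[i] holds the coefficients of v_i."""
    return np.sqrt(sum(v_norm(c, G) ** 2 for c in C))

# Example: two elements of V_h with random coefficients.
rng = np.random.default_rng(0)
C = rng.standard_normal((2, K))
print(v_norm(C[0], G), v2_norm(C, G))
```

For an orthonormal basis G is the identity and both quantities reduce to ordinary Euclidean norms of the coefficients.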
Holomorphy and polynomial approximation

The focus of this work is the approximation of smooth functions f : U → V. Note that f may either be scalar-valued, e.g. V = R, or Hilbert-valued, in which case V is a separable, but potentially infinite-dimensional, Hilbert space. By smooth, we mean that f has a holomorphic extension to a suitable complex region O of the form U ⊆ O ⊆ C^d. Specifically, we take these regions to be Bernstein polyellipses. Given parameters ρ = (ρ_j)_{j=1}^d with ρ_j > 1, these are of the form

\[ \mathcal{E}_{\rho} = \mathcal{E}_{\rho_1} \times \cdots \times \mathcal{E}_{\rho_d}, \]

where, for ρ > 1, E_ρ = { ½(z + z⁻¹) : z ∈ C, 1 ≤ |z| ≤ ρ } ⊂ C is the classical Bernstein ellipse. Note that Bernstein polyellipses, much like classical Bernstein ellipses, are intimately connected with multivariate polynomial approximation (Chkifa, Cohen, and Schwab 2015; Cohen and DeVore 2015; Cohen, DeVore, and Schwab 2011).

To be precise, in this work we consider the classes of functions HA = HA(γ, ε, d), where γ > 0 and ε > 0 are constants. This class consists of functions f : U → V that have a holomorphic extension to a Bernstein polyellipse E_ρ and satisfy ‖f‖_{L^∞(E_ρ;V)} ≤ 1, and where the parameters ρ = (ρ_j)_{j=1}^d also satisfy

\[ \left( \frac{d!}{(d+1)^{1+\varepsilon}} \prod_{j=1}^{d} \log(\rho_j) \right)^{1/d} \ge \gamma. \tag{3} \]

The motivation for this definition is the following. For any f ∈ HA(γ, ε, d) its best s-term polynomial approximation f_s in L²_ϱ-orthogonal polynomials (normalized tensor Legendre polynomials) decays exponentially fast in s: namely,

\[ \| f - f_s \|_{L^2_\varrho(\mathcal{U};\mathcal{V})} \le \exp\big( -\gamma s^{1/d} \big), \qquad \forall s \ge \bar{s}(d, \varepsilon, \rho), \tag{4} \]

where s̄(d, ε, ρ) is a constant (Opschoor, Schwab, and Zech 2019). This is the rate we seek to obtain with a DNN approximation.
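As an illustration of the definitions above (ours, not from the paper), the following sketch traces the boundary of a Bernstein ellipse E_ρ via w = (z + z⁻¹)/2 with |z| = ρ, and evaluates the left-hand side of condition (3) as reconstructed above for a given parameter vector ρ; the values of ρ and ε below are hypothetical.

```python
import numpy as np
from math import factorial

def bernstein_ellipse_boundary(rho, n=200):
    """Points on the boundary of the Bernstein ellipse E_rho:
    w = (z + 1/z)/2 with z = rho * exp(i*theta)."""
    theta = np.linspace(0.0, 2.0 * np.pi, n)
    z = rho * np.exp(1j * theta)
    return 0.5 * (z + 1.0 / z)

def holomorphy_constant(rho, eps):
    """Left-hand side of condition (3):
    ( d!/(d+1)^(1+eps) * prod_j log(rho_j) )^(1/d)."""
    rho = np.asarray(rho, dtype=float)
    d = rho.size
    val = factorial(d) / (d + 1) ** (1 + eps) * np.prod(np.log(rho))
    return val ** (1.0 / d)

# Hypothetical parameters: d = 4, all rho_j = 3, eps = 1.
rho = [3.0, 3.0, 3.0, 3.0]
print(holomorphy_constant(rho, eps=1.0))     # compare against a target gamma
print(bernstein_ellipse_boundary(3.0)[:3])   # a few boundary points of E_3
```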
Problem statement

Let f : U → V be a holomorphic and potentially Hilbert-valued function. We assume m sample points {y_i}_{i=1}^m are drawn independently and identically according to the uniform measure ϱ. For each y_i, we suppose an approximate value of f(y_i) is computed in the space V_h. Hence, the measurements of f are

\[ d_i = f(y_i) + n_i \in \mathcal{V}_h, \qquad i = 1, \ldots, m. \tag{5} \]

Note that this encompasses the fact that, for example in parametric PDEs, f(y_i) is computed via a black-box numerical routine (e.g. a numerical PDE solve) that must, of course, return a value in a finite-dimensional space. Thus the measurement error term n_i contains the error resulting from the numerical solver.

Our goal is to approximate f from the measurements {d_i}_{i=1}^m using a DNN. Write the projection P_h f in terms of the basis {φ_k}_{k=1}^K as

\[ f \approx (P_h f)(y) = \sum_{k=1}^{K} c_k(y)\, \varphi_k. \]

We aim to approximate the vector-valued map R^d → R^K, y ↦ (c_k(y))_{k=1}^K, with a DNN Φ̂ : R^d → R^K. We consider feedforward DNN architectures, i.e.

\[ \hat{\Phi}(y) = A_{L+1}\big( \rho( A_L( \rho( \cdots \rho( A_0(y) ) \cdots ) ) ) \big), \qquad L \ge 1, \]

where ρ is the activation function and the A_l are affine maps. Given such a DNN Φ, the resulting approximation to f is

\[ f(y) \approx f_{\Phi}(y) = \sum_{k=1}^{K} (\Phi(y))_k\, \varphi_k \in \mathcal{V}_h. \]

Note that when V = R, i.e. f is scalar-valued, such considerations are unnecessary, since V_h = V = R and we simply have f(y) ≈ Φ(y).
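Since the experiments reported below use TensorFlow, the following Keras sketch (our illustration, not the authors' code) builds a fully connected feedforward network Φ : R^d → R^K and assembles f_Φ(y) = Σ_k (Φ(y))_k φ_k; the sizes d and K, the width and depth, and the tabulated basis values are all hypothetical placeholders.

```python
import numpy as np
import tensorflow as tf

d, K = 8, 16          # hypothetical parametric dimension and dim(V_h)
width, depth = 50, 5  # hypothetical architecture sizes

# Fully connected (feedforward) network Phi : R^d -> R^K.
hidden = [tf.keras.layers.Dense(width, activation="relu") for _ in range(depth)]
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(d,))] + hidden + [tf.keras.layers.Dense(K)]
)

# Hypothetical representation of the basis {phi_k}: values of each phi_k
# at n_x spatial points, one basis function per row.
n_x = 100
Phi_basis = np.random.default_rng(1).standard_normal((K, n_x)).astype("float32")

def f_Phi(y):
    """Evaluate f_Phi(y) = sum_k (Phi(y))_k phi_k at the spatial points."""
    c = model(np.atleast_2d(y).astype("float32"))   # shape (batch, K)
    return tf.matmul(c, Phi_basis)                  # shape (batch, n_x)

print(f_Phi(np.zeros(d)).shape)  # (1, n_x)
```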
3 Learning holomorphic, Hilbert-valued functions via DNNs

We now present a theoretical result demonstrating that DNNs can be learned to efficiently approximate holomorphic functions. Note that (Opschoor, Schwab, and Zech 2019) has previously shown that there is a DNN that achieves the same rate of approximation (4) as the best s-term polynomial approximation. In the following result, we show that such a DNN can also be learned efficiently from the data (5) through a standard training strategy. Due to space limitations we omit some details. The full details and proof can be found in (Adcock et al. 2020).

Theorem 3.1. Let m ≥ 2 and

\[ \tilde{m} := \frac{c\, m}{\log^2(m) \min\{ \log(m) + d,\; \log(m)\log(2d) \}} \]

be such that m̃ ≥ 2d, and let {y_i}_{i=1}^m be drawn randomly and independently from ϱ. Then, with high probability, there is

• a class of neural networks N whose size, maximum depth and number of trainable parameters are at most polynomial in m̃ (the rate of growth can be specified, along with the dependence on d),
• a regularization functional J : N → [0, ∞) based on a certain norm of the trainable parameters,

such that for every f ∈ HA(γ, ε, d) with noisy evaluations (5), any minimizer Φ̂ of the regularized training problem

\[ \min_{\Phi \in \mathcal{N}} \; \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \| f_{\Phi}(y_i) - d_i \|_{\mathcal{V}}^2 } \; + \; \lambda\, \mathcal{J}(\Phi) \tag{6} \]

satisfies, for e = (n_i)_{i=1}^m / √m and for m large enough,

\[ \| f - f_{\hat{\Phi}} \|_{L^2_\varrho(\mathcal{U};\mathcal{V})} \;\lesssim\; \underbrace{\exp\big( -\gamma\, \tilde{m}^{1/(2d)} / 2 \big)}_{(a)} \; + \; \underbrace{\| e \|_{\mathcal{V},2}}_{(b)} \; + \; \underbrace{\| f - P_h(f) \|_{L^\infty_\varrho(\mathcal{U};\mathcal{V})}}_{(c)}. \]

Many recent works have shown that DNNs can efficiently approximate different classes of functions – see, e.g., (Adcock and Dexter 2020; Petersen and Voigtlaender 2018; Opschoor, Schwab, and Zech 2019; Grohs et al. 2019) and references therein. Such results are of the form of existence theorems, i.e. they assert the existence of a DNN of a given architecture with favourable approximation properties – for example, as noted above, (Opschoor, Schwab, and Zech 2019) shows the existence of a DNN with the same approximation rates as the best s-term polynomial approximation – but provide no constructive means for learning it, nor an estimate on the number and type of samples needed to do so. We term Theorem 3.1 a practical existence theorem, since it asserts not only the existence of such a DNN, but also that it can be trained. Moreover, through the first term (a) in the error bound, which is exponentially small in m̃, it shows that the DNN can be trained in a sample-efficient way. It shows that such a DNN achieves the same error as the best s-term polynomial approximation, with a sample complexity that scales quadratically in s. This is, in fact, exactly the same sample complexity as achieved by polynomial-based compressed sensing methods for holomorphic function approximation (Adcock, Brugiapaglia, and Webster 2017; Dexter, Tran, and Webster 2019). Hence, (a) asserts that the DNN procedure achieves the same approximation rate in terms of m̃ as current best-in-class schemes.

This aside, Theorem 3.1 also makes explicit the effect of the three main errors in the training process. First, the approximation error (a), which depends on the number of samples m̃. Second, the sampling error (b), i.e. the error in the numerical PDE solve. And third, the discretization error (c). This error is due to working in the finite-dimensional space V_h rather than V. However, the key point is that it is proportional to the error of the orthogonal projection P_h(f), i.e. the best approximation to f from V_h.
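A minimal sketch (ours) of a training objective with the structure of (6): a square-root ℓ²-type data-fidelity term over the samples plus λ times a norm of the trainable parameters. Theorem 3.1 only describes the regularizer J abstractly, so the ℓ¹-norm of the weights used here is a hypothetical stand-in, as are λ, the network and the synthetic data; the V_h-norm is identified with the Euclidean norm of the coefficients, which is exact only for a V-orthonormal basis.

```python
import numpy as np
import tensorflow as tf

d, K, m = 8, 16, 32   # hypothetical sizes
lam = 1e-3            # hypothetical regularization parameter

model = tf.keras.Sequential([tf.keras.Input(shape=(d,)),
                             tf.keras.layers.Dense(50, activation="relu"),
                             tf.keras.layers.Dense(K)])

def regularized_loss(y_batch, d_batch):
    """sqrt((1/m) * sum_i ||f_Phi(y_i) - d_i||^2) + lam * J(Phi).
    The V_h-norm is taken as the Euclidean norm of the coefficient
    difference (exact if {phi_k} is V-orthonormal), and J is the l1-norm
    of the weights -- a stand-in for the abstract regularizer."""
    pred = model(y_batch)
    data_term = tf.sqrt(
        tf.reduce_mean(tf.reduce_sum((pred - d_batch) ** 2, axis=-1)))
    reg = tf.add_n([tf.reduce_sum(tf.abs(w)) for w in model.trainable_variables])
    return data_term + lam * reg

rng = np.random.default_rng(2)
y = rng.uniform(-1.0, 1.0, size=(m, d)).astype("float32")
dvals = rng.standard_normal((m, K)).astype("float32")  # stand-in for the data d_i
print(float(regularized_loss(y, dvals)))
```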
4 Numerical experiments

This result demonstrates the potential efficacy of the DNN approach for Hilbert-valued function approximation. Namely, the DNN approach can perform as well as current best-in-class schemes based on compressed sensing for holomorphic, Hilbert-valued function approximation. However, we caution that the training procedure outlined in Theorem 3.1 is not expected to lead to DNNs that perform any better. Its proof heavily leverages the relation between DNNs and polynomials, and between loss function minimization and standard optimization problems for compressed sensing. On the other hand, the generality of fully-connected DNN architectures allows them to be applied to problems where polynomial approximations are known to fail, e.g. functions with discontinuities or low parametric regularity. This motivates the need for numerical experimentation, which compares the practical performance of DNN training with standard architectures and loss functions to other schemes such as polynomial-based compressed sensing.

Scalar-valued function approximation

In this section, we consider scalar-valued function approximation. Our purpose is to compare standard DNN training to best-in-class compressed sensing schemes. For further details of the study, and for additional experiments, see (Adcock and Dexter 2020). We focus on two factors: first, the sample complexity, i.e. the behaviour of the error as m varies, and second, the effect of noise in the measurements.

We consider ReLU DNN architectures with a fixed number of nodes N per layer and depth L. We vary the ratio β := L/N to try to obtain the best approximation. We use TensorFlow and perform calculations in single precision. For training, we use the standard ℓ²-loss function and the Adam optimizer (Kingma and Ba 2014). We train for a maximum of 50,000 epochs or until the tolerance ε_tol = 5 × 10⁻⁷ is reached. The weights and biases are initialized as normal random variables with mean 0 and variance 0.01. The compressed sensing schemes follow the setup of (Adcock, Brugiapaglia, and Webster 2017), and are based on Legendre polynomials in a hyperbolic cross index set, combined with either ℓ¹-minimization or weighted ℓ¹-minimization (the latter being known to offer improved performance). These experiments are performed in double precision in Matlab using the SPGL1 solver (van den Berg and Friedlander 2019, 2009). The approximation error is taken as the relative L²(U) error, and is computed using a high-order isotropic Clenshaw–Curtis sparse grid quadrature rule consisting of roughly 1.9 × 10⁶ points.

In Fig. 1, we consider two different smooth, scalar-valued functions:

\[ f_1(y) = \exp\left( -\frac{\sum_{k=1}^{d} \cos(y_k)}{8d} \right), \qquad f_2(y) = \left( \frac{\prod_{k=1}^{d/2} \big( 1 + 4^{k} y_k^2 \big)}{\prod_{k=d/2+1}^{d} \big( 100 + 5 y_k \big)} \right)^{1/d}. \]

Figure 1: Error approximating (left) f₁(y) and (right) f₂(y) with d = 8.
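A sketch (ours, under the setup described above) of the two test functions and of the DNN training configuration: a ReLU network with N nodes per layer and L hidden layers, weights and biases initialized as N(0, 0.01) variables, trained on the ℓ²-loss with Adam. The particular width, depth, learning rate and sample size below are hypothetical choices, and the 50,000-epoch / 5 × 10⁻⁷-tolerance stopping rule is only indicated in a comment.

```python
import numpy as np
import tensorflow as tf

d = 8

def f1(y):
    return np.exp(-np.sum(np.cos(y), axis=-1) / (8 * d))

def f2(y):
    num = np.prod(1.0 + 4.0 ** np.arange(1, d // 2 + 1) * y[..., : d // 2] ** 2, axis=-1)
    den = np.prod(100.0 + 5.0 * y[..., d // 2:], axis=-1)
    return (num / den) ** (1.0 / d)

# Training data: m points drawn uniformly from U = [-1, 1]^d (f2 analogous).
m = 500                      # hypothetical sample size
rng = np.random.default_rng(0)
Y = rng.uniform(-1.0, 1.0, size=(m, d)).astype("float32")
F = f1(Y).astype("float32").reshape(-1, 1)

# ReLU network with N nodes per layer and L hidden layers; weights and
# biases initialized as N(0, 0.01) variables, as described in the text.
N, L = 50, 5                 # hypothetical width/depth (beta = L/N = 0.1)
init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=np.sqrt(0.01))
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(d,))]
    + [tf.keras.layers.Dense(N, activation="relu",
                             kernel_initializer=init, bias_initializer=init)
       for _ in range(L)]
    + [tf.keras.layers.Dense(1, kernel_initializer=init, bias_initializer=init)]
)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
# The paper trains for up to 50,000 epochs or until the loss reaches
# 5e-7; a handful of epochs is used here just to keep the sketch short.
model.fit(Y, F, epochs=5, batch_size=m, verbose=0)
```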
This experiment demonstrates a gap between the practical existence theorem, Theorem 3.1, and the practical performance of DNNs when trained with standard architectures and optimizers. The first function is extremely smooth and therefore has approximately sparse polynomial coefficients. In spite of Theorem 3.1, Fig. 1 demonstrates that compressed sensing significantly outperforms the DL approach. On the other hand, the function f₂ is less favourably approximated by polynomials, and the DL approach achieves competitive performance with compressed sensing. This points towards the efficacy of DL for broader classes of functions. Note that both experiments were also completed over 30 alternative ReLU DNN architectures of varying depth and width, with similar results.

Next, in Fig. 2 we consider the effect of noise. These experiments suggest that the trained DNNs are numerically stable, with the noise contributing linearly to the overall error. Unlike the gap identified above in the approximation error, this behaviour is in good agreement with Theorem 3.1.

Figure 2: Errors of the two best-performing ReLU DNNs in approximating (left) f₁(y) and (right) f₂(y) in d = 8 dimensions, with and without N(0, σ²) noise for various σ.

Parametric PDEs with mixed boundary conditions

We conclude by applying DL to a parametric PDE problem. Specifically, we consider a non-trivial, two-dimensional Hilbert-valued function approximation problem arising as the solution of a parametric PDE with mixed boundary conditions. See, e.g., (Adcock et al. 2020; Geist et al. 2020) for parametric PDE problems with Dirichlet boundary conditions. Here, our objective is to examine the efficacy of the DNN approach on a non-standard parametric PDE.

Let Ω ⊆ R² be a given bounded domain with polyhedral boundary ∂Ω = Γ, such that Γ = Γ_D ∪ Γ_N and Γ_D ∩ Γ_N = ∅. We focus on the following parametric diffusion example with mixed boundary conditions:

\[ \begin{aligned} -\nabla \cdot \big( K(x, y) \nabla u(x, y) \big) &= f(x, y), && x \in \Omega, \; y \in \mathcal{U}, \\ u(x, y) &= h(x, y), && x \in \Gamma_D, \; y \in \mathcal{U}, \\ \nabla u(x, y) \cdot n &= 0, && x \in \Gamma_N, \; y \in \mathcal{U}, \end{aligned} \]

where, given y ∈ U, f(y) ∈ [L²(Ω)]² and h(y) ∈ [H^{1/2}(Γ_D)]², whose components are in the space of traces of functions in H¹(Ω). We define K = diag(a), where a = [a₁, a₂]^⊤. Then, to guarantee well-posedness and parametric regularity of the solution u, we require uniform boundedness assumptions on each component of K; see (Cohen, DeVore, and Schwab 2011) for more details. That is, there exist a_{j,max} ≥ a_{j,min} > 0 such that

\[ a_{j,\min} \le a_j(x, y) \le a_{j,\max}, \qquad \forall x \in \Omega, \; \forall y \in \mathcal{U}, \; j \in \{1, 2\}. \]

In our example, we set Ω = (0, 1)² and define the Dirichlet and Neumann boundaries as

\[ \Gamma_D = \{ x \in [0,1]^2 : x_1 = 0 \text{ or } x_1 = 1 \}, \qquad \Gamma_N = \{ x \in [0,1]^2 : x_2 = 0 \text{ or } x_2 = 1 \}. \]

Next, we define the uniformly positive tensor K with components

\[ a_1(x, y) = 3 + x_1 y_1 + x_2 y_2, \qquad a_2(x, y) = \exp\Big( 1 + y_1 \big( \tfrac{\sqrt{\pi}\, L}{2} \big)^{1/2} + \zeta \sin(\pi x_1)\, y_2 \Big). \]

Here, the first is a simple affine coefficient, and the second is a modification of the example from (Nobile, Tempone, and Webster 2008b) of a diffusion coefficient with one-dimensional (layered) spatial dependence (see (Adcock et al. 2020) for more details).
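The following sketch (ours) encodes the two diffusion components a₁ and a₂ as written above and samples them to check the uniform bounds a_{j,min} ≤ a_j ≤ a_{j,max} numerically. The parameters L and ζ appearing in a₂ are not specified in this summary, so the values used below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical values for the parameters of the layered coefficient a2.
L_corr, zeta = 1.0, 0.1

def a1(x, y):
    """a1(x, y) = 3 + x1*y1 + x2*y2."""
    return 3.0 + x[..., 0] * y[..., 0] + x[..., 1] * y[..., 1]

def a2(x, y):
    """a2(x, y) = exp(1 + y1*(sqrt(pi)*L/2)^(1/2) + zeta*sin(pi*x1)*y2)."""
    return np.exp(1.0 + y[..., 0] * np.sqrt(np.sqrt(np.pi) * L_corr / 2.0)
                  + zeta * np.sin(np.pi * x[..., 0]) * y[..., 1])

# Sample-based check of the uniform bounds a_{j,min} <= a_j <= a_{j,max}
# over Omega = (0,1)^2 and the first two parameters of U = [-1,1]^d.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=(10000, 2))
y = rng.uniform(-1.0, 1.0, size=(10000, 2))
for a in (a1, a2):
    vals = a(x, y)
    print(a.__name__, float(vals.min()), float(vals.max()))
```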
We next define the Hilbert space

\[ H(\mathrm{div}; \Omega) := \{ \tau \in [L^2(\Omega)]^2 : \mathrm{div}(\tau) \in L^2(\Omega) \}, \]

and, similar to the analysis of (Gatica 2014, Section 4.2), we introduce σ(y) = ∇u(y) ∈ H(div; Ω) as an additional unknown. Due to limitations of space, we now simply state the mixed formulation and refer to (Gatica 2014, Section 4) for its well-posedness. In this way, defining H = [L²(Ω)]² and

\[ Q = \{ \tau \in [H(\mathrm{div}; \Omega)]^2 : \tau \cdot n = 0 \text{ on } \Gamma_N \}, \]

one arrives at the mixed variational formulation: given y ∈ U, find (u(y), σ(y)) ∈ H × Q such that

\[ \begin{aligned} \langle \sigma, \tau \rangle_{L^2(\Omega)} + \langle u, \nabla \cdot \tau \rangle_{L^2(\Omega)} &= \langle \tau \cdot n, h \rangle_{\Gamma_D}, \\ \langle \nabla K \cdot \sigma, v \rangle_{L^2(\Omega)} + \langle K v, \nabla \cdot \sigma \rangle_{L^2(\Omega)} &= -\langle f, v \rangle_{L^2(\Omega)}, \end{aligned} \tag{7} \]

for all (v, τ) ∈ H × Q. Note that, in this case, V = L²(Ω) × H(div; Ω).

To discretize this problem, we consider a uniform discretization, i.e. a finite-dimensional subspace in which the usual [P₁]²-Lagrange finite elements are used for H and the Raviart–Thomas elements [RT₂]² are used for Q. Furthermore, given y ∈ U, we define the Finite Element (FE) and DNN approximations of u(y) = [u₁, u₂]^⊤(y) by

\[ u_{h,j}(y) = \sum_{k=1}^{K_1} c_{j,k}(y)\, \varphi_k, \qquad u_{\Phi,h,j}(y) = \sum_{k=1}^{K_1} (\Phi(y))_{k,j}\, \varphi_k, \]

for j ∈ {1, 2} respectively, where (φ_k)_{k=1}^{K₁} is a basis for P₁. Similarly, defining basis functions for the RT₂ spaces as (φ̃_k)_{k=1}^{K₂}, we can analogously define the FE and DNN approximations of σ. We omit further details.

We consider different types of DNN architectures to approximate the coefficients of the finite element basis. In this study, we use the same finite element discretization in generating the training and testing data. We consider finite element discretizations on a mesh with h = √2/32, giving a total of 22,914 degrees of freedom.

For comparison, we consider both the Leaky ReLU and hyperbolic tangent activation functions. To train the DNN, we use the Adam optimizer along with an exponentially decaying learning rate, minimizing the standard ℓ²-mean squared error. The training data for the DNN is generated by solving (7) at a set of uniform random points {y_i}_{i=1}^m ⊆ U, with exact solution given by

\[ \begin{aligned} u_1(x, y) &= \sin(\pi(y_1 + y_2)) \big( \cos(\pi x_2) \exp(x_1^2 - 1) + \cos(\pi x_1) \big), \\ u_2(x, y) &= \cos(\pi(y_1 + y_2)) \big( \cos(\pi x_2) \exp(x_1^2 - 1) + \sin(\pi x_1) \big). \end{aligned} \]

Fig. 3 shows the results of training with different loss functions. Generally speaking, networks using Leaky ReLU achieve a smaller training loss than those using the hyperbolic tangent. Moreover, the numerical results show that deeper and wider networks are faster to train, meaning they can achieve a smaller training loss after a fixed number of epochs. On the other hand, it is interesting to see that the test errors are not so different for the different architectures, with the two best-performing Leaky ReLU and hyperbolic tangent networks giving similar results. Interestingly, even though separate DNNs are trained for each (due to the mixed formulation), the test errors for u₁ are substantially smaller (by several orders of magnitude) than those for its gradient ∇u₁.

Figure 3: Performance of different DNN architectures for approximating the mixed parametric PDE problem, using a batch size of m/2, L hidden layers and N nodes per layer with L/N = 0.1. (Left) Comparison of the training ℓ² loss function on m = 200 uniform random samples. (Right) Testing error in the Bochner norm of the function u_{Φ,h,1} and the error norm of its gradient σ_{Φ,h,1} (square and triangle markers respectively).

In addition to this, we also plot the best approximate solutions for u and σ produced by the DNN. Figure 4 shows the approximations achieved by a DNN trained on m = 200 samples to error 6.3699 × 10⁻⁶ in the loss after 2900 epochs of training, and evaluated at input parameter value y = [0, −0.707106]. It is noticeable that the approximation to ∇u₁ is slightly better than the approximation to ∇u₂. In Figure 5, we also compare the DNN solution for u₂ with the FE approximation.

Figure 4: Visualization of the DNN approximation with hyperbolic tangent activation function, L = 5 hidden layers, and N = 50 nodes per hidden layer. The colours show the approximation to u_{h,j}, and the white arrows show the approximation to σ_{h,j} (j = 1 left and j = 2 right).

Figure 5: Comparison of the output of the DNN from Fig. 4 to the approximation of u_{2,h}(y). Here the testing error is 6.7321 × 10⁻³; yellow/green colours show the FE approximation, and blue/red colours the DNN approximation. We omit a visualization of u₁ since it is indistinguishable from the FE approximation.
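A sketch (ours) of the training configuration described above for the mixed parametric PDE problem: a network mapping y ∈ U to the K₁ × 2 matrix of P₁ coefficients of (u₁, u₂), trained on the ℓ²-mean squared error with Adam, an exponentially decaying learning rate and a batch size of m/2. The architecture sizes, the decay schedule and the synthetic targets are hypothetical; in the paper the targets come from finite element solves of (7).

```python
import numpy as np
import tensorflow as tf

d, K1 = 2, 289          # hypothetical parametric dimension and P1 dimension
m = 200                 # number of training samples, as in Fig. 3/4

# Network mapping y to the K1 x 2 coefficient matrix of (u1, u2).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(d,)),
    *[tf.keras.layers.Dense(50, activation="tanh") for _ in range(5)],
    tf.keras.layers.Dense(K1 * 2),
    tf.keras.layers.Reshape((K1, 2)),
])

# Adam with an exponentially decaying learning rate (hypothetical schedule).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse")

# Synthetic stand-in for the FE coefficient data obtained by solving (7).
rng = np.random.default_rng(4)
Y = rng.uniform(-1.0, 1.0, size=(m, d)).astype("float32")
C = rng.standard_normal((m, K1, 2)).astype("float32")
model.fit(Y, C, epochs=5, batch_size=m // 2, verbose=0)  # batch size m/2
```

In the actual study a separate network is trained for the gradient variable σ, owing to the mixed formulation.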
5 Conclusion

Deep neural networks offer many advantages for efficiently learning scalar- and Hilbert-valued functions and, in particular, solutions to parametric PDE problems. Unlike virtually every polynomial-based approach, selection of a basis or dictionary is not necessary when applying DNNs. They also offer the capability to solve challenging UQ problems which possess discontinuities leading to low parametric regularity, problems for which methods such as sparse polynomial approximation via compressed sensing are poorly suited. In this work we presented a theoretical result which demonstrates the efficacy of the DNN approach for smooth function approximation. However, as observed in our first experiments and elaborated further in (Adcock and Dexter 2020), such theoretical rates may not be met in practice via standard architecture designs and training. This suggests a strong possibility to further modify DNN architectures and training procedures to achieve superior performance across a range of challenging high-dimensional problems, including those on which polynomials perform badly, e.g. nonsmooth functions. Finally, we also presented new results on DNN approximation of parametric PDEs with mixed boundary conditions. These results highlight the potential of the DNN approach, although further work is needed to tune the architectures to get optimal performance – including better approximation of the gradient – and to determine the method's efficacy, in particular, in comparison to current best-in-class schemes based on polynomial approximation.

References

Adcock, B.; Brugiapaglia, S.; Dexter, N.; and Moraga, S. 2020. Deep Neural Networks Are Effective At Learning High-Dimensional Hilbert-Valued Functions From Limited Data. Preprint.
Adcock, B.; Brugiapaglia, S.; and Webster, C. G. 2017. Compressed sensing approaches for polynomial approximation of high-dimensional functions. In Boche, H.; Caire, G.; Calderbank, R.; März, M.; Kutyniok, G.; and Mathar, R., eds., Compressed Sensing and its Applications: Second International MATHEON Conference 2015, Applied and Numerical Harmonic Analysis, 93–124. Cham: Birkhäuser.
Adcock, B.; and Dexter, N. 2020. The gap between theory and practice in function approximation with deep neural networks. arXiv:2001.07523.
Adcock, B.; and Hansen, A. C. 2021. Compressive Imaging: Structure, Sampling, Learning. Cambridge: Cambridge University Press (in press).
Antun, V.; Renna, F.; Poon, C.; Adcock, B.; and Hansen, A. C. 2020. On instabilities of deep learning in image reconstruction and the potential costs of AI. Proceedings of the National Academy of Sciences.
Babuška, I.; Tempone, R.; and Zouraris, G. E. 2004. Galerkin Finite Element Approximations of Stochastic Elliptic Partial Differential Equations. SIAM J. Numer. Anal. 42(2): 800–825.
Berg, J.; and Nyström, K. 2018. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing 317: 28–41.
Chkifa, A.; Cohen, A.; Migliorati, G.; Nobile, F.; and Tempone, R. 2015. Discrete least squares polynomial approximation with random evaluations – application to parametric and stochastic elliptic PDEs. ESAIM Math. Model. Numer. Anal. 49(3): 815–837.
Chkifa, A.; Cohen, A.; and Schwab, C. 2015. Breaking the curse of dimensionality in sparse polynomial approximation of parametric PDEs. J. Math. Pures Appl. 103(2): 400–428.
Cohen, A.; DeVore, R.; and Schwab, C. 2011. Analytic regularity and polynomial approximation of parametric and stochastic elliptic PDE's. Analysis and Applications 9(01): 11–47.
Cohen, A.; and DeVore, R. A. 2015. Approximation of high-dimensional parametric PDEs. Acta Numer.
Cyr, E. C.; Gulian, M. A.; Patel, R. G.; Perego, M.; and Trask, N. A. 2020. Robust Training and Initialization of Deep Neural Networks: An Adaptive Basis Viewpoint. In Lu, J.; and Ward, R., eds., Proceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107 of Proceedings of Machine Learning Research, 512–536. Princeton University, Princeton, NJ, USA: PMLR.
Dal Santo, N.; Deparis, S.; and Pegolotti, L. 2020. Data driven approximation of parametrized PDEs by Reduced Basis and Neural Networks. J. Comput. Phys. 416: 109550.
Dexter, N.; Tran, H.; and Webster, C. 2019. A mixed ℓ¹ regularization approach for sparse simultaneous approximation of parameterized PDEs. ESAIM Math. Model. Numer. Anal. 53: 2025–2045.
Doostan, A.; and Owhadi, H. 2011. A non-adapted sparse approximation of PDEs with stochastic inputs. J. Comput. Phys. 230(8): 3015–3034.
Faber, F. A.; Hutchison, L.; Huang, B.; Gilmer, J.; Schoenholz, S. S.; Dahl, G. E.; Vinyals, O.; Kearnes, S.; Riley, P. F.; and von Lilienfeld, O. A. 2017. Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error. Journal of Chemical Theory and Computation 13(11): 5255–5264.
Gatica, G. 2014. A Simple Introduction to the Mixed Finite Element Method: Theory and Applications. SpringerBriefs in Mathematics. Springer, Cham.
Geist, M.; Petersen, P.; Raslan, M.; Schneider, R.; and Kutyniok, G. 2020. Numerical Solution of the Parametric Diffusion Equation by Deep Neural Networks. arXiv:2004.12131.
Gottschling, N. M.; Antun, V.; Adcock, B.; and Hansen, A. C. 2020. The troublesome kernel: why deep learning for inverse problems is typically unstable. arXiv:2001.01258.
Grohs, P.; Perekrestenko, D.; Elbrächter, D.; and Bölcskei, H. 2019. Deep Neural Network Approximation Theory. arXiv:1901.02220.
Hampton, J.; and Doostan, A. 2015. Compressive Sampling of Polynomial Chaos Expansions: Convergence Analysis and Sampling Strategies. J. Comput. Phys. 280: 363–386.
Khoo, Y.; Lu, J.; and Ying, L. 2020. Solving parametric PDE problems with artificial neural networks. European J. Appl. Math. (in press).
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Kutyniok, G.; Petersen, P.; Raslan, M.; and Schneider, R. 2020. A Theoretical Analysis of Deep Neural Networks and Parametric PDEs. arXiv:1904.00377.
Laakmann, F.; and Petersen, P. 2020. Efficient approximation of solutions of parametric linear transport equations by ReLU DNNs. arXiv:2001.11441.
Nobile, F.; Tempone, R.; and Webster, C. G. 2008a. A sparse grid stochastic collocation method for partial differential equations with random input data. SIAM J. Numer. Anal. 46(5): 2309–2345.
Nobile, F.; Tempone, R.; and Webster, C. G. 2008b. A sparse grid stochastic collocation method for partial differential equations with random input data. SIAM J. Numer. Anal. 46(5): 2309–2345.
Ongie, G.; Jalal, A.; Metzler, C. A.; Baraniuk, R. G.; Dimakis, A. G.; and Willett, R. 2020. Deep Learning Techniques for Inverse Problems in Imaging.
Opschoor, J. A. A.; Schwab, C.; and Zech, J. 2019. Exponential ReLU DNN expression of holomorphic maps in high dimension. Technical Report 35, ETH Zürich.
Petersen, P.; and Voigtlaender, F. 2018. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 108: 296–330.
Smith, R. 2013. Uncertainty Quantification: Theory, Implementation, and Applications. Computational Science and Engineering. SIAM.
van den Berg, E.; and Friedlander, M. P. 2009. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput. 31(2): 890–912.
van den Berg, E.; and Friedlander, M. P. 2019. SPGL1: A solver for large-scale sparse reconstruction. https://friedlander.io/spgl1.