=Paper=
{{Paper
|id=Vol-2491/paper121
|storemode=property
|title=Scientific Machine Learning: Towards Predictive Closed-Form Models
|pdfUrl=https://ceur-ws.org/Vol-2491/paper121.pdf
|volume=Vol-2491
|authors=Dimitri Papadimitriou,Steven Latre
|dblpUrl=https://dblp.org/rec/conf/bnaic/PapadimitriouL19
}}
==Scientific Machine Learning: Towards Predictive Closed-Form Models==
Scientific Machine Learning: Towards Predictive Closed-Form Models Dimitri Papadimitriou and Steven Latre University of Antwerpen, Antwerpen, Belgium, dimitri.papadimitriou@uantwerpen.be Abstract. Captured from ecological or natural systems, high-dimensional scientific data span a large spectrum ranging from physical to bio-chemical processes. Though machine learning plays nowadays a role in automating various related data analysis and mining tasks such as pattern recogni- tion, feature extraction (detection, identification, etc.) and dimensional- ity reduction tasks, data alone are not sufficient to understand phenom- ena. If the aim is actually to understand underlying mechanisms and processes, then models are needed. The question becomes thus which models to build and how to build these models ? This paper addresses these questions and related challenges from the perspective of neural- based learning principles and techniques. Keywords: neural networks, domain uninformed problems, partial differential equations 1 Introduction Scientific disciplines mainly focus on exact mathematical models obtained by combining first principles (natural laws, interactions, etc.) with experimental data although approximate models (of lower complexity) could be obtained from measurement data that allow to tune complexity vs. accuracy. Measurement data being imprecise due to measurement errors, often noisy and possibly gen- erated by complex phenomena, are not exactly representable by a model in the considered hypothesis class. Thus, both statistical estimation and deterministic approximation aspects should be present in data modeling related tasks though often dominated by the former (as further detailed in Section 2) [17]. The involvement of machine learning in explanatory but also predictive mod- eling remains mainly driven by pre-existing physical model knowledge: hypothe- sis are drawn from function spaces/model classes derived from and delimited by knowledge acquired by external means. Indeed, application of machine learning to scientific disciplines, focuses so far primarily on forward problems (predict the response u of the system from model parameters m) and inverse problems (iden- tify/estimate the model parameters m from observations of its response u). Both problems are often formulated by assuming that the underlying phenomenon Φ is a dynamical (data generating) system characterized by sets of partial differential Copyright c 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 2 Dimitri Papadimitriou and Steven Latre equations (PDEs), that are typically non-linear and non-homogeneous. In both cases the (non)linear differential operator is assumed to be known but the PDEs parameters (coefficients, source terms, etc.) have to be identified. However, these parameters are often hardly or even not at all accessible to direct measurements: they have to be fitted from indirect measurements. In turn, the involvement of machine learning focuses on solving inverse problems of parameter identification [25] that are mathematically challenging since often not well-posed in the sense of Hadamard [11] 1 . Hypothesis generation and model identification, i.e., automatically finding the explicit formulation of the (nonlinear) operator F mapping the model pa- rameters m to the system’s response/solution u = F(m), stays outside the main application scope of machine learning. This is the case when the dynamics of the underlying phenomenon Φ is assumed to be characterized by sets of PDEs; espe- cially when their explicit formulation is unknown, i.e., the available data doesn’t follow a given or known physical model. Indeed, identifying from high dimen- sional spatio-temporal data the mathematical expression of the nonlinear PDEs that governs the dynamics/evolution of the data is even less attainable with cur- rent machine learning methods and techniques. As we will show throughout this paper, operating in closed and adaptive loop to learn the physical laws, models and structures underlying the data remains to be accomplished. Hence, for sci- entific disciplines, machine learning is (still) not used to build a mathematical model of the underlying phenomenon but mainly as a pre-existing model fitting or statistical inference tool. On the other hand, machine learning and neural learning techniques in par- ticular don’t look so far at scientific disciplines with a specific lens. Compared to computer vision for instance, where neural learning has both adapted its models (e.g., convolutional neural networks, Cayley neural networks) and specialized training algorithms (e.g., pooling, iterative total variation methods), nothing similar can be recorded for scientific disciplines. To bring machine learning at the level of a mathematical modeling tool for scientific disciplines, referred to as Scientific Machine Learning (SML), a broader set of techniques ranging from mathematical model building until their validation and verification shall be in- volved. Few trials to fill this significant gap have been recently initiated, e.g., [7]. They often start by considering a broad set of statistical learning challenges and hope that their potential solving could find some traction or applicability. However, SML is not about overcoming the well-known fundamental limits of statistical learning or generic machine learning techniques but about understanding and solving scientific problems by involving specific/customized machine learning techniques when generic ones fail to address them or are simply unavailable (thus, requiring new techniques). The remainder of this paper is structured as follows. Section 2 positions and details the proposed modeling method together with the possible role and cur- 1 Note also that parameter identification problems are often nonlinear even if the PDE itself is linear SML 3 rently identified limits of statistical machine learning. Related and prior research work are documented in Section 3 and contrasted against already identified fun- damental tradeoffs. Section 4 documents the essential techniques involved in the scientific machine learning context. Finally, Section 5 draws some concluding remarks and perspectives for this nascent research domain. 2 Predictive Closed-Form Modeling For bringing machine learning as a mathematical modeling tool for scientific dis- ciplines, its primary goal shall move beyond solving inverse problems and focus on discovering hidden or even new physical models and laws underlying the dy- namics of, e.g., natural/ecological systems and its various bio-physical processes. The overall method consists of automatically transforming the high-dimensional spatio-temporal data obtained from observation into predictive mathematical models without assuming that the available data follows a given or known phys- ical law/model described by a system of partial differential equations. 2.1 Positioning As outlined in the Introduction, this problem –and related challenges– contrasts to the current application scope of statistical machine learning and related meth- ods in the context of physical and more generally natural sciences. This scope is still mainly focused on the solving of domain-informed/explicit inverse prob- lems by means of data modeling methods [5]. They rely on the assumption that the data (representing the response u of the system to independent input vari- ables x) are generated by a given stochastic model. These methods enable to formulate hypothesis and estimate the model that best explains the data (fitting pre-existing physics-based models to the data) but also perform various inference tasks (such as density estimation) from data samples (x, u). As data obtained from observations of dynamical systems involving unknown physical, bio-chemical or ecological mechanisms show increasing complexity, data models specification would also become more complex up to losing the advantage of providing a (relatively) simple explanation of the underlying natural phenom- ena (if ever explainable by parametric models). Hence, in addition to statis- tical/quantitative methods, solving such problems may also involve –but to a lesser extent– qualitative methods. Involving the latter enables to approximate high-dimensional process by a small number of free parameters by extracting local and global intrinsic features and structures as well as their relations that are invariant under various transformations or embeddings. Heavily influenced by data models due to their explanatory power (with the assumption that models with high explanatory power possess inherently high predictive power), physical/natural sciences have not widely exploited algorith- mic models so far. Algorithmic modeling methods include among others deep neural networks. They aim at finding a transformation function of the input x, i.e., an algorithm that operates on the input data to produce/predict the 4 Dimitri Papadimitriou and Steven Latre response u while treating the data mechanism as unknown. Though often lack- ing interpretability, algorithmic methods are a priori suitable for the automated learning of predictive models capable to forecast with high accuracy the evolu- tion of physical systems. Nevertheless, discovering the mathematical expression of predictive models even with algorithmic methods remains highly challenging. Especially, when the high-dimensional spatio-temporal data that is available doesn’t follow a pre- existing/known physical law or model. In this case, searching for an (approx- imate) model of the physical process Φ involves the generation of hypothe- sis from a function space that isn’t necessarily limited anymore by these pre- existing/known physical models. Hence, some constraints have to be considered to guide the search process. For instance, fixing the highest derivatives order (instead of considering it as a free parameter) would limit the dimensionality of the search space. Compared to classical physics- or more general domain-informed inverse problems, such class of problems can be referred to as domain-uninformed 2 . Although inverse problems span nowadays implicit variants when data cannot be formulated explicitly as a function of the unknown model parameters, prob- abilistic methods through Monte Carlo sampling of feasible solutions of implicit inverse problems or local search matheuristics yield numerical procedures. 2.2 PD(A)E Modeling To uncover/extract the physical laws from high dimensional spatio-temporal data obtained from potentially noisy observations, the learning task consists of identifying the mathematical expression of the nonlinear partial differential equations (PDE) that governs the dynamics/evolution of the observed data. Nonlinear dynamics implies that the system is not assumed a priori to satisfy the superposition principle. For this purpose, we assume throughout this paper that the underlying physical system is governed by coupled equations of the form: ( ut = F (x, y, u, v, ux , uy , vx , vy , uxx , uyy , vxx , vyy , . . .) (1) vt = G(x, y, u, v, ux , uy , vx , vy , uxx , uyy , vxx , vyy , . . .) In this system, the nonlinear functionals F and G depend on – the solutions u = u(x, y, t) and v = v(x, y, t) function of time t and multi- dimensional vectors x = (x1 , . . . , xm ) and y = (y1 , . . . , ym ) ∈ Ω ⊂ Rm – their first-order time derivatives ut = ∂u ∂t and vt defined similarly ∂u ∂u – their element-wise first-order partial derivatives ux = ∂x 1 , . . . , ∂xm , uy , vx , and vy defined similarly 2 This class of problems could also be referred to as model- or physics-free though this terminology is avoided for trivial reasons SML 5 2 2 2 – their element-wise second-order partial derivatives uxx = ∂∂xu2 , . . . , ∂x∂i ∂x u j ∂ u , . . . , ∂x 2 , 1 m uyy , vxx , and vyy defined similarly – etc. Note that the solutions u and/or v can also be multi-dimensional, i.e., u = (u1 , . . . , up ) and v = (v1 , . . . , vq ). In this case, ux denotes the component-wise ∂(u ,...,u ) elements of the Jacobian ∂(x11,...,xmp ) and uxx the components of the divergence of the Jacobian (a.k.a. vector Laplacian), and similarly for vx and vxx . This system of equations could be further generalized by involving, e.g., external force field, bifurcation parameters. As in general the system of PDEs is often incomplete, including a set of algebraic equations closes the system to form a partial differential algebraic equation (PDAE) set. Hence, throughout this paper, such models are referred to as closed-form. This system of equations is quite general and can be used for modeling many physical processes in various domains and environments including ecology, cli- matology, geophysics, hydrology, fluid dynamics, etc. but also neuro-physiology. Observe that the Fitzhugh-Nagumo neuron model (simplified version of the Hodgkin-Huxley model) is governed by a system of equations of this form. 2.3 Role of Machine Learning Common statistical machine learning techniques applied to finite data samples find limited applicability in the context of predictive modeling. The main reason being that physical processes often verify at least one if not all of the following properties that violate their key working assumptions: 1) non-linear, sometimes even the type of nonlinearity is unknown, i.e., their dy- namics cannot be modeled linearly due to, e.g., phase transitions, saturation effects, asymmetric cycles, chaotic behavior, etc.; although some (covariance- stationary) processes admit a Wold representation (sum of a deterministic component and an infinite sum of noise terms) that looks linear; 2) non-stationary their statistical properties are time-dependent including but not limited to, e.g., non-constant mean, variance as well as possible trends, periodicity or seasonality (sequential data over long time periods); 3) non-mixing, i.e., the statistical dependence between events does not reach 0 as time increases. Neural networks, thanks to their nonlinear function approximation capacity have the potential to address -at least partially- the former. When handling tem- poral data that show different variation at different levels, their transformation (e.g., Box-Cox, Fisher transform) can help stabilizing the variance. First- and second-order differences at lag 1 and seasonal differences can help stabilizing the mean; thus, reducing trend and seasonality. These operations are commonly applied in the analysis and modeling of time series, particular case of difference equations. In comparison, fitting continuous time models (i.e., time-dependent PDEs) to the data observed from non-stationary non-mixing processes remains 6 Dimitri Papadimitriou and Steven Latre challenging, especially when distributional assumptions can’t be formulated. The predictive capacity of the model renders this task even more difficult. For in- stance, with neural networks the target model would be approximated globally since training data (observations) are generalized during the learning phase. Instead, one might consider a learning method that produces different local ap- proximations in the target model and generalization beyond observation data delayed until a prediction request is made. On the other hand, empirical data samples do not necessarily verify the i.i.d. assumption at the root of statistical learning techniques, in particular, when training/calibration dataset and testing/running dataset do not follow the same distribution. Related measurement (variables) are affected by random errors that violate key assumptions of OLS fitting since often non-Gaussian and heteroskedastic (their variance is not constant instead of being unrelated to any of the explanatory/independent variables). Last but not least, accuracy (closeness of agreement between measured value and true value -when known) and precision level at which predictions have to be realized play an essential role in the selection/investigation of the appropriate (class of) technique(s). Consequently, although the availability of often large sets of labeled data pairs {(x, t; m), u} would imply the straightforward execution of existing model selection and identification techniques to find F such that u = F(x, t; m), their criteria of applicability may not be verified for the classes of problems under consideration. Although neural networks can be trained to obtain the numeric approximation of the relationship between the independent variables x and the various parameters m of the underlying process, determining the mathematical expression that would translate such relationships remains to be addressed. 2.4 Example In this section, we present an example of how physics-based and measurement- based modeling would model fluxes of greenhouse gas (GHG) that play an es- sential role in climate change [13]. In atmospheric transport, air motion can be modeled as an horizontal flow of numerous rotating swirls (called Eddies) of various sizes. Each Eddy has 3-D components including vertical air move- ment/vertical wind. The modeling of molecular and turbulent mass transport starts from the advection-diffusion equation describing how the concentration c = c(x, y, z, t) varies with time: ∂c ∂c ∂c ∂c ∂c ∂c ∂c ∂c ∂c ∂c +u +v +w = (Dx ) + (Dy ) + (Dz ) (2) ∂t ∂x ∂y ∂z ∂x ∂x ∂y ∂y ∂z ∂z Following the Reynolds decomposition (averaging rule), the instantaneous velocity ui along axis i can be decomposed as ui = hui i + u0i (a) where hui i is the time averaged velocity and u0i the instantaneous fluctuation component (de- viation from mean value) and the instantaneous concentration as c = hci + c0 (b) where hci is time averaged concentration and c0 the instantaneous fluctuation component (deviation from mean value), with hc0 i = 0. SML 7 ii Substituting (a) and (b) in Eq.2 with incompressible flow assumption ∂hu ∂xi = 0, homogeneous diffusion coefficients, yields the time averaged equation: ∂hci ∂hci ∂ 2 hci ∂hu0i c0 i ∂ ∂hci 0 0 + ui =D − = D − hu i c i (3) ∂t ∂xi ∂x2i ∂xi ∂xi ∂xi The second term in the RHS of Eq.3 corresponds to the turbulent flux aris- ing from the correlation between u0i and c0 : hu0i c0 i = hui ci–hui ihci. Since the coefficient D is usually a very small quantity, the molecular transport term can be neglected compared to the turbulent transport. Hence, we obtain: ∂hci ∂hci ∂hci ∂hci ∂hu0 c0 i ∂hv 0 c0 i ∂hw0 c0 i + hui + hvi + hwi =− − − (4) ∂t ∂x ∂y ∂z ∂x ∂y ∂z This PDE cannot be solved exactly because the time averaged equation com- prises 3 new unknown quantities (leading to more unknowns than number of equations). Three methods can be considered to overcome this problem: – Physics model method: approximates the turbulent fluxes hu0 c0 i, hv 0 c0 i and hw0 c0 i by their mean concentration gradient (following the Ficks law): ∂hci 0 0 ∂hci ∂hci hu0 c0 i = −x , hv c i = −y , hw0 c0 i = −z (5) ∂x ∂y ∂z The coefficients x , y and z (assumed D) are determined by, e.g., numer- ical simulation and the modeling equation retrieves the form of a canonical advection-diffusion equation (with c denoting hci and ui denoting hui i): ∂c ∂c ∂c ∂c ∂ ∂c ∂ ∂c ∂ ∂c +u +v +w = x + y + z (6) ∂t ∂x ∂y ∂z ∂x ∂x ∂y ∂y ∂z ∂z – Measurement-based method: estimates the turbulent fluxes from mea- surement. From the observation of the measurement variables w0 and c0 com- pute the vertical flux Fz along the z−axis as the mean covariance product hw0 c0 i between the deviation (from time averaged value) of the instantaneous vertical wind speed w0 and the deviation (from time averaged value) of the instantaneous concentration c0 . Thus, model unknowns appear as (composi- tion of) measurement variables. – Machine learning-based method: approximates the solution by means of a multi-layer neural network (trained with a pre-defined range of coef- ficient values obtained, e.g., from simulation). Then, use the (numeric) so- lution ĉ as input for the solving of an inverse problem to determine the diffusivity coefficients , i.e., ĉ = F(). Since the operator is nonlinear, solv- ing the inverse problem relies on an iterative method based, e.g., on the Levenberg-Marquardt algorithm. This combined forward-inverse loop may also be repeated iteratively or tuned until obtaining a solution satisfying quality criteria. This method extends to the case where the PDE model it- self is unknown and only its numerical approximation by the neural network is of interest. However, extracting the mathematical expression of the PDE model itself remains to be tackled as explained in Sections 3 and 4. 8 Dimitri Papadimitriou and Steven Latre 3 Related Work: Main Approaches and Tradeoffs The first method proposed by [23] to uncover the mathematical expression of the PDE model consists of constructing a dictionary containing a large collection of candidate atomic terms (partial derivatives) that are likely to appear in the unknown functions F and G. Then, by taking advantage of sparsity-promoting techniques such as sparse nonlinear regression, one can determine the coeffi- cients of the terms appearing in the expansion of the unknown functions F and G. Thus, determine the most informative/significantly contributing terms in the RHS of the partial differential equations that most accurately represent the time dynamics of the data. This method leverages the fact that most physical systems have only a few relevant terms that define the dynamics, making the governing equations sparse in a high-dimensional nonlinear function space. However, dictio- nary learning is typically limited to small-scale representation problems whereas the required number of terms to include a priori (i.e., at training time) in the library increases exponentially with the dimensionality of the input data, the output/solution as well as the derivatives order. Observe also that in compar- ison symbolic differentiation suffers from expression swell as it can easily pro- duce exponentially large symbolic expressions which take correspondingly long to evaluate. Hence, the main challenge becomes to avoid the computational lim- its of numerical differentiation while preserving interpretability of the learned dynamics (that is unknown beforehand) through its mathematical expression. Changing the dictionary such that the system of equations is expressed in a sin- gle multi-vector equation using geometric algebra could be a viable alternative at the condition that the resulting model remains interpretable. The main alternative exploits the nonlinear function approximation prop- erty of (feedfoward) multi-layer neural networks. The straightforward method as proposed in [22] would proceed by representing the unknown solutions u and v as well as the nonlinear functions F and G each by a multi-layer neural network. However, exploiting the expressivity of neural networks comes at the cost of completely losing the interpretability of the learned functions F and G defined by Eq.1, since their functional form remains unrevealed. Thus, this ap- proach keeps the physical laws underlying the dynamics of the data generating system/physical process hidden. Hence, such method introduces a fundamental tradeoff between expressivity and interpretability in addition to the usual trade- off between expressivity and trainability of neural networks (i.e., learnability of their parameters). Obviously, one could fall back onto physics-informed meth- ods to inverse problems but then one should relax model assumptions, otherwise uncovering new physical laws from observation data would remain unaddressed. Next to direct methods, an indirect method has been recently proposed in [18] that develops a lifting technique for nonlinear system identification/parameter estimation based on the framework of the Koopman operator theory [15]. In general, indirect methods solve an initial value problem and do not require the estimation of time derivatives [4]; they provide an alternative to direct methods at the expense of solving a (nonconvex) nonlinear least squares problem. Instead of identifying the nonlinear system in the state space, the key idea developed is SML 9 to identify the linear infinite-dimensional (linear) Koopman operator in the lifted space of observables, a process which results in a linear method for nonlinear systems identification. Hence, this indirect method not only circumvents the estimation of time derivatives but also relies on linear least squares optimization. It provides a parametric identification technique that can accurately reconstruct linear/polynomial vector field of a broad class of systems (including unstable, chaotic, and system with inputs). Its dual provides estimates of the vector field at the data points and is well-suited to identify high-dimensional systems with small datasets. Extension of this lifting method for identifying nonlinear PDEs though potentially attractive for low-sampling rate datasets remain centered on identification / nonlinear parameter estimation; hence, subject to the same limits experienced by direct parametric methods using library functions in addition to its inability in identifying general vector fields. Consequently, a workable and provable technique to uncover the closed-form expression of predictive PD(A)E-based models remains clearly beyond reach of existing machine learning methods including neural networks. 4 Neural Network-based PDE Learning New insights are required to bring machine learning at the level of a data- driven mathematical modeling tool capable to automatically discover the closed- form expression of the PD(A)Es that govern the dynamics of various natu- ral/ecological systems (as observed from the data). For this purpose, this section focuses on how the main learning task could be realized by means of advances in neural networks for the reason explained in Section 2.3. 4.1 Neural Networks as PDE Solvers Though approximating solution of nonlinear PDEs by (deep-)neural networks is not the final objective when learning the closed-form expression of PD(A)E mod- els, such class of solvers constitute an essential part of the proposed modeling method. The theoretical foundation for approximating a function and higher- order derivatives with a neural network relies on a variant of the universal ap- proximation theorem by Hornik [12]. Theorem 4 in [12] establishes that if the activation function is k−times continuously differentiable, bounded and non- constant, then any k−times continuously differentiable function and its deriva- tives up to order k can be uniformly approximated by two-hidden layer neural networks on compact subsets. Even if rather restrictive, most recent papers show the existence of neural networks approximating solutions of the PDEs by (some- times implicitly) referring to this Theorem. In [24], authors propose a learning algorithm, referred to as Deep Galerkin Method (DGM), for approximating solutions of high-dimensional quasi-linear parabolic PDE. The DGM algorithm is similar in spirit to Galerkin methods, with the solution approximated by a neural network instead of a linear combi- nation of basis functions. To satisfy the differential operator, initial conditions, 10 Dimitri Papadimitriou and Steven Latre and boundary conditions, the training algorithm performs, instead of forming a mesh, on sequences of randomly sampled space and time points. This technique yields a meshfree method, which is essential for high-dimensional PDEs. Despite these promising results and their potential extension to hyperbolic, elliptic, and partial-integral differential equations, several open research ques- tions remain to be addressed: – PDEs with highly non-monotonic or oscillatory solutions may be more chal- lenging to solve and further developments in terms of neural network archi- tecture are necessary. Indeed, neural model selection, including its depth, layer width and activation function requires to account for nonlinear func- tions of the input that rapidly change in certain time and space regions. Moreover, only under certain growth and smoothness assumptions on the nonlinear terms, the convergence proof follows by the smoothness of the neural networks together with compactness arguments. – Solving the corresponding initial and boundary value problem (iBVP) re- lies on the penalty method which converts the constrained error minimiza- tion to an unconstrained problem using a first-order iterative method. The main drawback being that this method yields ill-conditioned problems when naively selecting penalty coefficients. The training algorithm appears thus as a main bottleneck for the solving of nonlinear PDEs. Instead, one should consider iteratively solving this error minimization problem using the aug- mented Lagrangian method. The latter adds a term corresponding to an estimate of the Lagrange multiplier in the objective of the unconstrained problem. Although, this method avoids ill-conditioning as the accuracy of the Lagrange multiplier estimation improves at every step, it not yet es- tablished if this algorithm will scale to more three-dimensional problems (domain Ω ∈ R3 ). – Neural network trained following this algorithm converge in Lp (p ≤ 2) to the solution of the PDE as the number of hidden neuron units tends to infinity; hence, deep neural networks do not circumvent/avoid the curse of dimensionality problem though they can avoid forming a mesh. Moreover, upper bounds for the number of weights/neuron units and layers increase exponentially with the dimension d of the input space. Such bounds have been obtained when approximating, e.g., Sobolev-regular functions with re- spect to (fractional) Sobolev norms and Besov-regular functions, by deep neural networks; even if for these classes of functions deep neural models can achieve an optimal rate of approximation error. On the positive side, the solving of nonlinear PDEs using neural networks with rectified linear units (ReLU) has been recently proven to achieve for both univariate and multivariate real-valued functions3 , the approximation error with respect to Sobolev norms (compared to the usual Lp norms, p ∈ [1, ∞]) and the approximation rate bounds fidelity achievable with hp-FEM. The latter is a 3 Note: at the time of writing only the demonstration for the univariate case has been published in [21] SML 11 general version of the finite element method (FEM) developed by Babuska in the early 90’s [2]. In this seminal paper, Babuska et al. discovered that FEM converges exponentially fast when the mesh is refined using a suitable combi- nation of h-refinements (dividing elements into smaller ones) and p-refinements (increasing their polynomial degree). Although, from the approximation theo- retical perspective, deep neural networks perform as well as the best available FEM method, the (uninformed) modeling/identification of nonlinear PDEs by means of neural networks still remains highly challenging. 4.2 Polynomial Neural Networks For recognition/identification tasks, neural networks identify patterns accord- ing to their relationship, responding to related patterns with a similar out- put by transforming the weighted sum of inputs to describe overall similar- ity relationships of input patterns. Their main shortcoming in pattern recogni- tion/identification is the disability of input pattern generalization: in case of rota- tion or translation of the input matrix of variables, the recognition/identification will fail. Polynomial Neural Network (PNN) [16] [27] for identification of de- pendence of variables describe a functional dependence of input variables –not entire patterns. A neural network can thus be seen as a simplified form of PNN [27] for which combinations of variables are missing. Polynomial functions ap- ply these combinations to preserve partial dependencies of variables compared to conventional neural networks which can’t utilize data relations for recogni- tion/identification tasks. Further, sum fraction terms can approximate multi-variable functions on the basis of discrete observations enables replacing a general PDE definition with polynomial elementary data relation descriptions. Differential polynomial neural networks (D-PNN) introduced by Zvajka in 2011 [28] form a class of neural networks, which construct and solve an unknown general PDE of a function of interest with selected substitution relative terms using non-linear multi-variable composite polynomials. The layers of the network generate simple and composite relative substitution terms whose convergent series combinations can describe partial dependent derivative changes of the input variables. The starting point of the D-PNN development relies on the Group Method of Data Handling (GMDH) developed by Ivakhnenko [14]. – Group Method of Data Handling (GMDH): iterative method intro- duced by [14] which decomposes the complexity of a system or process into many simpler relationships each described by a processing function of a sin- gle neuron. This method identifies general non-linear relations between in- put and output variables by means of support functions. Most GMDH algo- rithms use polynomial support functions expressed in the form of a functional Volterra series whose discrete analogue is known as the Kolmogorov-Gabor (K-G) polynomial 12 Dimitri Papadimitriou and Steven Latre n X n X X n n X X n X n y = a0 + ai xi + aij xi xj + aijk xi xj xk + . . . (7) i=1 i=1 j=1 i=1 j=1 k=1 where (x1 , x2 , . . . , xn ) denotes the vector of input variables and (a1 , a2 , . . . , an ) the vector of coefficients. Each relationship is described by a low order two- variable polynomial processing function of a single neuron node. y = a0 + a1 xi + a2 xj + a3 xi xj + a4 x2i + a5 x2j (8) – Polynomial Neural Network (PNN): enable the identification of vari- ables dependence by describing a functional dependence of input variables (not entire patterns as DNN does). This could be regarded as a pattern ab- straction in which identification is not based on values of variables but only on their relations. Multi-layer neural networks (including deep neural net- works) can thus regarded as a simplified form of PNN where combinations of variables are missing and thus can’t utilize data relations for recogni- tion/identification tasks. Instead, polynomial functions apply those to pre- serve partial dependencies of variables. The GMDH algorithm builds up a polynomial (actually a multinomial) combination of the input components by forming in successive steps a PNN: adding one layer at a time, calculating its polynomial coefficients and selecting the best-output polynomial neurons, entered the next layer, so far as possible improvements in the last added out- put layer. The PNN structure is developed through learning: the number of layers of the PNN is not fixed in advance but becomes dynamically meaning as this self-organizing network grows over the trained period [19]. – Differential PNN (D-PNN) uses a complete multi-layer PNN structure to form (decompose) and solve an unknown general PDE of a searched func- tion u = f (x1 , . . . , xn ) with selected combination of substitution terms us- ing non-linear/factional multi-variable composite polynomials. The layers of the network generate composite substitution terms whose convergent series combinations can describe partial dependent derivative changes of the input variables. The substitution relative derivative terms are selected and com- bined according to the similarity dimensional analysis, which forms system characteristics from dimensionless ratio groups of variables [20]. Although, D-PNN extends the complete GMDH network structure to pro- duce convergent sum series of relative derivative terms, which can define and substitute for an unknown general PDE of a searched multi-variable func- tion model, the D-PNN’s operating and design principles are different from those of GMDH. D-PNN decomposes the general PDE analogously to GMDH performs the general input–output connection polynomial (7). It applies the complete PNN skeleton to generate convergent sum series of selected sub- stitution derivative terms using the principles of similarity analysis to solve the general PDE. Similarity analysis facilitates substitutions for differential equations or can form dimensional units from data samples to describe real- world problems. K-G polynomials (8) are suitable for modeling polynomials SML 13 with PNN. Instead, for a polynomial approximation of complicated multi- variable periodic functions or time-series functions, the PNN approximation ability has to be improved by modifying some GMDH polynomial (7) items with the use of the sigmoidal function. The sigmoidal function may trans- form some polynomial items together with the parameters to improve the polynomial derivative term series ability to approximate complicated peri- odic functions. The simple reason stems because low order polynomials are not able to fully make up for the complete cycles. Though at first glance attractive, the main drawback of D-PNN in model- ing complexity is their direct proportionality to the increasing number of input variables, as further hidden layers are required to define all the possible com- bination PDE terms. Compared to conventional neural networks, Though such structures can form, more complex and versatile model types of systems de- scribed by a large number of data variables, the D-PNN complexity becomes overwhelming due to combinatorial nature of the compositions. Hence, progress- ing such techniques involves the progressive/iterative construction of the search space, thus the dynamic construction of the neural network itself. 4.3 General-purpose Automatic Differentiation (GAD) Classical algorithmic data modeling methods, such as neural networks, consider F and G as black box functions. However, one could also model them as dif- ferentiable directed graphs that are dynamically composed by functional blocks and rely on general-purpose Automatic Differentiation (AD) [3]. AD enables to compute derivatives through accumulation of numerical values during code ex- ecution to generate numerical derivative evaluations as opposed to derivative expressions. This class of techniques performs these operations by using sym- bolic rules of differentiation but keeps track of derivative values as opposed to the resulting expressions. By extending the arithmetic of AD to higher or- der derivatives of multivariate functions, all numerical computations appear as compositions of a finite set of elementary operations for which derivatives are known (instead of involving numerical differentiation). Next, by combining the derivatives of the constituent operations through the higher order chain rule one obtains the derivative of the overall composition. Observe that fixing the highest derivatives order (instead of considering it as a free parameter) limits the dimen- sionality of the search space. It remains an open question whether imposing such constraint could improve the computation time vs. prediction accuracy tradeoff. A second neural network can then be associated that i) controls the dynamic composition of the differentiable graph and ii) is trained to predict the numerical values corresponding to the functions F and G defined by (1). For this purpose: – Multi-dimensional neural networks also referred to as Clifford Neural Net- works (CNN) [1] [6] can be considered for training each (group of) vertex to identify the operation each of them performs on input features/activations. This class of networks can model multi-level interactions: within data points 14 Dimitri Papadimitriou and Steven Latre (between vector components) and between data points (between vectors) while interactions themselves take a vectorial form (instead of a scalar as in real-valued neural networks). – Combining both group- and exclusivity- sparsity methods shall be enforced during the training phase to exploit positive and negative correlations be- tween features. – Techniques to interrogate the model to determine why and what has been learned shall then be elaborated in order to i) enable interpreting the learned model and its associated features and ii) translate this information into com- position primitives. Further, the predictive power of the learned model shall provide capability beyond single-step prediction, i.e., given the value of a variable/quantity at time t, predict its value at time t + 1, but instead enable for multi-step predictions over longer time horizon t + k, k 1. Involving multiple steps allows in turn to incorporate memory effects in learning the temporal dynamics and cover prob- lems with a non-Markovian dynamical structure. As available training data may not be invariant to time shifts, this non-stationarity property would be better captured by the modeling capacity of recurrent neural networks (RNN). – In the discrete time domain, we could consider the training of multi-directional recurrent neural networks (MD-RNN) [10] for this purpose since training data are often characterized by more than one spatio-temporal dimension. Instead of pre-processing the data to one dimension and involving a single recurrent connection, MD-RNNs uses as many recurrent connections as there are dimensions in the data. – In the continuous time domain, their combination with continuous time RNN (CT-RNN) [9] would enable handling training over continuous timescales while increasing the computational efficiency compared to standard discrete- time RNNs. The latter induce dependence of the resulting models on the sampling period used in the learning process when no information is given about the model trajectories between the sampling instants. 5 Discussion Extracting physical laws and models from data involves to a broad set of learn- ing tasks with increasing level of complexity. When the domain is known, the task can be seen as a PDE model fitting combining a neural network that ap- proximates the PDE solution and another that checks the concordance between the approximated solution and the fitted/trialed model. The latter differentiates the numerical solution following the known analytic form of the PDE. One ap- proximates a physical model from a relatively well delimited hypothesis space and in the simplest case only parameters needs to be identified. Unfortunately, this brings us back to the main observation that neural-based learning is (still) not an essential modeling tool but mainly a pre-existing model fitting tool. SML 15 When the domain is unknown, discovering the closed-form expression of the PD(A)E model even with algorithmic methods remains highly challenging. This paper by detailing the two main machine learning-based approaches (dictionary- based learning and a particular variant of the previous model) shows that cur- rent neural network training algorithms but also neural network model selection are not (yet) at the level required to perform such tasks. Of course, we could still decompose this case into situations where the physical model can’t be ap- propriately identified (hidden) but is known and the case where the physical process itself has never been identified. However, the main problem remains: the hypothesis/pattern search and matching process implemented in the neural learning paradigm is still too limited to cope with the exploration a large spaces in particular when multi-scale interactions among input variables are involved. References 1. P. Arena, L. Fortuna, G. Muscato and M.G. Xibilia, Neural networks in multidi- mensional domains, Lecture Notes in Control and Information Sciences (LNCIS), vol.234, Springer-Verlag, 1998. 2. I. Babuska, and B.Q. Guo, The h, p and h-p version of the finite element method: basis theory and applications, Advances in Eng. Software, 15(3-4):159-174, 1992. 3. A.G. Baydin, B.A. Pearlmutter, A.A. Radul, and J.M. Siskind, Automatic differ- entiation in machine learning: a survey, JMLR, 18:1-43, 2018. 4. H.G. Bock, Recent advances in parameter identification techniques for ODE, Numerical treatment of inverse problems in differential and integral equations, Springer, 1983, pp.95-121. 5. L. Breiman, Statistical Modeling: The Two Cultures, Statistical Science, 16(3):199- 231, 2001. 6. S. Buchholz, G. Sommer, On Clifford neurons and Clifford multi-layer perceptrons, Neural Networks, 21:925-935, 2008. 7. N. Baker et al., Workshop Report on Basic Research Needs for Scientific Ma- chine Learning: Core Technologies for Artificial Intelligence, Feb. 2019. Available at https://www.osti.gov/biblio/ 8. H. Edelsbrunner, D. Letscher, and A.J. Zomorodian, Topological persistence and simplification, Discrete & Computational Geometry, 28(4):511-533, 2002. 9. K.L. Funahashi, and Y. Nakamura, Approximation of Dynamical Systems by Con- tinuous time Recurrent Neural Networks, Neural Networks, 6:183-192, 1993. 10. A. Graves, S. Fernandez, J. Schmidhuber, Multi-Dimensional Recurrent Neural Networks, Proceedings of ICANN 2007, Part I, LNCS 4668, pp.549-558, Springer Berlin Heidelberg, 2007. 11. J. Hadamard, Sur les problèmes aux dérivées partielles et leur signification physique, Princeton University Bulletin, pp.49-52, 1902. 12. K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks, 4(2):251-257, 1991. 13. Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA. 14. A. Ivakhnenko, Polynomial theory of complex systems, IEEE Transactions on Sys- tems, Man and Cybernetics, 1(4):364-378, 1971. 16 Dimitri Papadimitriou and Steven Latre 15. B.O. Koopman, Hamiltonian systems and transformation in Hilbert space, Pro- ceedings of National Academy of Sciences (PNAS) of the United States of America, 17(5):315-318, 1931. 16. D.Marcek and M. Marcek, Neural Networks and their Applications, EDIS Publi- cations, Slovakia, 2006. 17. I Markovsky, Low-Rank Approximation, Algorithms, Implementation, Applica- tions, 2018, Springer. 18. A. Mauroy, and J. Goncalves, Koopman-based lifting techniques for nonlinear sys- tem identification, https://arxiv.org/pdf/1709.02003.pdf. 19. S.K. Oh, W. Pedrycz and B.J Park, Polynomial neural networks architecture: Anal- ysis and design, Computational Electrical Engineering, 29:703-725, 2003. 20. J.F. Price, Polynomial differential equations compute all real computable functions on computable compact intervals, American Journal of Physics, 71:437-447, 2003. 21. J.A.A. Opschoor, P.C. Petersen, and C. Schwab, Deep ReLU Networks and High- Order Finite Element Methods, Research Report No.2019-17, Seminar for Applied Mathematics (SAM), ETH Zurich, January 2019 22. M. Raissy, Deep Hidden Physics Models: Deep Learning of Nonlinear Partial Dif- ferential Equations, Journal of Machine Learning Research, vol.19, pp.1-24, 2018. 23. S.H. Rudy, S.L. Brunton, J.L. Proctor, and J.N. Kutz, Data-driven discovery of partial differential equations, Science Advances, 3(4), 2017. 24. J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations, Journal of Comp. Physics, 375:1339–1364, 2018. 25. V. Vemuri, Inverse Problems, in (Eds. and G. A. Bekey and B. Y. Kogan), Modeling and Simulation: Theory and Practice, A Memorial Volume for prof. Walter J. Karplus, pages 89-101, Kluwer Academic Publishers, Boston, 2002 26. A. Zomorodian, and G. Carlsson, Computing persistent homology, Discrete & Com- putational Geometry, 33(2):249-274, 2005. 27. L. Zjavka, Polynomial neural network, Proceedings of 7th European Conference Information and Communication Technologies, pp.277-280, June 2007. 28. L. Zjavka, Differential Polynomial Neural Network, Journal of Artificial Intelli- gence, 4:89-99, 2011.