<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Self-Adaptive Physics-Informed Neural Networks using a Soft Attention Mechanism</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Levi McClenny</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulisses Braga-Neto</string-name>
          <email>ulissesg@tamu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, Texas A&amp;M University, College Station</institution>
          ,
          <addr-line>TX 77845</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Physics-Informed Neural Networks (PINNs) have emerged recently as a promising application of deep neural networks to the numerical solution of nonlinear partial differential equations (PDEs). However, the solutions of stiff and semi-linear PDEs can contain regions where the solution and its gradient change rapidly, creating difficulties in training the solution network. It has been recognized that adaptive procedures are needed to force the neural network to fit these “stubborn” spots in the solution accurately. To accomplish that, previous approaches have used fixed weights in the loss function, hard-coded over regions of the solution deemed to be important. In this paper, we propose a new method to train PINNs adaptively, using fully-trainable weights that force the neural network to focus on regions of the solution that are difficult to approximate, in a way that is reminiscent of the soft multiplicative attention masks used in Computer Vision. The key idea in Self-Adaptive PINNs is to make the weights increase as the corresponding losses increase, which is accomplished by training the network to simultaneously minimize the losses and maximize the weights, as in augmented Lagrangian and constraint-satisfaction methods in classical nonlinear optimization. We present numerical experiments with the Allen-Cahn PDE in which the Self-Adaptive PINN outperformed other state-of-the-art PINN algorithms in L2 error, while using a smaller number of training epochs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As part of the burgeoning field of scientific machine
learning (Baker et al. 2019), physics-informed neural networks
(PINNs) have emerged recently as an alternative to
traditional partial differential equation (PDE) solvers (Raissi,
Perdikaris, and Karniadakis 2019; Raissi 2018; Wight and
Zhao 2020; Wang, Yu, and Perdikaris 2020). Typical
blackbox deep learning methodologies do not take into account
physical understanding of the problem domain. The PINN
approach is based on constraining the output of a deep
neural network to satisfy a physical model specified by a PDE.</p>
      <p>A great advantage of PINNs over traditional
timestepping PDE solvers is that the entire spatial-temporal
domain can be solved at once using collocation points
distributed irregularly (rather than on a grid) across the
spatial-temporal domain, in a process that can be massively
parallelized via GPU. As we have continued to see GPU
capabilities increase in recent years, a method that relies on
parallelism in training iterations could begin to emerge as the
predominant approach in scientific computing.</p>
      <p>The original continuous PINN in (Raissi, Perdikaris, and
Karniadakis 2019), henceforth referred to as the “baseline
PINN,” is effective at estimating solutions that are
reasonably smooth, such as Burgers’ equation, the wave
equation, Poisson’s equation, and Schrödinger’s equation. On the
other hand, it has been observed that the baseline PINN has
convergence and accuracy problems when solving more stiff
semi-linear PDEs, with solutions that contain sharp and
intricate space and time transitions (Wight and Zhao 2020;
Wang, Teng, and Perdikaris 2020). This is the case, for
example, of the Allen-Cahn and Cahn-Hilliard equations of
phase-field models (Moelans, Blanpain, and Wollants 2008).</p>
      <p>To address this issue, various modifications of the
baseline PINN algorithm have been proposed. For example, in
(Wight and Zhao 2020), a series of schemes are introduced,
including nonadaptive weighting of the training loss
function, adaptive resampling of the collocation points, and
time-adaptive approaches, while in (Wang, Teng, and Perdikaris
2020), a learning rate annealing scheme was proposed. The
consensus has been that adaptation mechanisms are
essential to make PINNs more stable and able to approximate
difficult regions of the solution well.</p>
      <p>This paper introduces Self-Adaptive PINNs, a simple
solution to the adaptation problem for solving partial
differential equations (PDEs), which uses trainable weights as a
soft multiplicative mask reminiscent of the attention
mechanism used in computer vision (Wang et al. 2017; Pang et al.
2019). The weights are trained concurrently with the
approximation network. As a result, initial, boundary or
collocation points in difficult regions of the solution are
automatically weighted heavier in the loss function, forcing the
approximation to improve on those points. Experimental
results show that Self-Adaptive PINNs can solve the
traditionally “stiffer” Allen-Cahn PDE accurately. The Self-Adaptive
PINN displayed more accurate results than other
state-of-the-art PINN adaptive training algorithms, while using a
smaller number of training epochs.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>Overview of Physics-Informed Neural Networks</title>
        <p>Consider a general nonlinear PDE of the form:</p>
        <p>$u_t + \mathcal{N}_x[u] = 0, \quad x \in \Omega, \; t \in [0, T],$ (1)</p>
        <p>$u(x, t) = g(x, t), \quad x \in \partial\Omega, \; t \in [0, T],$ (2)</p>
        <p>$u(x, 0) = h(x), \quad x \in \Omega,$ (3)</p>
        <p>where $x$ is a spatial vector variable in a domain $\Omega \subset \mathbb{R}^d$, $t$ is time, and $\mathcal{N}_x$ is a spatial differential operator. Following (Raissi, Perdikaris, and Karniadakis 2019), let $u(x, t)$ be approximated by the output $u_\theta(x, t)$ of a deep neural network with inputs $x$ and $t$. Define the residual as:</p>
        <p>$r_\theta(x, t) := \frac{\partial u_\theta}{\partial t}(x, t) + \mathcal{N}_x[u_\theta(x, t)],$ (4)</p>
        <p>where all partial derivatives can be computed by automatic differentiation methods (Baydin et al. 2017; Paszke et al. 2017). The parameters $\theta$ are trained by backpropagation (Chauvin and Rumelhart 1995) on a loss function that penalizes the output for not satisfying (1)-(3):</p>
        <p>$\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_b(\theta) + \mathcal{L}_0(\theta),$ (5)</p>
        <p>where $\mathcal{L}_r$ is the loss corresponding to the residual (4), $\mathcal{L}_b$ is the loss due to the boundary conditions (2), and $\mathcal{L}_0$ is the loss due to the initial conditions (3):</p>
        <p>$\mathcal{L}_r(\theta) = \frac{1}{N_r} \sum_{i=1}^{N_r} r_\theta(x_r^i, t_r^i)^2,$ (6)</p>
        <p>$\mathcal{L}_b(\theta) = \frac{1}{N_b} \sum_{i=1}^{N_b} |u_\theta(x_b^i, t_b^i) - g_b^i|^2,$ (7)</p>
        <p>$\mathcal{L}_0(\theta) = \frac{1}{N_0} \sum_{i=1}^{N_0} |u_\theta(x_0^i, 0) - h_0^i|^2,$ (8)</p>
        <p>where $\{x_0^i, h_0^i = h(x_0^i)\}_{i=1}^{N_0}$ are the data at time $t = 0$, $\{x_b^i, t_b^i, g_b^i = g(x_b^i, t_b^i)\}_{i=1}^{N_b}$ are the data at the boundary, $\{x_r^i, t_r^i\}_{i=1}^{N_r}$ are collocation points randomly distributed in the domain $\Omega$, and $N_0$, $N_b$, and $N_r$ denote the total number of initial data, boundary data, and collocation points, respectively. The parameters $\theta$ can be tuned by minimizing the total training loss $\mathcal{L}(\theta)$ via standard gradient descent procedures used in deep learning.</p>
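        <p>To make the pieces above concrete, the sketch below shows one way the residual (4) and the unweighted composite loss (5)-(8) might be computed with automatic differentiation in TensorFlow 2, the framework used for the experiments reported later. The network u_net, the Burgers-type example operator, and all names are illustrative assumptions, not the authors' implementation.</p>
        <preformat>
import tensorflow as tf

# Minimal illustrative sketch (not the authors' code). `u_net` maps a batch of
# (x, t) pairs, given as two column tensors of shape (N, 1), to u_theta(x, t).
u_net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(1),
])

def residual(x, t, nu=0.01):
    """Residual r = u_t + N_x[u]; here N_x[u] = u*u_x - nu*u_xx (a Burgers-type
    operator, purely for illustration). Derivatives come from autodiff."""
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape(persistent=True) as inner:
            inner.watch([x, t])
            u = u_net(tf.concat([x, t], axis=1))
        u_x = inner.gradient(u, x)   # recorded by `outer` as well
        u_t = inner.gradient(u, t)
    u_xx = outer.gradient(u_x, x)
    return u_t + u * u_x - nu * u_xx

def pinn_loss(x_r, t_r, x_b, t_b, g_b, x_0, h_0):
    """Unweighted composite loss L = L_r + L_b + L_0 of Eqs. (5)-(8)."""
    l_r = tf.reduce_mean(tf.square(residual(x_r, t_r)))
    l_b = tf.reduce_mean(tf.square(u_net(tf.concat([x_b, t_b], 1)) - g_b))
    u_0 = u_net(tf.concat([x_0, tf.zeros_like(x_0)], 1))
    l_0 = tf.reduce_mean(tf.square(u_0 - h_0))
    return l_r + l_b + l_0
</preformat>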
      </sec>
      <sec id="sec-2-2">
        <title>Related Work</title>
        <p>The baseline PINN algorithm can be unstable during
training and produce inaccurate approximations around sharp
space and time transitions in the solution of semi-linear
PDEs. Much of the recent literature on PINNs has been
devoted to mitigating these issues by introducing modifications
to the baseline PINN algorithm that can increase training
stability and accuracy of the approximation, mostly via
attempting to mitigate spectral bias inherent to neural network
approximations. We mention some of these approaches
below.
</p>
        <p>Nonadaptive Weighting. In (Wight and Zhao 2020), it
was pointed out that a premium should be put on forcing
the neural network to satisfy the initial conditions closely,
especially for PDEs describing time-irreversible processes,
where the solution has to be approximated well early.
Accordingly, a loss function of the form $\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_b(\theta) + C\,\mathcal{L}_0(\theta)$ was suggested, where $C \gg 1$ is a hyperparameter.</p>
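        <p>As a minimal sketch, assuming the individual losses have already been computed as in the earlier snippet, the scheme amounts to a single fixed scaling factor; C = 100 is the value used for the comparison in the Results section.</p>
        <preformat>
# Nonadaptive weighting in the spirit of (Wight and Zhao 2020): the initial-
# condition loss is scaled by a fixed hyperparameter C, chosen once by hand.
C = 100.0  # illustrative; value used for the comparison in the Results section

def weighted_pinn_loss(l_r, l_b, l_0):
    return l_r + l_b + C * l_0
</preformat>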
      </sec>
      <sec id="sec-2-3">
        <title>Learning Rate Annealing</title>
        <p>In (Wang, Teng, and Perdikaris 2020), it is argued that the optimal value of the
weight C in the previous scheme may vary wildly among
different PDEs so that choosing its value would be difficult.
Instead they propose to use weights that are tuned during
training using statistics of the backpropagated gradients of
the loss function. It is noteworthy that the weights
themselves are not adjusted by backpropagation. Instead, they
behave as learning rate coefficients, which are updated
after each epoch of training.</p>
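        <p>The following is only a rough sketch of the flavor of such a scheme, under the assumption that the per-term weights are set from ratios of gradient statistics and smoothed between epochs; the exact statistic and update rule in (Wang, Teng, and Perdikaris 2020) differ in their details.</p>
        <preformat>
# Rough sketch only: per-term loss weights derived from statistics of the
# backpropagated gradients and smoothed with a moving average. The precise
# rule in (Wang, Teng, and Perdikaris 2020) differs in detail.
def update_loss_weights(tape, l_r, l_b, l_0, weights, alpha=0.9):
    """`tape` is a persistent GradientTape that recorded l_r, l_b, and l_0;
    `weights` is a dict like {"b": 1.0, "0": 1.0} updated each epoch."""
    grads_r = tape.gradient(l_r, u_net.trainable_variables)
    max_r = tf.reduce_max([tf.reduce_max(tf.abs(g)) for g in grads_r])
    for key, loss in (("b", l_b), ("0", l_0)):
        grads = tape.gradient(loss, u_net.trainable_variables)
        # mean of per-variable mean absolute gradients
        mean = tf.reduce_mean([tf.reduce_mean(tf.abs(g)) for g in grads])
        target = max_r / (mean + 1e-8)
        weights[key] = alpha * weights[key] + (1.0 - alpha) * target
    return weights
</preformat>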
        <p>Adaptive Resampling. In (Wight and Zhao 2020), a
strategy to adaptively resample the residual collocation points
based on the magnitude of the residual is proposed. While
this approach improves the approximation, the training
process must be interrupted and the MSE evaluated on the
residual points to deterministically resample the ones with the
highest error. After each resampling step, the number of
residual points grows, increasing computational complexity.</p>
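        <p>A minimal sketch of the idea, assuming the illustrative residual helper from earlier and a large candidate pool of points; the resampling criterion and schedule in (Wight and Zhao 2020) are more elaborate.</p>
        <preformat>
# Sketch of residual-based resampling: periodically score a large candidate
# pool by residual magnitude and append the worst-fit points to the
# collocation set (which therefore grows over training).
def resample_collocation(x_pool, t_pool, x_r, t_r, n_add=1000):
    scores = tf.abs(residual(x_pool, t_pool))[:, 0]
    worst = tf.argsort(scores, direction="DESCENDING")[:n_add]
    x_r = tf.concat([x_r, tf.gather(x_pool, worst)], axis=0)
    t_r = tf.concat([t_r, tf.gather(t_pool, worst)], axis=0)
    return x_r, t_r
</preformat>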
      </sec>
      <sec id="sec-2-4">
        <title>Time-Adaptive Approaches</title>
        <p>In (Wight and Zhao 2020), another method is suggested, which divides the time axis
into several smaller intervals, and trains PINNs separately
on them, either sequentially or in parallel. This approach is
time-consuming due to the need to train multiple PINNs.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Neural Tangent Kernel (NTK) Weighting</title>
        <p>Most recently, (Wang, Yu, and Perdikaris 2020) introduced weights
on the collocation and boundary losses, which are updated
via neural tangent kernels. This approach derives a
deterministic kernel which remains constant or is updated
periodically at preset time intervals during training.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>While the methods outlined in the previous section
produce improvements in stability and accuracy over the
baseline PINN, they are either nonadaptive or require brute-force
adaptation at increased computational cost. Here we propose
a self-adaptive procedure that uses fully-trainable weights to
produce a multiplicative soft attention mask, in a manner
that is reminiscent of attention mechanisms used in
computer vision (Wang et al. 2017; Pang et al. 2019). This is
in agreement with the neural network philosophy of
self-adaptation: instead of hard-coding weights at particular
regions of the solution, the adaptation weights are updated by
backpropagation together with the network weights.</p>
      <p>The proposed Self-Adaptive PINN utilizes the following loss function</p>
      <p>$\mathcal{L}(w, \lambda_r, \lambda_b, \lambda_0) = \mathcal{L}_r(w, \lambda_r) + \mathcal{L}_b(w, \lambda_b) + \mathcal{L}_0(w, \lambda_0),$ (9)</p>
      <p>where $\lambda_r = (\lambda_r^1, \ldots, \lambda_r^{N_r})$, $\lambda_b = (\lambda_b^1, \ldots, \lambda_b^{N_b})$, and $\lambda_0 = (\lambda_0^1, \ldots, \lambda_0^{N_0})$ are trainable, nonnegative self-adaptation weights for the collocation, boundary, and initial points, respectively, and</p>
      <p>$\mathcal{L}_r(w, \lambda_r) = \frac{1}{N_r} \sum_{i=1}^{N_r} g(\lambda_r^i)\, r(x_r^i, t_r^i; w)^2,$ (10)</p>
      <p>$\mathcal{L}_b(w, \lambda_b) = \frac{1}{N_b} \sum_{i=1}^{N_b} g(\lambda_b^i)\, (u(x_b^i, t_b^i; w) - g_b^i)^2,$ (11)</p>
      <p>$\mathcal{L}_0(w, \lambda_0) = \frac{1}{N_0} \sum_{i=1}^{N_0} g(\lambda_0^i)\, (u(x_0^i, 0; w) - h_0^i)^2,$ (12)</p>
      <p>where the self-adaptation mask function $g$ is a nonnegative, differentiable, strictly increasing function. The key feature of Self-Adaptive PINNs is that the loss $\mathcal{L}(w, \lambda_r, \lambda_b, \lambda_0)$ is minimized with respect to the network weights $w$, as usual, but is maximized with respect to the self-adaptation weights $\lambda_r, \lambda_b, \lambda_0$, i.e., the objective is:</p>
      <p>$\min_{w} \; \max_{\lambda_r, \lambda_b, \lambda_0} \; \mathcal{L}(w, \lambda_r, \lambda_b, \lambda_0).$ (13)</p>
      <p>Consider the updates of a gradient descent/ascent approach to this problem:</p>
      <p>$w_{k+1} = w_k - \eta_k \nabla_w \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (14)</p>
      <p>$\lambda_r^{k+1} = \lambda_r^k + \eta_k \nabla_{\lambda_r} \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (15)</p>
      <p>$\lambda_b^{k+1} = \lambda_b^k + \eta_k \nabla_{\lambda_b} \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (16)</p>
      <p>$\lambda_0^{k+1} = \lambda_0^k + \eta_k \nabla_{\lambda_0} \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (17)</p>
      <p>where $\eta_k$ is the learning rate at step $k$, and</p>
      <p>$\nabla_{\lambda_r} \mathcal{L} = \left[\, g'(\lambda_r^{k,1})\, r(x_r^1, t_r^1; w_k)^2, \; \ldots, \; g'(\lambda_r^{k,N_r})\, r(x_r^{N_r}, t_r^{N_r}; w_k)^2 \,\right]^T,$ (18)</p>
      <p>$\nabla_{\lambda_b} \mathcal{L} = \left[\, g'(\lambda_b^{k,1})\, (u(x_b^1, t_b^1; w_k) - g_b^1)^2, \; \ldots, \; g'(\lambda_b^{k,N_b})\, (u(x_b^{N_b}, t_b^{N_b}; w_k) - g_b^{N_b})^2 \,\right]^T,$ (19)</p>
      <p>$\nabla_{\lambda_0} \mathcal{L} = \left[\, g'(\lambda_0^{k,1})\, (u(x_0^1, 0; w_k) - h_0^1)^2, \; \ldots, \; g'(\lambda_0^{k,N_0})\, (u(x_0^{N_0}, 0; w_k) - h_0^{N_0})^2 \,\right]^T.$ (20)</p>
      <p>Hence, if $g'(\lambda) > 0$, i.e., the mask function is strictly increasing, then $\nabla_{\lambda_r} \mathcal{L}, \nabla_{\lambda_b} \mathcal{L}, \nabla_{\lambda_0} \mathcal{L} \geq 0$, and any of these gradients is only zero if the corresponding unmasked loss is zero; e.g., $\nabla_{\lambda_0} \mathcal{L} = 0$ if and only if $u(x_0^i, 0; w_k) = h_0^i$ for all $i = 1, \ldots, N_0$, i.e., the neural network approximation satisfies the initial condition perfectly (at all given points). This shows that the sequences of weights $\{\lambda_r^k\}$, $\{\lambda_b^k\}$, $\{\lambda_0^k\}$, $k = 1, 2, \ldots$ (and the associated mask values) are monotonically increasing, provided that the corresponding unmasked losses are nonzero. Furthermore, the magnitudes of the gradients $\nabla_{\lambda_r} \mathcal{L}, \nabla_{\lambda_b} \mathcal{L}, \nabla_{\lambda_0} \mathcal{L}$, and therefore of the updates, are larger if the corresponding unmasked losses are larger. This progressively penalizes the network more for not fitting the residual, boundary, and initial points closely (the self-adaptive weights, i.e., the amount of penalty, are typically initialized to small nonzero values). We remark that any of the weights can be set to fixed, non-trainable values, if desired. For example, by setting $\lambda_b^k \equiv 1$, only the weights of the initial and collocation points would be trained.</p>
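      <p>A condensed sketch of one such descent/ascent step is given below, reusing the illustrative u_net and residual helpers from the Background section together with a quadratic mask; the boundary term is omitted for brevity, and the full TensorFlow 2 implementation linked in the Results section differs in its details.</p>
      <preformat>
# Condensed sketch of one self-adaptive training step (illustrative, not the
# reference implementation). The network weights w follow gradient *descent*
# on L; the self-adaptation weights follow gradient *ascent*.
N_r, N_0 = 20_000, 100
lam_r = tf.Variable(tf.random.uniform([N_r, 1], 0.0, 1.0))    # collocation weights
lam_0 = tf.Variable(tf.random.uniform([N_0, 1], 0.0, 100.0))  # initial-condition weights

opt_w = tf.keras.optimizers.Adam(1e-3)
opt_lam = tf.keras.optimizers.Adam(1e-3)

def mask(lam):              # quadratic mask g(lambda) = lambda**2
    return tf.square(lam)

@tf.function
def sa_train_step(x_r, t_r, x_0, h_0):
    with tf.GradientTape(persistent=True) as tape:
        l_r = tf.reduce_mean(mask(lam_r) * tf.square(residual(x_r, t_r)))
        u_0 = u_net(tf.concat([x_0, tf.zeros_like(x_0)], axis=1))
        l_0 = tf.reduce_mean(mask(lam_0) * tf.square(u_0 - h_0))
        loss = l_r + l_0                       # boundary term omitted here
    grads_w = tape.gradient(loss, u_net.trainable_variables)
    grads_lam = tape.gradient(loss, [lam_r, lam_0])
    opt_w.apply_gradients(zip(grads_w, u_net.trainable_variables))
    # Ascent on the self-adaptation weights: apply the negated gradients.
    opt_lam.apply_gradients(zip([-g for g in grads_lam], [lam_r, lam_0]))
    return loss
</preformat>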
      <p>The shape of the function g affects mask sharpness and
training of the PINN. Examples include polynomial masks
$g(\lambda) = c\lambda^q$, for $c, q > 0$, and sigmoidal masks. See Figure 1
for a few examples. In practice, the polynomial mask
functions have to be kept below a suitable (large) value, to avoid
numerical overflow. The sigmoidal masks do not have this
issue, and can also be used to produce sharp masks.</p>
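      <p>Two illustrative mask choices, with assumed constants, are sketched below; the quadratic polynomial mask is the one used in the experiments of the next section, while a scaled sigmoid is one way to obtain a bounded (and hence overflow-free) sharp mask.</p>
      <preformat>
# Illustrative mask functions g: nonnegative, differentiable, and strictly
# increasing for lambda >= 0, as required. Constants are illustrative.
def polynomial_mask(lam, c=1.0, q=2.0):
    return c * tf.pow(lam, q)          # q = 2 gives the simple quadratic mask

def sigmoid_mask(lam, scale=100.0):
    return scale * tf.sigmoid(lam)     # bounded above by `scale`, avoids overflow
</preformat>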
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>In this section, we report experimental results obtained with
the Allen-Cahn PDE using a simple quadratic mask, which
contrast the performance of the proposed Self-Adaptive
PINN algorithm against the baseline PINN and two of the
PINN algorithms mentioned in the Related Work section, namely, the
nonadaptive weighting and time-adaptive schemes (for the
latter, Approach 1 in (Wight and Zhao 2020) was used). The
main figure of merit used is the L2-error, similar to related
work in this area, for a direct comparison of the efficacy of
our technique. The code for these examples was written in
TensorFlow 2 and is available on GitHub at
https://github.com/levimcclenny/SA-PINNs, where all the
implementation details are publicly available for reproducibility.</p>
      <sec id="sec-4-1">
        <title>Allen-Cahn Equation</title>
        <p>The Allen-Cahn reaction-diffusion PDE is typically encountered in phase-field models, which can be used, for instance, to simulate the phase separation process in the microstructure evolution of metallic alloys (Moelans, Blanpain, and Wollants 2008; Shen and Yang 2010; Kunselman et al. 2020). The Allen-Cahn PDE considered here is specified as</p>
        <p>$u_t - 0.0001\, u_{xx} + 5u^3 - 5u = 0, \quad x \in [-1, 1], \; t \in [0, 1],$ (21)</p>
        <p>$u(x, 0) = x^2 \cos(\pi x),$ (22)</p>
        <p>$u(-1, t) = u(1, t),$ (23)</p>
        <p>$u_x(-1, t) = u_x(1, t).$ (24)</p>
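        <p>Under the benchmark form reproduced above, the corresponding residual can be written down directly; the sketch below assumes the illustrative u_net from the earlier snippets and is not the authors' implementation.</p>
        <preformat>
# Residual of the Allen-Cahn benchmark above:
#   r = u_t - 0.0001 * u_xx + 5 * u**3 - 5 * u
def allen_cahn_residual(x, t):
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape(persistent=True) as inner:
            inner.watch([x, t])
            u = u_net(tf.concat([x, t], axis=1))
        u_x = inner.gradient(u, x)
        u_t = inner.gradient(u, t)
    u_xx = outer.gradient(u_x, x)
    return u_t - 0.0001 * u_xx + 5.0 * u**3 - 5.0 * u
</preformat>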
        <p>The Allen-Cahn PDE is an interesting benchmark for PINNs for multiple reasons. It is a stiffer semi-linear PDE that challenges PINNs to approximate solutions with sharp space and time transitions, and it also introduces periodic boundary conditions (23, 24). In order to deal with the latter, the boundary loss function $\mathcal{L}_b(w, \lambda_b)$ in (11) is replaced by</p>
        <p>$\mathcal{L}_b(w, \lambda_b) = \frac{1}{N_b} \sum_{i=1}^{N_b} \lambda_b^i \left( |u(1, t_b^i) - u(-1, t_b^i)|^2 + |u_x(1, t_b^i) - u_x(-1, t_b^i)|^2 \right).$ (25)</p>
        <p>The neural network architecture is fully connected with layer sizes [2, 128, 128, 128, 128, 1]. (The 2 inputs to the network are $(x, t)$ pairs and the output is the approximated value of $u$.) This architecture is identical to (Wight and Zhao 2020), in order to allow a direct comparison of performance. We set the number of collocation, initial, and boundary points to $N_r = 20{,}000$, $N_0 = 100$, and $N_b = 100$,
respectively (due to the periodic boundary condition, there are in fact 200 boundary points). Here we hold the boundary weights $\lambda_b^i$ at 1, while the initial weights $\lambda_0^i$ and collocation weights $\lambda_r^i$ are trained. The initial and collocation weights are initialized from a uniform distribution on the intervals $[0, 100]$ and $[0, 1]$, respectively. Training took 13 ms/iteration on an Nvidia V100 GPU.</p>
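        <p>For completeness, the sketch below mirrors the setup just described: a [2, 128, 128, 128, 128, 1] fully connected network and a periodic boundary term as in (25) with the boundary weights held at 1. It is an illustrative sketch (the activation is assumed to be tanh); the full configuration is in the linked repository.</p>
        <preformat>
# Fully connected network with the layer sizes stated above (illustrative
# re-definition of `u_net`; activation assumed to be tanh).
u_net = tf.keras.Sequential(
    [tf.keras.Input(shape=(2,))]
    + [tf.keras.layers.Dense(128, activation="tanh") for _ in range(4)]
    + [tf.keras.layers.Dense(1)]
)

def periodic_boundary_loss(t_b):
    """Eq. (25) with the boundary weights held at 1: mean over boundary times
    of |u(1,t)-u(-1,t)|^2 + |u_x(1,t)-u_x(-1,t)|^2."""
    x_right = tf.ones_like(t_b)
    x_left = -tf.ones_like(t_b)
    with tf.GradientTape(persistent=True) as tape:
        tape.watch([x_right, x_left])
        u_right = u_net(tf.concat([x_right, t_b], axis=1))
        u_left = u_net(tf.concat([x_left, t_b], axis=1))
    ux_right = tape.gradient(u_right, x_right)
    ux_left = tape.gradient(u_left, x_left)
    return tf.reduce_mean(tf.square(u_right - u_left)
                          + tf.square(ux_right - ux_left))
</preformat>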
        <p>Numerical results obtained with the Self-Adaptive PINN are displayed in Figure 2. The average L2 error across 10 runs with random restarts was 2.1% ± 1.21%, while the L2 error on 10 runs obtained by the time-adaptive approach in (Wight and Zhao 2020) was 8.0% ± 0.56%. Neither the baseline PINN nor the nonadaptive weighted scheme, with fixed initial condition weight C = 100, was able to solve this PDE satisfactorily, with L2 errors of 96.15% ± 6.45% and 49.61% ± 2.50%, respectively; these numbers matched almost exactly those reported in (Wight and Zhao 2020).</p>
        <p>The plot in Figure 3 is unique to the proposed
self-adaptive PINN algorithm. It displays the trained weights for
the collocation points across the spatio-temporal domain.
These are the weights of the multiplicative soft attention
mask self-imposed by the PINN. This plot stays remarkably
constant across different runs with random restarts, which is
an indication that it is a property of the particular PDE
being solved. We can observe that in this case, more attention
is needed early in the solution, but not uniformly across the
space variable. In (Wight and Zhao 2020), this observation
was justified by the fact that the Allen-Cahn PDE describes
a time-irreversible reaction-diffusion process, where the
solution has to be approximated well early. However, here
this fact is “discovered” by the self-adaptive PINN itself.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we introduced a novel PINN algorithm based
on self-adaptation. This approach uses a conceptual
framework that is reminiscent of soft attention mechanisms
employed in Computer Vision, in that the network identifies
which inputs are most important to its own training.
Experimental results with the Allen-Cahn PDE system indicate
that Self-Adaptive PINNs allow for more accurate solutions
of PDEs at a smaller computational cost than other
state-of-the-art PINN algorithms. We believe that self-adaptive
PINNs open up new possibilities for the improvement and
implementation of PINN solvers for complex nonlinear,
semi-linear, and stiff PDEs in engineering and science.</p>
      <p>Acknowledgments. The authors would like to acknowledge the support of the D3EM program funded through NSF Award DGE-1545403. The authors would further like to thank the US Army CCDC Army Research Lab for their generous support and affiliation, as well as the Nvidia DGX Station hardware which allowed the implementation and experimentation shown in this abstract.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-wang-teng-perdikaris-2020">
        <mixed-citation>Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.</mixed-citation>
      </ref>
      <ref id="ref-wang-yu-perdikaris-2020">
        <mixed-citation>Wang, S.; Yu, X.; and Perdikaris, P. 2020. When and why PINNs fail to train: A neural tangent kernel perspective. arXiv preprint arXiv:2007.14527.</mixed-citation>
      </ref>
      <ref id="ref-wight-zhao-2020">
        <mixed-citation>Wight, C. L.; and Zhao, J. 2020. Solving Allen-Cahn and Cahn-Hilliard Equations using the Adaptive Physics Informed Neural Networks. arXiv preprint arXiv:2007.04542.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>