<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Self-Adaptive Physics-Informed Neural Networks using a Soft Attention Mechanism</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Levi McClenny</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulisses Braga-Neto</string-name>
          <email>ulissesg@tamu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Computer Engineering, Texas A&amp;M University, College Station</institution>
          ,
          <addr-line>TX 77845</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Physics-Informed Neural Networks (PINNs) have emerged recently as a promising application of deep neural networks to the numerical solution of nonlinear partial differential equations (PDEs). However, the solutions of stiff and semi-linear PDEs can contain regions where the solution and its gradient change rapidly, creating difficulties in training the solution network. It has been recognized that adaptive procedures are needed to force the neural network to fit these “stubborn” spots in the solution accurately. To accomplish that, previous approaches have used fixed weights in the loss function, hard-coded over regions of the solution deemed to be important. In this paper, we propose a new method to train PINNs adaptively, using fully-trainable weights that force the neural network to focus on regions of the solution that are difficult to approximate, in a way that is reminiscent of the soft multiplicative attention masks used in Computer Vision. The key idea in Self-Adaptive PINNs is to make the weights increase as the corresponding losses increase, which is accomplished by training the network to simultaneously minimize the losses and maximize the weights, as in augmented Lagrangian and constraint-satisfaction methods in classical nonlinear optimization. We present numerical experiments with the Allen-Cahn PDE in which the Self-Adaptive PINN outperformed other state-of-the-art PINN algorithms in L2 error, while using a smaller number of training epochs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>As part of the burgeoning field of scientific machine
learning (Baker et al. 2019), physics-informed neural networks
(PINNs) have emerged recently as an alternative to
traditional partial differential equation (PDE) solvers (Raissi,
Perdikaris, and Karniadakis 2019; Raissi 2018; Wight and
Zhao 2020; Wang, Yu, and Perdikaris 2020). Typical
blackbox deep learning methodologies do not take into account
physical understanding of the problem domain. The PINN
approach is based on constraining the output of a deep
neural network to satisfy a physical model specified by a PDE.</p>
      <p>A great advantage of PINNs over traditional
timestepping PDE solvers is that the entire spatial-temporal
domain can be solved at once using collocation points
distributed irregularly (rather than on a grid) across the
spatial-temporal domain, in a process that can be massively
parallelized via GPU. As we have continued to see GPU
capabilities increase in recent years, a method that relies on
parallelism in training iterations could begin to emerge as the
predominant approach in scientific computing.</p>
      <p>The original continuous PINN in (Raissi, Perdikaris, and
Karniadakis 2019), henceforth referred to as the “baseline
PINN,” is effective at estimating solutions that are
reasonably smooth, such as Burgers’ equation, the wave
equation, Poisson’s equation, and Schrödinger’s equation. On the
other hand, it has been observed that the baseline PINN has
convergence and accuracy problems when solving more stiff
semi-linear PDEs, with solutions that contain sharp and
intricate space and time transitions (Wight and Zhao 2020;
Wang, Teng, and Perdikaris 2020). This is the case, for
example, of the Allen-Cahn and Cahn-Hilliard equations of
phase-field models (Moelans, Blanpain, and Wollants 2008).</p>
      <p>To address this issue, various modifications of the
baseline PINN algorithm have been proposed. For example, in
(Wight and Zhao 2020), a series of schemes are introduced,
including nonadaptive weighting of the training loss
function, adaptive resampling of the collocation points, and
time-adaptive approaches, while in (Wang, Teng, and Perdikaris
2020), a learning rate annealing scheme was proposed. The
consensus has been that adaptation mechanisms are
essential to make PINNs more stable and able to approximate
difficult regions of the solution well.</p>
      <p>This paper introduces Self-Adaptive PINNs, a simple
solution to the adaptation problem for solving partial
differential equations (PDEs), which uses trainable weights as a
soft multiplicative mask reminiscent of the attention
mechanism used in computer vision (Wang et al. 2017; Pang et al.
2019). The weights are trained concurrently with the
approximation network. As a result, initial, boundary or
collocation points in difficult regions of the solution are
automatically weighted heavier in the loss function, forcing the
approximation to improve on those points. Experimental
results show that Self-Adaptive PINNs can solve the
traditionally “stiffer” Allen-Cahn PDE accurately. The Self-Adaptive
PINN displayed more accurate results than other
state-of-the-art PINN adaptive training algorithms, while using a
smaller number of training epochs.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>Overview of Physics-Informed Neural Networks</title>
        <p>Consider a general nonlinear PDE of the form:</p>
        <p>$u_t + \mathcal{N}_x[u] = 0, \quad x \in \Omega, \; t \in [0, T],$ (1)</p>
        <p>$u(x, t) = g(x, t), \quad x \in \partial\Omega, \; t \in [0, T],$ (2)</p>
        <p>$u(x, 0) = h(x), \quad x \in \Omega,$ (3)</p>
        <p>where $x$ is a spatial vector variable in a domain $\Omega \subset \mathbb{R}^d$, $t$ is time, and $\mathcal{N}_x$ is a spatial differential operator. Following (Raissi, Perdikaris, and Karniadakis 2019), let $u(x, t)$ be approximated by the output $u_\theta(x, t)$ of a deep neural network with inputs $x$ and $t$. Define the residual as:</p>
        <p>$r_\theta(x, t) := \frac{\partial u_\theta}{\partial t}(x, t) + \mathcal{N}_x[u_\theta(x, t)],$ (4)</p>
        <p>where all partial derivatives can be computed by automatic differentiation methods (Baydin et al. 2017; Paszke et al. 2017). The parameters $\theta$ are trained by backpropagation (Chauvin and Rumelhart 1995) on a loss function that penalizes the output for not satisfying (1)-(3):</p>
        <p>$\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_b(\theta) + \mathcal{L}_0(\theta),$ (5)</p>
        <p>where $\mathcal{L}_r$ is the loss corresponding to the residual (4), $\mathcal{L}_b$ is the loss due to the boundary conditions (2), and $\mathcal{L}_0$ is the loss due to the initial conditions (3):</p>
        <p>$\mathcal{L}_r(\theta) = \frac{1}{N_r} \sum_{i=1}^{N_r} r_\theta(x_r^i, t_r^i)^2,$ (6)</p>
        <p>$\mathcal{L}_b(\theta) = \frac{1}{N_b} \sum_{i=1}^{N_b} |u_\theta(x_b^i, t_b^i) - g_b^i|^2,$ (7)</p>
        <p>$\mathcal{L}_0(\theta) = \frac{1}{N_0} \sum_{i=1}^{N_0} |u_\theta(x_0^i, 0) - h_0^i|^2,$ (8)</p>
        <p>where $\{x_0^i, h_0^i = h(x_0^i)\}_{i=1}^{N_0}$ are the data at time $t = 0$, $\{x_b^i, t_b^i, g_b^i = g(x_b^i, t_b^i)\}_{i=1}^{N_b}$ are the data at the boundary, $\{x_r^i, t_r^i\}_{i=1}^{N_r}$ are collocation points randomly distributed in the domain $\Omega$, and $N_0$, $N_b$, and $N_r$ denote the total number of initial data, boundary data, and collocation points, respectively. The parameters $\theta$ can be tuned by minimizing the total training loss $\mathcal{L}(\theta)$ via standard gradient descent procedures used in deep learning.</p>
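        <p>To make the pieces above concrete, the sketch below shows one way the residual (4) and the unweighted composite loss (5)-(8) might be computed with automatic differentiation in TensorFlow 2, the framework used for the experiments reported later. The network u_net, the Burgers-type example operator, and all names are illustrative assumptions, not the authors' implementation.</p>
        <preformat>
import tensorflow as tf

# Minimal illustrative sketch (not the authors' code). `u_net` maps a batch of
# (x, t) pairs, given as two column tensors of shape (N, 1), to u_theta(x, t).
u_net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(1),
])

def residual(x, t, nu=0.01):
    """Residual r = u_t + N_x[u]; here N_x[u] = u*u_x - nu*u_xx (a Burgers-type
    operator, purely for illustration). Derivatives come from autodiff."""
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape(persistent=True) as inner:
            inner.watch([x, t])
            u = u_net(tf.concat([x, t], axis=1))
        u_x = inner.gradient(u, x)   # recorded by `outer` as well
        u_t = inner.gradient(u, t)
    u_xx = outer.gradient(u_x, x)
    return u_t + u * u_x - nu * u_xx

def pinn_loss(x_r, t_r, x_b, t_b, g_b, x_0, h_0):
    """Unweighted composite loss L = L_r + L_b + L_0 of Eqs. (5)-(8)."""
    l_r = tf.reduce_mean(tf.square(residual(x_r, t_r)))
    l_b = tf.reduce_mean(tf.square(u_net(tf.concat([x_b, t_b], 1)) - g_b))
    u_0 = u_net(tf.concat([x_0, tf.zeros_like(x_0)], 1))
    l_0 = tf.reduce_mean(tf.square(u_0 - h_0))
    return l_r + l_b + l_0
</preformat>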
      </sec>
      <sec id="sec-2-2">
        <title>Related Work</title>
        <p>The baseline PINN algorithm can be unstable during
training and produce inaccurate approximations around sharp
space and time transitions in the solution of semi-linear
PDEs. Much of the recent literature on PINNs has been
devoted to mitigating these issues by introducing modifications
to the baseline PINN algorithm that can increase training
stability and accuracy of the approximation, mostly via
attempting to mitigate spectral bias inherent to neural network
approximations. We mention some of these approaches
below.
</p>
        <p>Nonadaptive Weighting. In (Wight and Zhao 2020), it
was pointed out that a premium should be put on forcing
the neural network to satisfy the initial conditions closely,
especially for PDEs describing time-irreversible processes,
where the solution has to be approximated well early.
Accordingly, a loss function of the form $\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_b(\theta) + C\,\mathcal{L}_0(\theta)$ was suggested, where $C \gg 1$ is a hyperparameter.</p>
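        <p>As a minimal sketch, assuming the individual losses have already been computed as in the earlier snippet, the scheme amounts to a single fixed scaling factor; C = 100 is the value used for the comparison in the Results section.</p>
        <preformat>
# Nonadaptive weighting in the spirit of (Wight and Zhao 2020): the initial-
# condition loss is scaled by a fixed hyperparameter C, chosen once by hand.
C = 100.0  # illustrative; value used for the comparison in the Results section

def weighted_pinn_loss(l_r, l_b, l_0):
    return l_r + l_b + C * l_0
</preformat>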
      </sec>
      <sec id="sec-2-3">
        <title>Learning Rate Annealing</title>
        <p>In (Wang, Teng, and Perdikaris 2020), it is argued that the optimal value of the
weight C in the previous scheme may vary wildly among
different PDEs so that choosing its value would be difficult.
Instead they propose to use weights that are tuned during
training using statistics of the backpropagated gradients of
the loss function. It is noteworthy that the weights
themselves are not adjusted by backpropagation. Instead, they
behave as learning rate coefficients, which are updated
after each epoch of training.</p>
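        <p>The following is only a rough sketch of the flavor of such a scheme, under the assumption that the per-term weights are set from ratios of gradient statistics and smoothed between epochs; the exact statistic and update rule in (Wang, Teng, and Perdikaris 2020) differ in their details.</p>
        <preformat>
# Rough sketch only: per-term loss weights derived from statistics of the
# backpropagated gradients and smoothed with a moving average. The precise
# rule in (Wang, Teng, and Perdikaris 2020) differs in detail.
def update_loss_weights(tape, l_r, l_b, l_0, weights, alpha=0.9):
    """`tape` is a persistent GradientTape that recorded l_r, l_b, and l_0;
    `weights` is a dict like {"b": 1.0, "0": 1.0} updated each epoch."""
    grads_r = tape.gradient(l_r, u_net.trainable_variables)
    max_r = tf.reduce_max([tf.reduce_max(tf.abs(g)) for g in grads_r])
    for key, loss in (("b", l_b), ("0", l_0)):
        grads = tape.gradient(loss, u_net.trainable_variables)
        # mean of per-variable mean absolute gradients
        mean = tf.reduce_mean([tf.reduce_mean(tf.abs(g)) for g in grads])
        target = max_r / (mean + 1e-8)
        weights[key] = alpha * weights[key] + (1.0 - alpha) * target
    return weights
</preformat>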
        <p>Adaptive Resampling. In (Wight and Zhao 2020), a
strategy to adaptively resample the residual collocation points
based on the magnitude of the residual is proposed. While
this approach improves the approximation, the training
process must be interrupted and the MSE evaluated on the
residual points to deterministically resample the ones with the
highest error. After each resampling step, the number of
residual points grows, increasing computational complexity.</p>
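        <p>A minimal sketch of the idea, assuming the illustrative residual helper from earlier and a large candidate pool of points; the resampling criterion and schedule in (Wight and Zhao 2020) are more elaborate.</p>
        <preformat>
# Sketch of residual-based resampling: periodically score a large candidate
# pool by residual magnitude and append the worst-fit points to the
# collocation set (which therefore grows over training).
def resample_collocation(x_pool, t_pool, x_r, t_r, n_add=1000):
    scores = tf.abs(residual(x_pool, t_pool))[:, 0]
    worst = tf.argsort(scores, direction="DESCENDING")[:n_add]
    x_r = tf.concat([x_r, tf.gather(x_pool, worst)], axis=0)
    t_r = tf.concat([t_r, tf.gather(t_pool, worst)], axis=0)
    return x_r, t_r
</preformat>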
      </sec>
      <sec id="sec-2-4">
        <title>Time-Adaptive Approaches</title>
        <p>In (Wight and Zhao 2020), another method is suggested, which divides the time axis
into several smaller intervals, and trains PINNs separately
on them, either sequentially or in parallel. This approach is
time-consuming due to the need to train multiple PINNs.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Neural Tangent Kernel (NTK) Weighting</title>
        <p>Most recently, (Wang, Yu, and Perdikaris 2020) introduced weights
on the collocation and boundary losses, which are updated
via neural tangent kernels. This approach derives a
deterministic kernel which remains constant or is updated
periodically at preset time intervals during training.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>While the methods outlined in the previous section
produce improvements in stability and accuracy over the
baseline PINN, they are either nonadaptive or require brute-force
adaptation at increased computational cost. Here we propose
a self-adaptive procedure that uses fully-trainable weights to
produce a multiplicative soft attention mask, in a manner
that is reminiscent of attention mechanisms used in
computer vision (Wang et al. 2017; Pang et al. 2019). This is
in agreement with the neural network philosophy of
self-adaptation: instead of hard-coding weights at particular
regions of the solution, the adaptation weights are updated by
backpropagation together with the network weights.</p>
      <p>The proposed Self-Adaptive PINN utilizes the following loss function</p>
      <p>$\mathcal{L}(w, \lambda_r, \lambda_b, \lambda_0) = \mathcal{L}_r(w, \lambda_r) + \mathcal{L}_b(w, \lambda_b) + \mathcal{L}_0(w, \lambda_0),$ (9)</p>
      <p>where $\lambda_r = (\lambda_r^1, \ldots, \lambda_r^{N_r})$, $\lambda_b = (\lambda_b^1, \ldots, \lambda_b^{N_b})$, and $\lambda_0 = (\lambda_0^1, \ldots, \lambda_0^{N_0})$ are trainable, nonnegative self-adaptation weights for the collocation, boundary, and initial points, respectively, and</p>
      <p>$\mathcal{L}_r(w, \lambda_r) = \frac{1}{N_r} \sum_{i=1}^{N_r} g(\lambda_r^i)\, r(x_r^i, t_r^i; w)^2,$ (10)</p>
      <p>$\mathcal{L}_b(w, \lambda_b) = \frac{1}{N_b} \sum_{i=1}^{N_b} g(\lambda_b^i)\, (u(x_b^i, t_b^i; w) - g_b^i)^2,$ (11)</p>
      <p>$\mathcal{L}_0(w, \lambda_0) = \frac{1}{N_0} \sum_{i=1}^{N_0} g(\lambda_0^i)\, (u(x_0^i, 0; w) - h_0^i)^2,$ (12)</p>
      <p>where the self-adaptation mask function $g$ is a nonnegative, differentiable, strictly increasing function. The key feature of Self-Adaptive PINNs is that the loss $\mathcal{L}(w, \lambda_r, \lambda_b, \lambda_0)$ is minimized with respect to the network weights $w$, as usual, but is maximized with respect to the self-adaptation weights $\lambda_r, \lambda_b, \lambda_0$, i.e., the objective is:</p>
      <p>$\min_{w} \; \max_{\lambda_r, \lambda_b, \lambda_0} \; \mathcal{L}(w, \lambda_r, \lambda_b, \lambda_0).$ (13)</p>
      <p>Consider the updates of a gradient descent/ascent approach to this problem:</p>
      <p>$w_{k+1} = w_k - \eta_k \nabla_w \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (14)</p>
      <p>$\lambda_r^{k+1} = \lambda_r^k + \eta_k \nabla_{\lambda_r} \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (15)</p>
      <p>$\lambda_b^{k+1} = \lambda_b^k + \eta_k \nabla_{\lambda_b} \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (16)</p>
      <p>$\lambda_0^{k+1} = \lambda_0^k + \eta_k \nabla_{\lambda_0} \mathcal{L}(w_k, \lambda_r^k, \lambda_b^k, \lambda_0^k),$ (17)</p>
      <p>where $\eta_k$ is the learning rate at step $k$, and</p>
      <p>$\nabla_{\lambda_r} \mathcal{L} = \left[\, g'(\lambda_r^{k,1})\, r(x_r^1, t_r^1; w_k)^2, \; \ldots, \; g'(\lambda_r^{k,N_r})\, r(x_r^{N_r}, t_r^{N_r}; w_k)^2 \,\right]^T,$ (18)</p>
      <p>$\nabla_{\lambda_b} \mathcal{L} = \left[\, g'(\lambda_b^{k,1})\, (u(x_b^1, t_b^1; w_k) - g_b^1)^2, \; \ldots, \; g'(\lambda_b^{k,N_b})\, (u(x_b^{N_b}, t_b^{N_b}; w_k) - g_b^{N_b})^2 \,\right]^T,$ (19)</p>
      <p>$\nabla_{\lambda_0} \mathcal{L} = \left[\, g'(\lambda_0^{k,1})\, (u(x_0^1, 0; w_k) - h_0^1)^2, \; \ldots, \; g'(\lambda_0^{k,N_0})\, (u(x_0^{N_0}, 0; w_k) - h_0^{N_0})^2 \,\right]^T.$ (20)</p>
      <p>Hence, if $g'(\lambda) > 0$, i.e., the mask function is strictly increasing, then $\nabla_{\lambda_r} \mathcal{L}, \nabla_{\lambda_b} \mathcal{L}, \nabla_{\lambda_0} \mathcal{L} \geq 0$, and any of these gradients is only zero if the corresponding unmasked loss is zero; e.g., $\nabla_{\lambda_0} \mathcal{L} = 0$ if and only if $u(x_0^i, 0; w_k) = h_0^i$ for all $i = 1, \ldots, N_0$, i.e., the neural network approximation satisfies the initial condition perfectly (at all given points). This shows that the sequences of weights $\{\lambda_r^k\}$, $\{\lambda_b^k\}$, $\{\lambda_0^k\}$, $k = 1, 2, \ldots$ (and the associated mask values) are monotonically increasing, provided that the corresponding unmasked losses are nonzero. Furthermore, the magnitudes of the gradients $\nabla_{\lambda_r} \mathcal{L}, \nabla_{\lambda_b} \mathcal{L}, \nabla_{\lambda_0} \mathcal{L}$, and therefore of the updates, are larger if the corresponding unmasked losses are larger. This progressively penalizes the network more for not fitting the residual, boundary, and initial points closely (the self-adaptive weights, i.e., the amount of penalty, are typically initialized to small nonzero values). We remark that any of the weights can be set to fixed, non-trainable values, if desired. For example, by setting $\lambda_b^k \equiv 1$, only the weights of the initial and collocation points would be trained.</p>
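      <p>A condensed sketch of one such descent/ascent step is given below, reusing the illustrative u_net and residual helpers from the Background section together with a quadratic mask; the boundary term is omitted for brevity, and the full TensorFlow 2 implementation linked in the Results section differs in its details.</p>
      <preformat>
# Condensed sketch of one self-adaptive training step (illustrative, not the
# reference implementation). The network weights w follow gradient *descent*
# on L; the self-adaptation weights follow gradient *ascent*.
N_r, N_0 = 20_000, 100
lam_r = tf.Variable(tf.random.uniform([N_r, 1], 0.0, 1.0))    # collocation weights
lam_0 = tf.Variable(tf.random.uniform([N_0, 1], 0.0, 100.0))  # initial-condition weights

opt_w = tf.keras.optimizers.Adam(1e-3)
opt_lam = tf.keras.optimizers.Adam(1e-3)

def mask(lam):              # quadratic mask g(lambda) = lambda**2
    return tf.square(lam)

@tf.function
def sa_train_step(x_r, t_r, x_0, h_0):
    with tf.GradientTape(persistent=True) as tape:
        l_r = tf.reduce_mean(mask(lam_r) * tf.square(residual(x_r, t_r)))
        u_0 = u_net(tf.concat([x_0, tf.zeros_like(x_0)], axis=1))
        l_0 = tf.reduce_mean(mask(lam_0) * tf.square(u_0 - h_0))
        loss = l_r + l_0                       # boundary term omitted here
    grads_w = tape.gradient(loss, u_net.trainable_variables)
    grads_lam = tape.gradient(loss, [lam_r, lam_0])
    opt_w.apply_gradients(zip(grads_w, u_net.trainable_variables))
    # Ascent on the self-adaptation weights: apply the negated gradients.
    opt_lam.apply_gradients(zip([-g for g in grads_lam], [lam_r, lam_0]))
    return loss
</preformat>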
      <p>The shape of the function g affects mask sharpness and
training of the PINN. Examples include polynomial masks
$g(\lambda) = c\lambda^q$, for $c, q > 0$, and sigmoidal masks. See Figure 1
for a few examples. In practice, the polynomial mask
functions have to be kept below a suitable (large) value, to avoid
numerical overflow. The sigmoidal masks do not have this
issue, and can also be used to produce sharp masks.</p>
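      <p>Two illustrative mask choices, with assumed constants, are sketched below; the quadratic polynomial mask is the one used in the experiments of the next section, while a scaled sigmoid is one way to obtain a bounded (and hence overflow-free) sharp mask.</p>
      <preformat>
# Illustrative mask functions g: nonnegative, differentiable, and strictly
# increasing for lambda >= 0, as required. Constants are illustrative.
def polynomial_mask(lam, c=1.0, q=2.0):
    return c * tf.pow(lam, q)          # q = 2 gives the simple quadratic mask

def sigmoid_mask(lam, scale=100.0):
    return scale * tf.sigmoid(lam)     # bounded above by `scale`, avoids overflow
</preformat>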
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>In this section, we report experimental results obtained with
the Allen-Cahn PDE using a simple quadratic mask, which
contrast the performance of the proposed Self-Adaptive
PINN algorithm against the baseline PINN and two of the
PINN algorithms mentioned in the Related Work section, namely, the
nonadaptive weighting and time-adaptive schemes (for the
latter, Approach 1 in (Wight and Zhao 2020) was used). The
main figure of merit used is the L2-error, similar to related
work in this area, for a direct comparison of the efficacy of
our technique. The code for these examples was written in
TensorFlow 2 and is available on GitHub at
https://github.com/levimcclenny/SA-PINNs, where all the
implementation details are publicly available for reproducibility.</p>
      <sec id="sec-4-1">
        <title>Allen-Cahn Equation</title>
        <p>The Allen-Cahn reaction-diffusion PDE is typically encountered in phase-field models, which can be used, for instance, to simulate the phase separation process in the microstructure evolution of metallic alloys (Moelans, Blanpain, and Wollants 2008; Shen and Yang 2010; Kunselman et al. 2020). The Allen-Cahn PDE considered here is specified as</p>
        <p>$u_t - 0.0001\, u_{xx} + 5u^3 - 5u = 0, \quad x \in [-1, 1], \; t \in [0, 1],$ (21)</p>
        <p>$u(x, 0) = x^2 \cos(\pi x),$ (22)</p>
        <p>$u(-1, t) = u(1, t),$ (23)</p>
        <p>$u_x(-1, t) = u_x(1, t).$ (24)</p>
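        <p>Under the benchmark form reproduced above, the corresponding residual can be written down directly; the sketch below assumes the illustrative u_net from the earlier snippets and is not the authors' implementation.</p>
        <preformat>
# Residual of the Allen-Cahn benchmark above:
#   r = u_t - 0.0001 * u_xx + 5 * u**3 - 5 * u
def allen_cahn_residual(x, t):
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape(persistent=True) as inner:
            inner.watch([x, t])
            u = u_net(tf.concat([x, t], axis=1))
        u_x = inner.gradient(u, x)
        u_t = inner.gradient(u, t)
    u_xx = outer.gradient(u_x, x)
    return u_t - 0.0001 * u_xx + 5.0 * u**3 - 5.0 * u
</preformat>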
        <p>The Allen-Cahn PDE is an interesting benchmark for PINNs for multiple reasons. It is a stiffer semi-linear PDE that challenges PINNs to approximate solutions with sharp space and time transitions, and it also introduces periodic boundary conditions (23, 24). In order to deal with the latter, the boundary loss function $\mathcal{L}_b(w, \lambda_b)$ in (11) is replaced by</p>
        <p>$\mathcal{L}_b(w, \lambda_b) = \frac{1}{N_b} \sum_{i=1}^{N_b} \lambda_b^i \left( |u(1, t_b^i) - u(-1, t_b^i)|^2 + |u_x(1, t_b^i) - u_x(-1, t_b^i)|^2 \right).$ (25)</p>
        <p>The neural network architecture is fully connected with layer sizes [2, 128, 128, 128, 128, 1]. (The 2 inputs to the network are $(x, t)$ pairs and the output is the approximated value of $u$.) This architecture is identical to (Wight and Zhao 2020), in order to allow a direct comparison of performance. We set the number of collocation, initial, and boundary points to $N_r = 20{,}000$, $N_0 = 100$, and $N_b = 100$,
respectively (due to the periodic boundary condition, there are in fact 200 boundary points). Here we hold the boundary weights $\lambda_b^i$ at 1, while the initial weights $\lambda_0^i$ and collocation weights $\lambda_r^i$ are trained. The initial and collocation weights are initialized from a uniform distribution on the intervals $[0, 100]$ and $[0, 1]$, respectively. Training took 13 ms/iteration on an Nvidia V100 GPU.</p>
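        <p>For completeness, the sketch below mirrors the setup just described: a [2, 128, 128, 128, 128, 1] fully connected network and a periodic boundary term as in (25) with the boundary weights held at 1. It is an illustrative sketch (the activation is assumed to be tanh); the full configuration is in the linked repository.</p>
        <preformat>
# Fully connected network with the layer sizes stated above (illustrative
# re-definition of `u_net`; activation assumed to be tanh).
u_net = tf.keras.Sequential(
    [tf.keras.Input(shape=(2,))]
    + [tf.keras.layers.Dense(128, activation="tanh") for _ in range(4)]
    + [tf.keras.layers.Dense(1)]
)

def periodic_boundary_loss(t_b):
    """Eq. (25) with the boundary weights held at 1: mean over boundary times
    of |u(1,t)-u(-1,t)|^2 + |u_x(1,t)-u_x(-1,t)|^2."""
    x_right = tf.ones_like(t_b)
    x_left = -tf.ones_like(t_b)
    with tf.GradientTape(persistent=True) as tape:
        tape.watch([x_right, x_left])
        u_right = u_net(tf.concat([x_right, t_b], axis=1))
        u_left = u_net(tf.concat([x_left, t_b], axis=1))
    ux_right = tape.gradient(u_right, x_right)
    ux_left = tape.gradient(u_left, x_left)
    return tf.reduce_mean(tf.square(u_right - u_left)
                          + tf.square(ux_right - ux_left))
</preformat>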
        <p>Numerical results obtained with the Self-Adaptive PINN are displayed in Figure 2. The average L2 error across 10 runs with random restarts was 2.1% ± 1.21%, while the L2 error on 10 runs obtained by the time-adaptive approach in (Wight and Zhao 2020) was 8.0% ± 0.56%. Neither the baseline PINN nor the nonadaptive weighted scheme, with fixed initial condition weight C = 100, was able to solve this PDE satisfactorily, with L2 errors of 96.15% ± 6.45% and 49.61% ± 2.50%, respectively; these numbers matched almost exactly those reported in (Wight and Zhao 2020).</p>
        <p>The plot in Figure 3 is unique to the proposed
self-adaptive PINN algorithm. It displays the trained weights for
the collocation points across the spatio-temporal domain.
These are the weights of the multiplicative soft attention
mask self-imposed by the PINN. This plot stays remarkably
constant across different runs with random restarts, which is
an indication that it is a property of the particular PDE
being solved. We can observe that in this case, more attention
is needed early in the solution, but not uniformly across the
space variable. In (Wight and Zhao 2020), this observation
was justified by the fact that the Allen-Cahn PDE describes
a time-irreversible reaction-diffusion process, where the
solution has to be approximated well early. However, here
this fact is “discovered” by the self-adaptive PINN itself.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we introduced a novel PINN algorithm based
on self-adaptation. This approach uses a conceptual
framework that is reminiscent of soft attention mechanisms
employed in Computer Vision, in that the network identifies
which inputs are most important to its own training.
Experimental results with the Allen-Cahn PDE system indicate
that Self-Adaptive PINNs allow for more accurate solutions
of PDEs at a smaller computational cost than other
state-of-the-art PINN algorithms. We believe that self-adaptive
PINNs open up new possibilities for the improvement and
implementation of PINN solvers for complex nonlinear,
semi-linear, and stiff PDEs in engineering and science.</p>
      <p>Acknowledgments. The authors would like to acknowledge the support of the D3EM program funded through NSF Award DGE-1545403. The authors would further like to thank the US Army CCDC Army Research Lab for their generous support and affiliation, as well as the Nvidia DGX Station hardware which allowed the implementation and experimentation shown in this abstract.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref-wang-teng-perdikaris-2020">
        <mixed-citation>Wang, S.; Teng, Y.; and Perdikaris, P. 2020. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv preprint arXiv:2001.04536.</mixed-citation>
      </ref>
      <ref id="ref-wang-yu-perdikaris-2020">
        <mixed-citation>Wang, S.; Yu, X.; and Perdikaris, P. 2020. When and why PINNs fail to train: A neural tangent kernel perspective. arXiv preprint arXiv:2007.14527.</mixed-citation>
      </ref>
      <ref id="ref-wight-zhao-2020">
        <mixed-citation>Wight, C. L.; and Zhao, J. 2020. Solving Allen-Cahn and Cahn-Hilliard Equations using the Adaptive Physics Informed Neural Networks. arXiv preprint arXiv:2007.04542.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>