<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>ADCME MPI: Distributed Machine Learning for Computational Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kailai Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Darve</string-name>
          <email>darveg@stanford.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Computational and Mathematical Engineering, Stanford University</institution>
          ,
          <addr-line>Stanford, CA 94305</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mechanical Engineering, Stanford University</institution>
          ,
          <addr-line>Stanford, CA 94305</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>19</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>We propose a framework for training deep neural networks (DNNs) that are coupled with partial differential equations (PDEs) in a parallel computing environment. Unlike most distributed computing frameworks for DNNs, our focus is to parallelize both numerical solvers and DNNs in forward and adjoint computations. Our parallel computing model views data communication as a node in the computational graph for numerical simulations. The advantage of our model is that data communication and computing are cleanly separated, which enables better flexibility, modularity, and testability of the software. We demonstrate our approach on a large-scale problem and show that we can achieve substantial acceleration by using parallel numerical PDE solvers while training DNNs that are coupled with PDEs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deep neural networks (DNNs) have been demonstrated to be very effective for solving inverse problems in computational engineering
        <xref ref-type="bibr" rid="ref23">(Raissi, Perdikaris, and Karniadakis 2019; Meng et al. 2020; Pakravan et al. 2020)</xref>
        . In our previous work
        <xref ref-type="bibr" rid="ref14 ref18 ref19">(Xu, Huang, and Darve 2020; Xu and Darve 2019; Huang et al. 2020; Fan et al. 2020)</xref>
        , we successfully combined numerical solvers and deep neural networks for data-driven inverse modeling. Mathematically, we consider an implicit model F(f, u) = 0, where f is an unknown function, which we approximate using a DNN. We denote the DNN by N_θ, where θ is the vector of neural network weights and biases; u is a function which depends on f through F(f, u) = 0. We are given some (partial) observations u_obs of the function u, which are used to optimize θ and minimize the difference between f and N_θ. The optimization problem is formulated as a PDE-constrained optimization problem
min_θ L(u)   (u is indirectly a function of θ)   (1)
such that F(N_θ, u) = 0
where L(u) is a loss function, which measures the discrepancy between hypothetical and actual observations. For example, in this paper we use the squared loss function L(u) = ||u - u_obs||_2^2.
      </p>
      <p>One advantage of such a formulation is that the known physics, such as the physical laws described by PDEs, are preserved to the largest extent and solved with well-developed and efficient numerical solvers. Meanwhile, we can leverage the approximation power of DNNs.</p>
      <p>To solve this optimization problem, one can first solve the physics constraint F(N_θ, u) = 0 in Equation (1) numerically and then plug the solution u(θ) into L(u). This leads to an unconstrained optimization problem</p>
      <p>
        min_θ L̃(θ) = L(u(θ))
We developed a Julia (Bezanson et al. 2017) library, ADCME
        <xref ref-type="bibr" rid="ref14 ref18 ref19">(Xu and Darve 2020a)</xref>
        , with a TensorFlow (Abadi et al. 2016) automatic differentiation backend to solve problems of this type. ADCME expresses both the numerical solver and DNNs using computational graphs. Therefore, the gradient ∇_θ L̃(θ) can be calculated automatically by back-propagating gradients through both the numerical solvers1 and DNNs. In this paper, we use “operator” and “node” interchangeably to refer to a node in the computational graph, which is a function that takes incoming edges (intermediate data) as inputs and produces outgoing edges (intermediate data) as outputs.
      </p>
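<p>To make the coupled back-propagation concrete, the following sketch (plain NumPy, not ADCME code; the 2x2 matrix A(θ), the observations, and the scalar parameter θ are made up for illustration) computes ∇_θ L̃(θ) for a small parameterized linear system via the adjoint method and checks it against a finite difference.</p>
<p>
```python
import numpy as np

def solve_u(theta, f):
    # Hypothetical parameterized system A(theta) u = f with dA/dtheta = I.
    A = np.array([[2.0 + theta, -1.0], [-1.0, 2.0 + theta]])
    return A, np.linalg.solve(A, f)

def loss(theta, f, u_obs):
    _, u = solve_u(theta, f)
    return np.sum((u - u_obs) ** 2)

def adjoint_grad(theta, f, u_obs):
    # Adjoint method: solve A^T lam = dL/du, then dL/dtheta = -lam . ((dA/dtheta) u).
    A, u = solve_u(theta, f)
    lam = np.linalg.solve(A.T, 2.0 * (u - u_obs))
    return -lam @ u  # here dA/dtheta = I, so (dA/dtheta) u = u

theta, f, u_obs = 0.3, np.array([1.0, 1.0]), np.array([0.4, 0.4])
g_adj = adjoint_grad(theta, f, u_obs)
h = 1e-6
g_fd = (loss(theta + h, f, u_obs) - loss(theta - h, f, u_obs)) / (2 * h)
# g_adj and g_fd agree up to finite-difference error
```
</p>
<p>The same adjoint structure is what the computational graph executes automatically, with the linear solve and the DNN each contributing their own back-propagation rules.</p>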
      <p>
        However, one challenge with this approach is that for large-scale problems, the memory and computational costs of the numerical solver are prohibitive. The de facto standard for solving such large-scale problems on modern distributed-memory high performance computing (HPC) architectures is the Message Passing Interface (MPI)
        <xref ref-type="bibr" rid="ref15 ref16 ref17">(Gabriel et al. 2004; Gropp, Thakur, and Lusk 1999)</xref>
        . The TensorFlow backend used by ADCME was originally designed for deep learning/machine learning. Although there is much work on extending TensorFlow for distributed training of machine learning models, some of the key capabilities for solving scientific computing problems, such as distributed linear algebra and domain decomposition, are still lacking. This paper describes how we incorporate MPI functionalities into ADCME to achieve scalability and flexibility on distributed-memory architectures.
      </p>
      <p>
        1For details on how the gradient back-propagation works for
numerical solvers in ADCME, we refer readers to
        <xref ref-type="bibr" rid="ref14 ref18 ref19">(Xu and Darve
2020b)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <title>Distributed Computing Models</title>
      <p>
        There is much existing work and software for distributed computing with deep neural networks
        <xref ref-type="bibr" rid="ref20 ref4">(Bekkerman, Bilenko, and Langford 2011; Jordan and Mitchell 2015)</xref>
        and numerical PDEs
        <xref ref-type="bibr" rid="ref12 ref16 ref25">(Griebel and Zumbusch 1999; Notay 1995; Douglas, Haase, and Langer 2003)</xref>
        . The two domains have quite different distributed computing models due to the distinct features of the targeted applications. ADCME MPI embraces a hybrid model that is suitable for inverse modeling in computational engineering, because our method for solving Equation (1) requires a combination of the two models.
      </p>
      <p>
        In deep learning, one major challenge is that datasets are too large to fit into memory. Therefore, both datasets and computational loads are distributed onto different machines, and mini-batch optimization algorithms, such as stochastic gradient descent
        <xref ref-type="bibr" rid="ref9">(Bottou 2010)</xref>
        , are used. Each processor calculates predictions and gradients, which are aggregated on one or more processors. To further scale out in a limited-bandwidth environment, parameter servers
        <xref ref-type="bibr" rid="ref22">(Li et al. 2013, 2014)</xref>
        , where each server stores a part of the parameters, are implemented. There is also extensive work on model parallelism, which parallelizes the computation by splitting the DNNs into multiple parts
        <xref ref-type="bibr" rid="ref10 ref18">(Chen, Yang, and Cheng 2018; Hewett and Grady II 2020)</xref>
        .
      </p>
      <p>The parallel computing model in computational engineering is quite different from deep learning. Computational engineering applications feature data communication across neighboring points in a mesh (domain decomposition) or across different parts of a matrix. In terms of the computational graph, this pattern implies many more communications than just the reduction of gradients at the end. This motivates us to design new distributed computing models for computational engineering inverse modeling applications.</p>
      <p>
        The main idea is to parallelize numerical solvers by splitting the mesh or matrices onto different MPI ranks (or processors). Then, data communication nodes are inserted into the computational graph. Because DNNs are typically small for our computational engineering applications, they are duplicated on each processor. Each computational graph includes a set of “communication” nodes (Fig. 1-top), which are absent in a single-processor computational graph. These operators invoke MPI calls and are in charge of data communication between different compute nodes. During the gradient back-propagation, we need to reverse the data-flow direction and the operation of the data communication operators
        <xref ref-type="bibr" rid="ref20">(Wang and Pothen 2015; Utke et al. 2009)</xref>
        (Fig. 1-bottom).
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        ADCME MPI aims at providing a modular, efficient, and
flexible implementation for distributed computing in inverse
modeling. Conceptually, we can treat data communication
operations as a node in the computational graph: they are
similar to computational nodes (e.g., a linear solver),
except that their responsibility is to invoke MPI calls and
backpropagate the gradients in the reverse mode automatic
differentiation
        <xref ref-type="bibr" rid="ref3">(Baydin et al. 2017)</xref>
        . This solution provides an elegant enhancement to the ADCME library because, to convert a single-processor program to a multi-processor one, users only need to insert data communication nodes as needed, and most parts of the original code are unchanged. In this
section, we briefly describe our contributions in ADCME MPI
to extend its distributed computing capabilities to couple
DNNs and PDE solvers.
      </p>
      <sec id="sec-3-1">
        <title>MPI APIs</title>
        <p>
          ADCME MPI provides a set of commonly used MPI primitives, such as mpi_bcast, mpi_gather, mpi_send, etc. These operators are wrappers for standard MPI APIs. However, these operators are also “differentiable,” in the sense that they can handle gradient back-propagation. The gradient back-propagation functionality uses the fact that there is a one-to-one correspondence between forward and backward MPI calls. For example, the forward “send” corresponds to the backward “receive”
          <xref ref-type="bibr" rid="ref11 ref20">(Cheng 2006; Towara, Schanen, and Naumann 2015)</xref>
          .
        </p>
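<p>This duality can be illustrated without MPI: view a communication primitive as a linear operator, and its back-propagation rule is the transposed operator. A minimal sketch (plain Python, simulating ranks with a list; the function names are illustrative and not the ADCME API) shows that the adjoint of a broadcast is a sum-reduction, verified with a dot-product test.</p>
<p>
```python
def bcast_forward(x, nranks):
    # Forward broadcast: every rank receives a copy of x.
    return [x] * nranks

def bcast_backward(grads):
    # Adjoint of broadcast is sum-reduction: the gradients from all
    # ranks are accumulated back onto the root.
    return sum(grads)

# Adjoint (dot-product) test: dot(ybar, B x) must equal dot(B^T ybar, x).
x = 1.5
ybar = [0.5, -2.0, 3.0]          # incoming gradients on 3 simulated ranks
y = bcast_forward(x, 3)
lhs = sum(g * v for g, v in zip(ybar, y))
rhs = bcast_backward(ybar) * x
# lhs equals rhs, confirming the forward/backward pair is consistent
```
</p>
<p>The send/receive pair satisfies the same identity: the data sent forward from rank i to rank j corresponds to a gradient received by rank i from rank j in the backward pass.</p>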
      </sec>
      <sec id="sec-3-2">
        <title>Halo Exchange</title>
        <p>
          To enable communication between adjacent patches in a mesh, ADCME provides efficient implementations of data communication operators for halo exchange patterns (Fig. 2). Halo exchange patterns are very common in scientific computing because numerical solvers often involve communication between neighboring points
          <xref ref-type="bibr" rid="ref7">(Bianco 2014)</xref>
          . We use a nonblocking send/receive strategy. In the gradient back-propagation phase, this order of send/receive is reversed.
        </p>
        <p>Sparse matrices are very important tools in computational engineering. ADCME MPI stores large sparse matrices in CSR format, and each MPI processor stores a portion of rows with contiguous row indices.</p>
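<p>A halo exchange on a one-dimensional partition can be simulated without MPI as follows (plain Python; the blocks and the single-layer ghost region are illustrative, while ADCME MPI implements the exchange as a differentiable graph operator with nonblocking sends/receives).</p>
<p>
```python
def halo_exchange(blocks):
    # Each simulated "rank" owns a contiguous block; after the exchange
    # it also holds ghost values from its left and right neighbors.
    padded = []
    for i, b in enumerate(blocks):
        left = blocks[i - 1][-1] if i != 0 else None             # ghost from left rank
        right = blocks[i + 1][0] if i + 1 != len(blocks) else None  # ghost from right rank
        padded.append((left, b, right))
    return padded

blocks = [[1, 2], [3, 4], [5, 6]]
result = halo_exchange(blocks)
# result[1] is (2, [3, 4], 5): rank 1 received ghost values 2 and 5
```
</p>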
        <p>For reverse-mode automatic differentiation, matrix transposition is an operator that is common in gradient back-propagation. For example, assume the forward computation is (x is the input, y is the output, and A is a matrix)
y = Ax   (2)
Given a loss function L(y), the gradient back-propagation calculates
∂L(y)/∂x = (∂L(y)/∂y) A
Here ∂L(y)/∂y is a row vector, and therefore computing (∂L(y)/∂x)ᵀ = Aᵀ (∂L(y)/∂y)ᵀ requires a matrix-vector multiplication, where the matrix is Aᵀ.</p>
        <p>The transposition of a distributed sparse matrix is
implemented in three steps (Fig. 3):
1. The submatrix owned by each MPI processor is split into
subblocks. Meta information (e.g., number of nonzeros in
each block) is collected.
2. Each block B exchanges the meta information with the
target block, where BT should be placed.
3. Each subblock is transposed and the data are transferred
to the target block.</p>
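<p>The three steps can be simulated for two ranks as follows (plain Python with small dense blocks for clarity; the actual ADCME MPI implementation operates on distributed CSR data and exchanges nonzero counts as the meta information).</p>
<p>
```python
def transpose(block):
    # Local transpose of a dense block stored as a list of rows.
    return [list(col) for col in zip(*block)]

# Rank 0 owns rows 0-1 and rank 1 owns rows 2-3 of a 4x4 matrix A.
rank0 = [[1, 2, 3, 4], [5, 6, 7, 8]]
rank1 = [[9, 10, 11, 12], [13, 14, 15, 16]]

# Step 1: split each owned row block into column sub-blocks.
r0_left, r0_right = [r[:2] for r in rank0], [r[2:] for r in rank0]
r1_left, r1_right = [r[:2] for r in rank1], [r[2:] for r in rank1]

# Steps 2-3: transpose each sub-block and place it on the rank that owns
# the corresponding rows of A^T (off-diagonal blocks swap owners).
new_rank0 = [a + b for a, b in zip(transpose(r0_left), transpose(r1_left))]
new_rank1 = [a + b for a, b in zip(transpose(r0_right), transpose(r1_right))]
# new_rank0 now holds rows 0-1 of A^T and new_rank1 holds rows 2-3
```
</p>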
      </sec>
      <sec id="sec-3-3">
        <title>Distributed Optimization</title>
        <p>In general, the objective function of our problem can be written as a sum of local objective functions
min_θ L̃(θ) = Σ_{i=1}^{N} f_i(θ)
where f_i is the local objective function and N is the number of processors.</p>
        <p>Despite the many existing distributed optimization algorithms, in this work we adopt a simple approach: aggregating the gradients ∇_θ f_i(θ) and updating θ on the root processor. Fig. 4 shows how we can convert an existing optimizer to an MPI-enabled optimizer. The basic idea is to let the root processor notify worker processors whether to compute the loss function or the gradient. Then the root processor and workers collaborate on executing the same routines, thus ensuring the correctness of collective MPI calls.</p>
        <p>Master:</p>
        <preformat>
for k = 1, 2, 3, ...
    flag = COMPUTE_OBJ
    mpi_sync!(flag)
    f = compute_objective_function(x)
    ...
    flag = COMPUTE_GRAD
    mpi_sync!(flag)
    dx = compute_gradient(x)
    ...
    x = x - alpha * dx
end
flag = OPTIMIZATION_STOP
mpi_sync!(flag)
        </preformat>
        <p>Worker:</p>
        <preformat>
while true
    mpi_sync!(flag)
    if (flag == COMPUTE_OBJ)
        compute_objective_function(x)
    elseif (flag == COMPUTE_GRAD)
        compute_gradient(x)
    else
        break
    end
end
        </preformat>
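<p>The flag-driven master/worker protocol can be mimicked in a single process by driving the worker routines with the synchronized flag (plain Python; the local objective f_i(x) = x² and the step size are made-up stand-ins for the collective routines in the figure).</p>
<p>
```python
# Simulated flag-driven optimization: the root chooses the action and
# every worker executes the same routine, mirroring mpi_sync!.
COMPUTE_OBJ, COMPUTE_GRAD, OPTIMIZATION_STOP = 0, 1, 2

def local_objective(i, x):
    return x ** 2          # worker i's share f_i(x) (made-up example)

def local_gradient(i, x):
    return 2 * x

def optimize(x, alpha, steps, nworkers):
    for _ in range(steps):
        # flag = COMPUTE_OBJ: every rank runs the objective routine
        obj = sum(local_objective(i, x) for i in range(nworkers))
        # flag = COMPUTE_GRAD: gradients are aggregated on the root
        grad = sum(local_gradient(i, x) for i in range(nworkers))
        x = x - alpha * grad   # update happens on the root
    # flag = OPTIMIZATION_STOP: workers break out of their loop
    return x

x_final = optimize(1.0, 0.1, 10, 2)
# total gradient is 4x, so each step scales x by 0.6; x_final is about 0.006
```
</p>
<p>Because both the root and the workers execute the same objective/gradient routines for a given flag, any collective MPI calls inside those routines are matched across all ranks.</p>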
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Numerical Benchmarks</title>
      <p>As a demonstration, we present a benchmark result with
ADCME MPI. The example shows that the overhead
introduced by ADCME is very small compared to the actual
computation.</p>
      <p>The governing equation is given by the Poisson equation
∇·(κ(x)∇u(x)) = f(x)   (3)
with u(x) = 0 on the boundary. Here f(x) ≡ 1, and κ(x) is approximated by a deep neural network
κ(x) = N_θ(x)
where θ is the vector of neural network weights and biases. The equation is discretized using the finite difference method on a uniform grid, and the discretization leads to a linear system
A(θ) u = f   (4)
where u is the solution vector and f is the source vector. Note that A is a sparse matrix and its entries depend on θ.</p>
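<p>For intuition, here is a one-dimensional analogue of this discretization (NumPy; the two-point flux stencil, the domain (0, 1), and the hand-coded positive function standing in for N_θ are illustrative assumptions, not the paper's exact setup).</p>
<p>
```python
import numpy as np

def kappa(x):
    # Stand-in for the DNN N_theta(x): any smooth positive function.
    return 1.0 + 0.5 * np.tanh(x)

def assemble(n):
    # Finite difference discretization of -(kappa u')' = 1 on (0, 1)
    # with u(0) = u(1) = 0; the entries of A depend on kappa.
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    km = kappa(x - h / 2)          # kappa at left cell interfaces
    kp = kappa(x + h / 2)          # kappa at right cell interfaces
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = (km[i] + kp[i]) / h ** 2
        if i != 0:
            A[i, i - 1] = -km[i] / h ** 2
        if i + 1 != n:
            A[i, i + 1] = -kp[i] / h ** 2
    return A, np.ones(n)

A, f = assemble(50)
u = np.linalg.solve(A, f)
# u is positive in the interior and vanishes toward the boundary
```
</p>
<p>In the distributed setting, the rows of A(θ) are partitioned across MPI ranks and the dense solve is replaced by an AMG-preconditioned iterative solver.</p>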
      <p>
        The sparse matrix is constructed using the differentiable
halo exchange operator and stored as a distributed CSR
matrix. Equation (4) is solved using the algebraic multigrid
method in Hypre
        <xref ref-type="bibr" rid="ref13">(Falgout and Yang 2002)</xref>
        . During the gradient back-propagation, we need to solve a linear system with the coefficient matrix Aᵀ. This matrix is obtained using the technique described in the previous section.
      </p>
      <p>In the strong scaling experiments, we consider a fixed problem size of 1,800 × 1,800 (mesh size), which implies the matrix size is around 3.24 million × 3.24 million. In the weak scaling experiments, each MPI processor owns a 300 × 300 block. For example, a problem with 3,600 processors has a problem size of 90,000 × 3,600 ≈ 0.3 billion unknowns.</p>
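<p>The weak-scaling problem sizes quoted above follow from simple arithmetic: each rank owns a fixed block, so the global number of unknowns grows linearly with the number of ranks (a trivial helper, written out for checking).</p>
<p>
```python
def global_unknowns(nprocs, block_side):
    # Each MPI rank owns a block_side x block_side block of the mesh.
    per_rank = block_side * block_side   # 300 x 300 = 90,000 per rank
    return nprocs * per_rank

n = global_unknowns(3600, 300)
# 3,600 ranks x 90,000 unknowns per rank = 324,000,000 (about 0.3 billion)
```
</p>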
      <p>We first consider the weak scaling case. We consider two cases: each MPI rank has 1 core or 4 cores. In the latter case, the TensorFlow backend enjoys the benefit of inter-parallelism, where independent operators in the computational graph can be executed simultaneously. However, 4 cores do not guarantee a 4-times acceleration; the performance depends on the availability of independent tasks, scheduling conflicts, resource contention, etc. Fig. 5 shows the runtime for the forward computation as well as the gradient back-propagation. There are two important observations:
1. By using more cores per processor, the runtime is reduced significantly. For example, the runtime for the backward pass is reduced from around 30 seconds to around 10 seconds by switching from 1 core to 4 cores per processor.
2. The runtime for the backward pass is typically less than twice the forward computation. Although the backward pass requires solving two linear systems (one of which also appears in the forward computation), the AMG (algebraic multigrid) linear solver in the back-propagation may converge faster, and therefore may cost less than during the forward pass.</p>
      <p>Additionally, we show the overhead in Fig. 6, which is
defined as the difference between total runtime and Hypre
linear solver time, for both the forward and backward
calculation.</p>
      <p>We see that the overhead is quite small compared to the
total time, especially when the problem size is large. This
indicates that the ADCME MPI implementation is very
effective.</p>
      <p>In Fig. 7, we consider the strong scaling. In this case, we fix the whole problem size and split the mesh onto different MPI processors. Fig. 7 shows the runtime for the forward computation and the gradient back-propagation. We can reduce the runtime by more than 20 times for the expensive gradient back-propagation by utilizing more than 100 MPI processors. Fig. 8 shows the speedup and efficiency. We can see that 4 cores have a smaller runtime compared to 1 core2.</p>
      <p>
        2Finding a scaling sweet spot for a mixed programming model (MPI and OpenMP) of Hypre AMG solvers on multicore clusters is challenging
        <xref ref-type="bibr" rid="ref2">(Baker, Schulz, and Yang 2010)</xref>
        . The intra- and inter-parallelism of the TensorFlow backend also add difficulties to finding a scaling strategy.
      </p>
      <p>[Figures 5-8: runtime in seconds for the forward and backward computations versus the number of processors, together with the speedup and efficiency of the strong scaling experiments.]</p>
      <p>We presented the functionalities of ADCME MPI. Our benchmark results show that the overhead introduced by ADCME for distributed computing programs is very small compared with the computing time. The ADCME MPI distributed computing solution is quite flexible, allowing users to use custom parallel algorithms or libraries at their discretion. With the advent of experimental techniques that enable gathering large amounts of data, deep neural network based data-driven modeling will become an essential tool for scientific discovery. Growing datasets and problem sizes add another level of challenge. Therefore, ongoing work on distributed computing in machine learning for computational engineering remains an important and promising direction for the foreseeable future.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work is supported by the Applied Mathematics
Program within the Department of Energy (DOE) Office of
Advanced Scientific Computing Research (ASCR), through the
Collaboratory on Mathematics and Physics-Informed
Learning Machines for Multiscale and Multiphysics Problems
Research Center (DE-SC0019453).</p>
      <p>Xu, K.; and Darve, E. 2019. The neural network approach
to inverse problems in differential equations. arXiv preprint
arXiv:1901.07758 .</p>
      <p>Xu, K.; and Darve, E. 2020a. ADCME: Learning Spatially-varying Physical Fields using Deep Neural Networks.
Xu, K.; and Darve, E. 2020b. Physics constrained learning for data-driven inverse modeling from sparse observations. arXiv preprint arXiv:2002.10521.</p>
      <p>Xu, K.; Huang, D. Z.; and Darve, E. 2020. Learning
constitutive relations using symmetric positive definite neural
networks. arXiv preprint arXiv:2004.00265 .</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          Abadi, M.; et al. 2016.
          <article-title>Tensorflow: A system for large-scale machine learning</article-title>
          .
          <source>In 12th fUSENIXg symposium on operating systems design and implementation (fOSDIg 16)</source>
          ,
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>A. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schulz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>U. M.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>On the performance of an algebraic multigrid solver on multicore clusters</article-title>
          .
          <source>In International Conference on High Performance Computing for Computational Science</source>
          ,
          <volume>102</volume>
          -
          <fpage>115</fpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Baydin</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pearlmutter</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Radul</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Siskind</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Automatic differentiation in machine learning: a survey</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5595</fpage>
          -
          <lpage>5637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bekkerman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; Bilenko,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Langford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <year>2011</year>
          .
          <article-title>Scaling up machine learning: Parallel and distributed approaches</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name><surname>Bezanson</surname>, <given-names>J.</given-names></string-name>; <string-name><surname>Edelman</surname>, <given-names>A.</given-names></string-name>; <string-name><surname>Karpinski</surname>, <given-names>S.</given-names></string-name>; and <string-name><surname>Shah</surname>, <given-names>V. B.</given-names></string-name>
          <year>2017</year>
          .
          <article-title>Julia: A fresh approach to numerical computing</article-title>
          .
          <source>SIAM Review</source>
          <volume>59</volume>
          (
          <issue>1</issue>
          ):
          <fpage>65</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Bianco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>An interface for halo exchange pattern</article-title>
          . www.prace-ri.eu/IMG/pdf/wp86.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Large-scale machine learning with stochastic gradient descent</article-title>
          .
          <source>In Proceedings of COMPSTAT'</source>
          <year>2010</year>
          ,
          <fpage>177</fpage>
          -
          <lpage>186</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.-C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.-L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>H.-Y.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Efficient and robust parallel DNN training through model parallelism on multi-GPU platform</article-title>
          . arXiv preprint arXiv:1809.02839.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>B. N.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>A duality between forward and adjoint MPI communication routines</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Douglas</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Langer</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>A tutorial on elliptic PDE solvers and their parallelization</article-title>
          .
          <source>SIAM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Falgout</surname>
            ,
            <given-names>R. D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>U. M.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>hypre: A library of high performance preconditioners</article-title>
          .
          <source>In International Conference on Computational Science</source>
          ,
          <volume>632</volume>
          -
          <fpage>641</fpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pathak</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Darve</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Solving Inverse Problems in Steady State Navier-Stokes Equations using Deep Neural Networks</article-title>
          . arXiv preprint arXiv:2008.13074.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Gabriel</surname>
          </string-name>
          , E.;
          <string-name>
            <surname>Fagg</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bosilca</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Angskun,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Dongarra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            ;
            <surname>Squyres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            ;
            <surname>Sahay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ;
            <surname>Kambadur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ;
            <surname>Lumsdaine</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          ; et al.
          <year>2004</year>
          .
          <article-title>Open MPI: Goals, concept, and design of a next generation MPI implementation</article-title>
          . In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting,
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Griebel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and Zumbusch,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <year>1999</year>
          .
          <article-title>Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves</article-title>
          .
          <source>Parallel Computing</source>
          <volume>25</volume>
          (
          <issue>7</issue>
          ):
          <fpage>827</fpage>
          -
          <lpage>843</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Gropp</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; Thakur, R.; and
          <string-name>
            <surname>Lusk</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Using MPI-2: advanced features of the message passing interface</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Hewett</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Grady II</surname>
            ,
            <given-names>T. J.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>A Linear Algebraic Approach to Model Parallelism in Deep Learning</article-title>
          . arXiv preprint arXiv:2006.03108.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>D. Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Farhat</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Darve</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Learning constitutive relations from indirect observations using deep neural networks</article-title>
          .
          <source>Journal of Computational Physics</source>
          <fpage>109491</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Machine learning: Trends, perspectives, and prospects</article-title>
          .
          <source>Science</source>
          <volume>349</volume>
          (
          <issue>6245</issue>
          ):
          <fpage>255</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <year>2014</year>
          .
          <article-title>Scaling distributed machine learning with the parameter server</article-title>
          .
          <source>In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14)</source>
          ,
          <fpage>583</fpage>
          -
          <lpage>598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Andersen</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Parameter server for distributed machine learning</article-title>
          .
          <source>In Big Learning NIPS Workshop</source>
          , volume
          <volume>6</volume>
          ,
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Karniadakis</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>PPINN: Parareal physics-informed neural network for time-dependent PDEs</article-title>
          .
          <source>Computer Methods in Applied Mechanics and Engineering</source>
          <volume>370</volume>
          :
          <fpage>113250</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Notay</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>An efficient parallel discrete PDE solver</article-title>
          .
          <source>Parallel Computing</source>
          <volume>21</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1725</fpage>
          -
          <lpage>1748</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>