ADCME MPI: Distributed Machine Learning for Computational Engineering

Kailai Xu,¹ Eric Darve¹,²  {kailaix, darve}@stanford.edu
¹ Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA
² Mechanical Engineering, Stanford University, Stanford, CA 94305, USA

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We propose a framework for training deep neural networks (DNNs) that are coupled with partial differential equations (PDEs) in a parallel computing environment. Unlike most distributed computing frameworks for DNNs, our focus is to parallelize both the numerical solvers and the DNNs in forward and adjoint computations. Our parallel computing model views data communication as a node in the computational graph for numerical simulations. The advantage of our model is that data communication and computing are cleanly separated, which enables better flexibility, modularity, and testability of the software. We demonstrate our approach on a large-scale problem and show that we can achieve substantial acceleration by using parallel numerical PDE solvers while training DNNs that are coupled with PDEs.

Introduction

Deep neural networks (DNNs) have been demonstrated to be very effective for solving inverse problems in computational engineering (Raissi, Perdikaris, and Karniadakis 2019; Meng et al. 2020; Pakravan et al. 2020). In our previous work (Xu, Huang, and Darve 2020; Xu and Darve 2019; Huang et al. 2020; Fan et al. 2020), we successfully combined numerical solvers and deep neural networks for data-driven inverse modeling. Mathematically, we consider an implicit model F(f, u) = 0, where f is an unknown function that we approximate with a DNN. We denote the DNN by N_θ, where θ collects the neural network weights and biases; u is a function that depends on f through F(f, u) = 0. We are given some (partial) observations u_obs of the function u, which are used to optimize θ and minimize the difference between f and N_θ. The optimization problem is formulated as a PDE-constrained optimization problem

    \min_\theta\; L(u) \qquad (\text{u is indirectly a function of } \theta)
    \text{such that}\;\; F(N_\theta, u) = 0                                  (1)

L(u) is a loss function that measures the discrepancy between hypothetical and actual observations. For example, in this paper we use the square loss function L(u) = ‖u − u_obs‖₂². One advantage of such a formulation is that the known physics, such as the physical laws described by PDEs, is preserved to the largest extent and solved with well-developed and efficient numerical solvers. Meanwhile, we can leverage the approximation power of DNNs.

To solve this optimization problem, one can first solve the physics constraint F(N_θ, u) = 0 in Equation (1) numerically and then plug the solution u(θ) into L(u). This leads to the unconstrained optimization problem

    \min_\theta\; \tilde{L}(\theta) = L(u(\theta)).

We developed a Julia (Bezanson et al. 2017) library, ADCME (Xu and Darve 2020a), with a TensorFlow (Abadi et al. 2016) automatic differentiation backend to solve problems of this type. ADCME expresses both the numerical solver and the DNNs as computational graphs. Therefore, the gradient ∇_θ L̃(θ) can be calculated automatically by back-propagating gradients through both the numerical solvers and the DNNs (for details on how gradient back-propagation works for numerical solvers in ADCME, we refer readers to Xu and Darve (2020b)). In this paper, we use "operator" and "node" interchangeably to refer to a node in the computational graph, i.e., a function that takes incoming edges (intermediate data) as inputs and produces outgoing edges (intermediate data) as outputs.
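Although the paper does not write it out, the reduced gradient that this back-propagation assembles can be expressed in closed form via the implicit function theorem. The identity below is standard and is included here only for orientation (it is not quoted from the paper); it assumes F is continuously differentiable with ∂F/∂u invertible at the solution u(θ):

    \nabla_\theta \tilde{L}(\theta)
      = -\left(\frac{\partial N_\theta}{\partial \theta}\right)^T
         \left(\frac{\partial F}{\partial N_\theta}\right)^T
         \left(\frac{\partial F}{\partial u}\right)^{-T} \nabla_u L(u),
      \qquad \text{where } F(N_\theta, u(\theta)) = 0.

The transposed solve with (∂F/∂u)^T is the step that later motivates the distributed sparse matrix transposition described in the Methodology section, and it is why the benchmark below must solve a linear system with A^T during back-propagation.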
However, one challenge with this approach is that for large-scale problems, the memory and computational costs of the numerical solver are prohibitive. The de facto standard for solving such large-scale problems on modern distributed-memory high performance computing (HPC) architectures is the Message Passing Interface (MPI) (Gabriel et al. 2004; Gropp, Thakur, and Lusk 1999). The TensorFlow backend used by ADCME was originally designed for deep learning/machine learning. Although there is much work on extending TensorFlow to distributed training of machine learning models, some key capabilities for solving scientific computing problems, such as distributed linear algebra and domain decomposition, are still lacking. This paper is about incorporating MPI functionality into ADCME to achieve scalability and flexibility on distributed-memory machines.

The main idea is to parallelize numerical solvers by splitting the mesh or matrices onto different MPI ranks (or processors). Data communication nodes are then inserted into the computational graph. Because the DNNs in our computational engineering applications are typically small, they are duplicated on each processor. Each computational graph includes a set of "communication" nodes (Fig. 1, top), which are absent in a single-processor computational graph. These operators invoke MPI calls and are in charge of data communication between different compute nodes. During the gradient back-propagation, we need to reverse the data-flow direction and the operation of the data communication operators (Wang and Pothen 2015; Utke et al. 2009) (Fig. 1, bottom).

Figure 1: Paradigm for distributed machine learning for computational engineering. The data communication operations are treated as nodes in the computational graph (top: forward computation; bottom: gradient back-propagation via automatic differentiation or physics constrained learning, in which send ↔ receive and broadcast ↔ reduce are exchanged). They are separate from the other computing operators, which makes it easy to reuse existing serial implementations of the numerical PDE solver.
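To make the idea of "reversing" a communication node concrete, the sketch below pairs a broadcast in the forward pass with a sum-reduction of the incoming gradients in the backward pass. It is a generic illustration written against the MPI.jl package rather than ADCME's own mpi_* wrappers (introduced in the Methodology section); the names bcast_forward and bcast_adjoint are ours, and exact MPI.jl call signatures vary between package versions. Treat it as a sketch of the pattern, not of ADCME's implementation.

    using MPI

    # Forward pass: every rank ends up with the root's copy of x.
    function bcast_forward(x::Vector{Float64}, root::Int, comm)
        y = copy(x)
        MPI.Bcast!(y, root, comm)      # y == x_root on every rank afterwards
        return y
    end

    # Backward pass: x_root influences the loss once per rank, so the adjoint
    # sums the per-rank gradients dL/dy back onto the root. (A Reduce to the
    # root would be the economical choice; Allreduce is used here only because
    # its signature has been stable across MPI.jl versions.)
    function bcast_adjoint(dy::Vector{Float64}, root::Int, comm)
        g = MPI.Allreduce(dy, +, comm)
        return MPI.Comm_rank(comm) == root ? g : zero(dy)
    end

    # Example driver: run with, e.g., `mpiexec -n 4 julia bcast_adjoint.jl`.
    MPI.Init()
    comm = MPI.COMM_WORLD
    x  = fill(1.0, 3)                  # only the root's values matter
    y  = bcast_forward(x, 0, comm)
    dy = ones(3)                       # pretend dL/dy = 1 on every rank
    dx = bcast_adjoint(dy, 0, comm)    # on the root: dx == [nprocs, nprocs, nprocs]
    MPI.Finalize()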
Distributed Computing Models

There is much existing work and software for distributed computing with deep neural networks (Bekkerman, Bilenko, and Langford 2011; Jordan and Mitchell 2015) and with numerical PDEs (Griebel and Zumbusch 1999; Notay 1995; Douglas, Haase, and Langer 2003). The two domains have quite different distributed computing models due to the distinct features of their target applications. ADCME MPI embraces a hybrid model that is suitable for inverse modeling in computational engineering, because our method for solving Equation (1) requires a combination of the two models.

In deep learning, one major challenge is that datasets are too large to fit into memory. Therefore, both the datasets and the computational load are distributed onto different machines, and mini-batch optimization algorithms, such as stochastic gradient descent (Bottou 2010), are used. Each processor calculates predictions and gradients, which are aggregated on one or more processors. To scale out further in a limited-bandwidth environment, parameter servers (Li et al. 2013, 2014), where each server stores a part of the parameters, are used. There is also extensive work on model parallelism, which parallelizes the computation by splitting the DNN itself into multiple parts (Chen, Yang, and Cheng 2018; Hewett and Grady II 2020).

The parallel computing model in computational engineering is quite different from that of deep learning. Computational engineering applications feature data communication across neighboring points in a mesh (domain decomposition) or across different parts of a matrix. In terms of the computational graph, this pattern implies many more communications beyond the reduction of gradients at the end. This motivates us to design new distributed computing models for computational engineering inverse modeling applications.

Methodology

ADCME MPI aims to provide a modular, efficient, and flexible implementation for distributed computing in inverse modeling. Conceptually, we treat data communication operations as nodes in the computational graph: they are similar to computational nodes (e.g., a linear solver), except that their responsibility is to invoke MPI calls and to back-propagate gradients in reverse-mode automatic differentiation (Baydin et al. 2017). This solution provides an elegant enhancement to the ADCME library: to convert a single-processor program into a multi-processor one, users only need to insert data communication nodes where needed, and most of the original code is unchanged. In this section, we briefly describe our contributions in ADCME MPI to extend its distributed computing capabilities for coupling DNNs and PDE solvers.

MPI APIs

ADCME MPI provides a set of commonly used MPI primitives, such as mpi_bcast, mpi_gather, mpi_send, etc. These operators are wrappers for the standard MPI APIs. However, these operators are also "differentiable," in the sense that they can handle gradient back-propagation. The gradient back-propagation functionality uses the fact that there is a one-to-one correspondence between forward and backward MPI calls. For example, the forward "send" corresponds to the backward "receive" (Cheng 2006; Towara, Schanen, and Naumann 2015).

Halo Exchange

To enable communication between adjacent patches in a mesh, ADCME provides efficient implementations of data communication operators for halo exchange patterns (Fig. 2). Halo exchange patterns are very common in scientific computing because numerical solvers often involve communication between neighboring points (Bianco 2014). We use a nonblocking send/receive strategy. In the gradient back-propagation phase, the order of sends and receives is reversed.

Figure 2: Left: domain decomposition and halo exchange pattern. Two adjacent patches exchange boundary information. Right: using nonblocking sends/receives (Isend/Irecv followed by Wait) instead of blocking sends/receives avoids deadlocks.
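A minimal sketch of the forward halo exchange for a 1D decomposition is given below. It is again written against MPI.jl with our own helper name halo_exchange (ADCME's actual operator is not shown), it assumes the older positional MPI.jl signatures (recent releases take the communicator positionally with source/destination and tag as keywords), and boundary ranks are expected to pass MPI.PROC_NULL, for which the sends and receives complete as no-ops. In the adjoint pass, the pattern is reversed as described above: the adjoints of the ghost values are sent back and accumulated into the gradients of the neighbors' interior values.

    using MPI

    # Exchange one layer of ghost values with the left and right neighbors.
    # `left` and `right` are neighbor ranks (or MPI.PROC_NULL on the boundary).
    function halo_exchange(u_first::Float64, u_last::Float64, left, right, comm)
        send_left, send_right = [u_first], [u_last]
        recv_left, recv_right = [0.0], [0.0]
        reqs = MPI.Request[]
        # Post receives and sends without blocking: two blocking sends facing
        # each other could deadlock (Fig. 2, right); nonblocking calls cannot.
        push!(reqs, MPI.Irecv!(recv_left,  left,  0, comm))   # from left,  tag 0
        push!(reqs, MPI.Irecv!(recv_right, right, 1, comm))   # from right, tag 1
        push!(reqs, MPI.Isend(send_left,   left,  1, comm))   # to left,    tag 1
        push!(reqs, MPI.Isend(send_right,  right, 0, comm))   # to right,   tag 0
        MPI.Waitall!(reqs)
        return recv_left[1], recv_right[1]                    # new ghost values
    end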
Matrix Transposition

Sparse matrices are very important tools in computational engineering. ADCME MPI stores large sparse matrices in CSR format, and each MPI processor stores a portion of the rows with contiguous row indices.

In reverse-mode automatic differentiation, matrix transposition is a common operation in gradient back-propagation. For example, assume the forward computation is (x is the input, y is the output, and A is a matrix)

    y = Ax.    (2)

Given a loss function L(y), the gradient back-propagation calculates

    \frac{\partial L(y(x))}{\partial x} = \frac{\partial L(y)}{\partial y}\,\frac{\partial y(x)}{\partial x} = \frac{\partial L(y)}{\partial y}\, A.

Here ∂L(y)/∂y is a row vector, and therefore

    \left(\frac{\partial L(y(x))}{\partial x}\right)^T = A^T \left(\frac{\partial L(y)}{\partial y}\right)^T

requires a matrix-vector multiplication where the matrix is A^T.

The transposition of a distributed sparse matrix is implemented in three steps (Fig. 3):

1. The submatrix owned by each MPI processor is split into subblocks, and meta information (e.g., the number of nonzeros in each block) is collected.

2. Each block B exchanges the meta information with the target block, i.e., the block where B^T should be placed.

3. Each subblock is transposed and its data are transferred to the target block.

Figure 3: Transposition of a distributed sparse matrix. Each block exchanges its nonzero entries with the corresponding transposed block. The original and resulting matrices are both stored in CSR format.

Distributed Optimization

In general, the objective function of our problem can be written as a sum of local objective functions

    \min_\theta\; \tilde{L}(\theta) = \sum_{i=1}^{N} f_i(\theta),

where f_i is the local objective function and N is the number of processors. Despite the many existing distributed optimization algorithms, in this work we adopt a simple approach: aggregating the gradients ∇_θ f_i(θ) and updating θ on the root processor. Fig. 4 shows how we convert an existing optimizer into an MPI-enabled optimizer. The basic idea is to let the root processor notify the worker processors whether to compute the loss function or the gradient. The root processor and the workers then execute the same routines, which ensures the correctness of the collective MPI calls.

Figure 4: Refactoring a serial optimizer into an MPI-based optimizer in our framework. The master (root) processor drives the optimization loop and broadcasts a flag that tells the workers which routine to execute:

    # Master (root processor)
    for k = 1, 2, 3, ...
        flag = COMPUTE_OBJ
        mpi_sync!(flag)
        f = compute_objective_function(x)
        ...
        flag = COMPUTE_GRAD
        mpi_sync!(flag)
        dx = compute_gradient(x)
        ...
        x = x - alpha * dx
    end
    flag = OPTIMIZATION_STOP
    mpi_sync!(flag)

    # Worker
    while true
        mpi_sync!(flag)
        if (flag == COMPUTE_OBJ)
            compute_objective_function(x)
        elseif (flag == COMPUTE_GRAD)
            compute_gradient(x)
        else
            break
        end
    end

Numerical Benchmarks

As a demonstration, we present a benchmark with ADCME MPI. The example shows that the overhead introduced by ADCME is very small compared to the actual computation.

The governing equation is Poisson's equation

    \nabla \cdot (\kappa(x) \nabla u(x)) = f(x), \quad x \in \Omega,
    u(x) = 0, \quad x \in \partial\Omega.    (3)

Here f(x) ≡ 1, and κ(x) is approximated by a deep neural network,

    \kappa(x) = N_\theta(x),

where θ denotes the neural network weights and biases. The equation is discretized using the finite difference method on a uniform grid, and the discretization leads to a linear system

    A(\theta)\, u \approx f,    (4)

where u is the solution vector and f is the source vector. Note that A is a sparse matrix whose entries depend on θ. The sparse matrix is constructed using the differentiable halo exchange operator and stored as a distributed CSR matrix.
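For readers who want to reproduce the setup on a single process, the following sketch assembles the analogous serial finite-difference system in 1D (the paper's benchmark is 2D and distributed, and in ADCME the assembly would be written with differentiable operators so that the dependence of A on θ is tracked). The closure passed as kappa is a stand-in for the DNN N_θ; the discretization assumed here is the standard second-order stencil with κ evaluated at cell midpoints.

    using SparseArrays

    # Assemble A u = f for d/dx(κ(x) du/dx) = f on (0, 1) with u(0) = u(1) = 0,
    # on a uniform grid with n interior points. `kappa` stands in for N_θ.
    function assemble_poisson_1d(kappa, n::Int)
        h  = 1.0 / (n + 1)
        x  = collect(1:n) .* h              # interior grid points
        km = kappa.(x .- h / 2)             # κ at left midpoints  x_{i-1/2}
        kp = kappa.(x .+ h / 2)             # κ at right midpoints x_{i+1/2}
        main  = -(km .+ kp) ./ h^2          # coefficient of u_i
        upper = kp[1:end-1] ./ h^2          # coefficient of u_{i+1} in row i
        lower = km[2:end]  ./ h^2           # coefficient of u_{i-1} in row i
        A = spdiagm(-1 => lower, 0 => main, 1 => upper)
        f = ones(n)                         # f(x) ≡ 1, as in the benchmark
        return A, f, x
    end

    A, f, x = assemble_poisson_1d(x -> 1.0 + 0.5 * sin(2π * x), 999)
    u = A \ f    # the distributed version solves this system with Hypre's AMG instead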
Equation (4) is solved using the algebraic multigrid (AMG) method in Hypre (Falgout and Yang 2002). During the gradient back-propagation, we need to solve a linear system with the coefficient matrix A^T. This matrix is obtained using the transposition technique described in the last section.

In the strong scaling experiments, we consider a fixed problem size of 1,800 × 1,800 (mesh size), which implies that the matrix size is around 3.2 million × 3.2 million. In the weak scaling experiments, each MPI processor owns a 300 × 300 block; for example, a run with 3,600 processors has a total problem size of 90,000 × 3,600 ≈ 0.3 billion unknowns.

We first consider the weak scaling case with two configurations: each MPI rank has either 1 core or 4 cores. In the latter case, the TensorFlow backend enjoys the benefit of inter-op parallelism, where independent operators in the computational graph can be executed simultaneously. However, 4 cores do not guarantee a 4x acceleration; the performance depends on the availability of independent tasks, scheduling conflicts, resource contention, etc. Fig. 5 shows the runtime for the forward computation as well as the gradient back-propagation. There are two important observations:

1. By using more cores per processor, the runtime is reduced significantly. For example, the runtime for the backward pass is reduced from around 30 seconds to around 10 seconds by switching from 1 core to 4 cores per processor.

2. The runtime for the backward pass is typically less than twice that of the forward computation. Although the backward pass requires solving two linear systems (one of which also appears in the forward computation), the AMG linear solver in the back-propagation may converge faster, and therefore may cost less than it does during the forward pass.

Figure 5: Weak scaling runtime for the forward computation and the gradient back-propagation; each MPI processor solves a local problem of the same size. Top: each MPI processor has 1 core; bottom: each MPI processor has 4 cores. We see a jump near 64 cores because this is where inter-node network communication starts to take place (recall that each of our CPUs has 32 cores and each node has two CPUs).

Additionally, Fig. 6 shows the overhead, defined as the difference between the total runtime and the Hypre linear solver time, for both the forward and the backward calculation. We see that the overhead is quite small compared to the total time, especially when the problem size is large. This indicates that the ADCME MPI implementation is very effective.

Figure 6: Overhead of ADCME relative to the Hypre linear solve. The top and bottom bar plots correspond to the top and bottom plots in Fig. 5, respectively.
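The speedup and efficiency reported for the scaling study below (Fig. 8) are not defined explicitly in the text. Assuming the usual conventions, with T(p) the runtime on p processors and p_0 the smallest processor count in the study, they are

    S(p) = \frac{T(p_0)}{T(p)}, \qquad E(p) = \frac{p_0\, S(p)}{p},

so that perfect strong scaling corresponds to S(p) = p/p_0 and E(p) = 1. For the weak scaling runs in Fig. 5, where the work per processor is fixed, the corresponding efficiency is simply T(p_0)/T(p).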
In Fig. 7, we consider strong scaling: we fix the total problem size and split the mesh onto different numbers of MPI processors. Fig. 7 shows the runtime for the forward computation and the gradient back-propagation. We can reduce the runtime of the expensive gradient back-propagation by more than a factor of 20 by using more than 100 MPI processors. Fig. 8 shows the speedup and efficiency. We can see that runs with 4 cores per processor have smaller runtimes than runs with 1 core per processor. We note that finding a scaling sweet spot for a mixed programming model (MPI and OpenMP) of Hypre AMG solvers on multicore clusters is challenging (Baker, Schulz, and Yang 2010), and the intra- and inter-op parallelism of the TensorFlow backend adds further difficulty to finding a scaling strategy.

Figure 7: Strong scaling runtime for the forward computation and the gradient back-propagation: the total problem size is fixed and we increase the number of MPI processors (each processor has 1 or 4 cores). The plots show the total time and the Hypre linear solver time for both configurations.

Figure 8: Speedup and efficiency for the parallel solution of Poisson's equation (forward and backward passes, with 1 or 4 cores per processor).

Conclusion

We presented the functionalities of ADCME MPI. Our benchmark results show that the overhead introduced by ADCME for distributed computing programs is very small compared with the computing time. The ADCME MPI distributed computing solution is quite flexible, allowing users to plug in custom parallel algorithms or libraries at their discretion. With the advent of experimental techniques that enable gathering large amounts of data, deep-neural-network-based data-driven modeling will become an essential tool for scientific discovery. The growing dataset and problem sizes add another level of challenges. Therefore, ongoing work on distributed computing in machine learning for computational engineering remains an important and promising direction for the foreseeable future.

Acknowledgements

This work is supported by the Applied Mathematics Program within the Department of Energy (DOE) Office of Advanced Scientific Computing Research (ASCR), through the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems Research Center (DE-SC0019453).

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Baker, A. H.; Schulz, M.; and Yang, U. M. 2010. On the performance of an algebraic multigrid solver on multicore clusters. In International Conference on High Performance Computing for Computational Science, 102–115. Springer.

Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2017. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research 18(1): 5595–5637.

Bekkerman, R.; Bilenko, M.; and Langford, J. 2011. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.

Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1): 65–98.

Bianco, M. 2014. An interface for halo exchange pattern. www.prace-ri.eu/IMG/pdf/wp86.pdf.

Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer.

Chen, C.-C.; Yang, C.-L.; and Cheng, H.-Y. 2018. Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839.

Cheng, B. N. 2006. A duality between forward and adjoint MPI communication routines.

Douglas, C. C.; Haase, G.; and Langer, U. 2003. A tutorial on elliptic PDE solvers and their parallelization. SIAM.

Falgout, R. D.; and Yang, U. M. 2002. hypre: A library of high performance preconditioners. In International Conference on Computational Science, 632–641. Springer.

Fan, T.; Xu, K.; Pathak, J.; and Darve, E. 2020. Solving Inverse Problems in Steady State Navier-Stokes Equations using Deep Neural Networks. arXiv preprint arXiv:2008.13074.

Gabriel, E.; Fagg, G. E.; Bosilca, G.; Angskun, T.; Dongarra, J. J.; Squyres, J. M.; Sahay, V.; Kambadur, P.; Barrett, B.; Lumsdaine, A.; et al. 2004. Open MPI: Goals, concept, and design of a next generation MPI implementation. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, 97–104. Springer.

Griebel, M.; and Zumbusch, G. 1999. Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves. Parallel Computing 25(7): 827–843.
Gropp, W.; Thakur, R.; and Lusk, E. 1999. Using MPI-2: Advanced features of the message passing interface. MIT Press.

Hewett, R. J.; and Grady II, T. J. 2020. A Linear Algebraic Approach to Model Parallelism in Deep Learning. arXiv preprint arXiv:2006.03108.

Huang, D. Z.; Xu, K.; Farhat, C.; and Darve, E. 2020. Learning constitutive relations from indirect observations using deep neural networks. Journal of Computational Physics 109491.

Jordan, M. I.; and Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science 349(6245): 255–260.

Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 583–598.

Li, M.; Zhou, L.; Yang, Z.; Li, A.; Xia, F.; Andersen, D. G.; and Smola, A. 2013. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, volume 6, 2.

Meng, X.; Li, Z.; Zhang, D.; and Karniadakis, G. E. 2020. PPINN: Parareal physics-informed neural network for time-dependent PDEs. Computer Methods in Applied Mechanics and Engineering 370: 113250.

Notay, Y. 1995. An efficient parallel discrete PDE solver. Parallel Computing 21(11): 1725–1748.

Pakravan, S.; Mistani, P. A.; Aragon-Calvo, M. A.; and Gibou, F. 2020. Solving inverse-PDE problems with physics-aware neural networks. arXiv preprint arXiv:2001.03608.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.

Towara, M.; Schanen, M.; and Naumann, U. 2015. MPI-Parallel Discrete Adjoint OpenFOAM. In ICCS, 19–28.

Utke, J.; Hascoet, L.; Heimbach, P.; Hill, C.; Hovland, P.; and Naumann, U. 2009. Toward adjoinable MPI. In 2009 IEEE International Symposium on Parallel & Distributed Processing, 1–8. IEEE.

Wang, M.; and Pothen, A. 2015. High Order Automatic Differentiation with MPI.
Xu, K.; and Darve, E. 2019. The neural network approach to inverse problems in differential equations. arXiv preprint arXiv:1901.07758.

Xu, K.; and Darve, E. 2020a. ADCME: Learning Spatially-varying Physical Fields using Deep Neural Networks.

Xu, K.; and Darve, E. 2020b. Physics constrained learning for data-driven inverse modeling from sparse observations. arXiv preprint arXiv:2002.10521.

Xu, K.; Huang, D. Z.; and Darve, E. 2020. Learning constitutive relations using symmetric positive definite neural networks. arXiv preprint arXiv:2004.00265.