ADCME MPI: Distributed Machine Learning for Computational Engineering

Kailai Xu,¹ Eric Darve¹,²  {kailaix, darve}@stanford.edu
¹ Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA
² Mechanical Engineering, Stanford University, Stanford, CA 94305, USA

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We propose a framework for training deep neural networks (DNNs) that are coupled with partial differential equations (PDEs) in a parallel computing environment. Unlike most distributed computing frameworks for DNNs, our focus is to parallelize both the numerical solvers and the DNNs in forward and adjoint computations. Our parallel computing model views data communication as a node in the computational graph for numerical simulations. The advantage of our model is that data communication and computing are cleanly separated, which enables better flexibility, modularity, and testability of the software. We demonstrate our approach on a large-scale problem and show that we can achieve substantial acceleration by using parallel numerical PDE solvers while training DNNs that are coupled with PDEs.

Introduction

Deep neural networks (DNNs) have been demonstrated to be very effective for solving inverse problems in computational engineering (Raissi, Perdikaris, and Karniadakis 2019; Meng et al. 2020; Pakravan et al. 2020). In our previous work (Xu, Huang, and Darve 2020; Xu and Darve 2019; Huang et al. 2020; Fan et al. 2020), we successfully combined numerical solvers and deep neural networks for data-driven inverse modeling. Mathematically, we consider an implicit model F(f, u) = 0, where f is an unknown function that we approximate with a DNN. We denote the DNN by N_θ, where θ collects the neural network weights and biases; u is a function that depends on f through F(f, u) = 0. We are given some (partial) observations u_obs of the function u, which are used to optimize θ and minimize the difference between f and N_θ. The optimization problem is formulated as a PDE-constrained optimization problem

    \min_\theta\; L(u) \qquad (\text{u is indirectly a function of } \theta)
    \text{such that}\;\; F(N_\theta, u) = 0                                  (1)

L(u) is a loss function that measures the discrepancy between hypothetical and actual observations. For example, in this paper we use the square loss function L(u) = ‖u − u_obs‖₂². One advantage of such a formulation is that the known physics, such as the physical laws described by PDEs, is preserved to the largest extent and solved with well-developed and efficient numerical solvers. Meanwhile, we can leverage the approximation power of DNNs.

To solve this optimization problem, one can first solve the physics constraint F(N_θ, u) = 0 in Equation (1) numerically and then plug the solution u(θ) into L(u). This leads to the unconstrained optimization problem

    \min_\theta\; \tilde{L}(\theta) = L(u(\theta)).

We developed a Julia (Bezanson et al. 2017) library, ADCME (Xu and Darve 2020a), with a TensorFlow (Abadi et al. 2016) automatic differentiation backend to solve problems of this type. ADCME expresses both the numerical solver and the DNNs as computational graphs. Therefore, the gradient ∇_θ L̃(θ) can be calculated automatically by back-propagating gradients through both the numerical solvers and the DNNs (for details on how gradient back-propagation works for numerical solvers in ADCME, we refer readers to Xu and Darve (2020b)). In this paper, we use "operator" and "node" interchangeably to refer to a node in the computational graph, i.e., a function that takes incoming edges (intermediate data) as inputs and produces outgoing edges (intermediate data) as outputs.
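Although the paper does not write it out, the reduced gradient that this back-propagation assembles can be expressed in closed form via the implicit function theorem. The identity below is standard and is included here only for orientation (it is not quoted from the paper); it assumes F is continuously differentiable with ∂F/∂u invertible at the solution u(θ):

    \nabla_\theta \tilde{L}(\theta)
      = -\left(\frac{\partial N_\theta}{\partial \theta}\right)^T
         \left(\frac{\partial F}{\partial N_\theta}\right)^T
         \left(\frac{\partial F}{\partial u}\right)^{-T} \nabla_u L(u),
      \qquad \text{where } F(N_\theta, u(\theta)) = 0.

The transposed solve with (∂F/∂u)^T is the step that later motivates the distributed sparse matrix transposition described in the Methodology section, and it is why the benchmark below must solve a linear system with A^T during back-propagation.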
However, one challenge with this approach is that for large-scale problems, the memory and computational costs of the numerical solver are prohibitive. The de facto standard for solving such large-scale problems on modern distributed-memory high performance computing (HPC) architectures is the Message Passing Interface (MPI) (Gabriel et al. 2004; Gropp, Thakur, and Lusk 1999). The TensorFlow backend used by ADCME was originally designed for deep learning/machine learning. Although there is much work on extending TensorFlow to distributed training of machine learning models, some key capabilities for solving scientific computing problems, such as distributed linear algebra and domain decomposition, are still lacking. This paper is about incorporating MPI functionality into ADCME to achieve scalability and flexibility on distributed-memory machines.

The main idea is to parallelize numerical solvers by splitting the mesh or matrices onto different MPI ranks (or processors). Data communication nodes are then inserted into the computational graph. Because the DNNs in our computational engineering applications are typically small, they are duplicated on each processor. Each computational graph includes a set of "communication" nodes (Fig. 1, top), which are absent in a single-processor computational graph. These operators invoke MPI calls and are in charge of data communication between different compute nodes. During the gradient back-propagation, we need to reverse the data-flow direction and the operation of the data communication operators (Wang and Pothen 2015; Utke et al. 2009) (Fig. 1, bottom).

Figure 1: Paradigm for distributed machine learning for computational engineering. The data communication operations are treated as nodes in the computational graph (top: forward computation; bottom: gradient back-propagation via automatic differentiation or physics constrained learning, in which send ↔ receive and broadcast ↔ reduce are exchanged). They are separate from the other computing operators, which makes it easy to reuse existing serial implementations of the numerical PDE solver.
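To make the idea of "reversing" a communication node concrete, the sketch below pairs a broadcast in the forward pass with a sum-reduction of the incoming gradients in the backward pass. It is a generic illustration written against the MPI.jl package rather than ADCME's own mpi_* wrappers (introduced in the Methodology section); the names bcast_forward and bcast_adjoint are ours, and exact MPI.jl call signatures vary between package versions. Treat it as a sketch of the pattern, not of ADCME's implementation.

    using MPI

    # Forward pass: every rank ends up with the root's copy of x.
    function bcast_forward(x::Vector{Float64}, root::Int, comm)
        y = copy(x)
        MPI.Bcast!(y, root, comm)      # y == x_root on every rank afterwards
        return y
    end

    # Backward pass: x_root influences the loss once per rank, so the adjoint
    # sums the per-rank gradients dL/dy back onto the root. (A Reduce to the
    # root would be the economical choice; Allreduce is used here only because
    # its signature has been stable across MPI.jl versions.)
    function bcast_adjoint(dy::Vector{Float64}, root::Int, comm)
        g = MPI.Allreduce(dy, +, comm)
        return MPI.Comm_rank(comm) == root ? g : zero(dy)
    end

    # Example driver: run with, e.g., `mpiexec -n 4 julia bcast_adjoint.jl`.
    MPI.Init()
    comm = MPI.COMM_WORLD
    x  = fill(1.0, 3)                  # only the root's values matter
    y  = bcast_forward(x, 0, comm)
    dy = ones(3)                       # pretend dL/dy = 1 on every rank
    dx = bcast_adjoint(dy, 0, comm)    # on the root: dx == [nprocs, nprocs, nprocs]
    MPI.Finalize()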
Distributed Computing Models

There is much existing work and software for distributed computing with deep neural networks (Bekkerman, Bilenko, and Langford 2011; Jordan and Mitchell 2015) and with numerical PDEs (Griebel and Zumbusch 1999; Notay 1995; Douglas, Haase, and Langer 2003). The two domains have quite different distributed computing models due to the distinct features of their target applications. ADCME MPI embraces a hybrid model that is suitable for inverse modeling in computational engineering, because our method for solving Equation (1) requires a combination of the two models.

In deep learning, one major challenge is that datasets are too large to fit into memory. Therefore, both the datasets and the computational load are distributed onto different machines, and mini-batch optimization algorithms, such as stochastic gradient descent (Bottou 2010), are used. Each processor calculates predictions and gradients, which are aggregated on one or more processors. To scale out further in a limited-bandwidth environment, parameter servers (Li et al. 2013, 2014), where each server stores a part of the parameters, are used. There is also extensive work on model parallelism, which parallelizes the computation by splitting the DNN itself into multiple parts (Chen, Yang, and Cheng 2018; Hewett and Grady II 2020).

The parallel computing model in computational engineering is quite different from that of deep learning. Computational engineering applications feature data communication across neighboring points in a mesh (domain decomposition) or across different parts of a matrix. In terms of the computational graph, this pattern implies many more communications beyond the reduction of gradients at the end. This motivates us to design new distributed computing models for computational engineering inverse modeling applications.

Methodology

ADCME MPI aims to provide a modular, efficient, and flexible implementation for distributed computing in inverse modeling. Conceptually, we treat data communication operations as nodes in the computational graph: they are similar to computational nodes (e.g., a linear solver), except that their responsibility is to invoke MPI calls and to back-propagate gradients in reverse-mode automatic differentiation (Baydin et al. 2017). This solution provides an elegant enhancement to the ADCME library: to convert a single-processor program into a multi-processor one, users only need to insert data communication nodes where needed, and most of the original code is unchanged. In this section, we briefly describe our contributions in ADCME MPI to extend its distributed computing capabilities for coupling DNNs and PDE solvers.

MPI APIs

ADCME MPI provides a set of commonly used MPI primitives, such as mpi_bcast, mpi_gather, mpi_send, etc. These operators are wrappers for the standard MPI APIs. However, these operators are also "differentiable," in the sense that they can handle gradient back-propagation. The gradient back-propagation functionality uses the fact that there is a one-to-one correspondence between forward and backward MPI calls. For example, the forward "send" corresponds to the backward "receive" (Cheng 2006; Towara, Schanen, and Naumann 2015).

Halo Exchange

To enable communication between adjacent patches in a mesh, ADCME provides efficient implementations of data communication operators for halo exchange patterns (Fig. 2). Halo exchange patterns are very common in scientific computing because numerical solvers often involve communication between neighboring points (Bianco 2014). We use a nonblocking send/receive strategy. In the gradient back-propagation phase, the order of sends and receives is reversed.

Figure 2: Left: domain decomposition and halo exchange pattern. Two adjacent patches exchange boundary information. Right: using nonblocking sends/receives (Isend/Irecv followed by Wait) instead of blocking sends/receives avoids deadlocks.
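A minimal sketch of the forward halo exchange for a 1D decomposition is given below. It is again written against MPI.jl with our own helper name halo_exchange (ADCME's actual operator is not shown), it assumes the older positional MPI.jl signatures (recent releases take the communicator positionally with source/destination and tag as keywords), and boundary ranks are expected to pass MPI.PROC_NULL, for which the sends and receives complete as no-ops. In the adjoint pass, the pattern is reversed as described above: the adjoints of the ghost values are sent back and accumulated into the gradients of the neighbors' interior values.

    using MPI

    # Exchange one layer of ghost values with the left and right neighbors.
    # `left` and `right` are neighbor ranks (or MPI.PROC_NULL on the boundary).
    function halo_exchange(u_first::Float64, u_last::Float64, left, right, comm)
        send_left, send_right = [u_first], [u_last]
        recv_left, recv_right = [0.0], [0.0]
        reqs = MPI.Request[]
        # Post receives and sends without blocking: two blocking sends facing
        # each other could deadlock (Fig. 2, right); nonblocking calls cannot.
        push!(reqs, MPI.Irecv!(recv_left,  left,  0, comm))   # from left,  tag 0
        push!(reqs, MPI.Irecv!(recv_right, right, 1, comm))   # from right, tag 1
        push!(reqs, MPI.Isend(send_left,   left,  1, comm))   # to left,    tag 1
        push!(reqs, MPI.Isend(send_right,  right, 0, comm))   # to right,   tag 0
        MPI.Waitall!(reqs)
        return recv_left[1], recv_right[1]                    # new ghost values
    end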
Matrix Transposition

Sparse matrices are very important tools in computational engineering. ADCME MPI stores large sparse matrices in CSR format, and each MPI processor stores a portion of the rows with contiguous row indices.

In reverse-mode automatic differentiation, matrix transposition is a common operation in gradient back-propagation. For example, assume the forward computation is (x is the input, y is the output, and A is a matrix)

    y = Ax.    (2)

Given a loss function L(y), the gradient back-propagation calculates

    \frac{\partial L(y(x))}{\partial x} = \frac{\partial L(y)}{\partial y}\,\frac{\partial y(x)}{\partial x} = \frac{\partial L(y)}{\partial y}\, A.

Here ∂L(y)/∂y is a row vector, and therefore

    \left(\frac{\partial L(y(x))}{\partial x}\right)^T = A^T \left(\frac{\partial L(y)}{\partial y}\right)^T

requires a matrix-vector multiplication where the matrix is A^T.

The transposition of a distributed sparse matrix is implemented in three steps (Fig. 3):

1. The submatrix owned by each MPI processor is split into subblocks, and meta information (e.g., the number of nonzeros in each block) is collected.

2. Each block B exchanges the meta information with the target block, i.e., the block where B^T should be placed.

3. Each subblock is transposed and its data are transferred to the target block.

Figure 3: Transposition of a distributed sparse matrix. Each block exchanges its nonzero entries with the corresponding transposed block. The original and resulting matrices are both stored in CSR format.

Distributed Optimization

In general, the objective function of our problem can be written as a sum of local objective functions

    \min_\theta\; \tilde{L}(\theta) = \sum_{i=1}^{N} f_i(\theta),

where f_i is the local objective function and N is the number of processors. Despite the many existing distributed optimization algorithms, in this work we adopt a simple approach: aggregating the gradients ∇_θ f_i(θ) and updating θ on the root processor. Fig. 4 shows how we convert an existing optimizer into an MPI-enabled optimizer. The basic idea is to let the root processor notify the worker processors whether to compute the loss function or the gradient. The root processor and the workers then execute the same routines, which ensures the correctness of the collective MPI calls.

Figure 4: Refactoring a serial optimizer into an MPI-based optimizer in our framework. The master (root) processor drives the optimization loop and broadcasts a flag that tells the workers which routine to execute:

    # Master (root processor)
    for k = 1, 2, 3, ...
        flag = COMPUTE_OBJ
        mpi_sync!(flag)
        f = compute_objective_function(x)
        ...
        flag = COMPUTE_GRAD
        mpi_sync!(flag)
        dx = compute_gradient(x)
        ...
        x = x - alpha * dx
    end
    flag = OPTIMIZATION_STOP
    mpi_sync!(flag)

    # Worker
    while true
        mpi_sync!(flag)
        if (flag == COMPUTE_OBJ)
            compute_objective_function(x)
        elseif (flag == COMPUTE_GRAD)
            compute_gradient(x)
        else
            break
        end
    end

Numerical Benchmarks

As a demonstration, we present a benchmark with ADCME MPI. The example shows that the overhead introduced by ADCME is very small compared to the actual computation.

The governing equation is Poisson's equation

    \nabla \cdot (\kappa(x) \nabla u(x)) = f(x), \quad x \in \Omega,
    u(x) = 0, \quad x \in \partial\Omega.    (3)

Here f(x) ≡ 1, and κ(x) is approximated by a deep neural network,

    \kappa(x) = N_\theta(x),

where θ denotes the neural network weights and biases. The equation is discretized using the finite difference method on a uniform grid, and the discretization leads to a linear system

    A(\theta)\, u \approx f,    (4)

where u is the solution vector and f is the source vector. Note that A is a sparse matrix whose entries depend on θ. The sparse matrix is constructed using the differentiable halo exchange operator and stored as a distributed CSR matrix.
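For readers who want to reproduce the setup on a single process, the following sketch assembles the analogous serial finite-difference system in 1D (the paper's benchmark is 2D and distributed, and in ADCME the assembly would be written with differentiable operators so that the dependence of A on θ is tracked). The closure passed as kappa is a stand-in for the DNN N_θ; the discretization assumed here is the standard second-order stencil with κ evaluated at cell midpoints.

    using SparseArrays

    # Assemble A u = f for d/dx(κ(x) du/dx) = f on (0, 1) with u(0) = u(1) = 0,
    # on a uniform grid with n interior points. `kappa` stands in for N_θ.
    function assemble_poisson_1d(kappa, n::Int)
        h  = 1.0 / (n + 1)
        x  = collect(1:n) .* h              # interior grid points
        km = kappa.(x .- h / 2)             # κ at left midpoints  x_{i-1/2}
        kp = kappa.(x .+ h / 2)             # κ at right midpoints x_{i+1/2}
        main  = -(km .+ kp) ./ h^2          # coefficient of u_i
        upper = kp[1:end-1] ./ h^2          # coefficient of u_{i+1} in row i
        lower = km[2:end]  ./ h^2           # coefficient of u_{i-1} in row i
        A = spdiagm(-1 => lower, 0 => main, 1 => upper)
        f = ones(n)                         # f(x) ≡ 1, as in the benchmark
        return A, f, x
    end

    A, f, x = assemble_poisson_1d(x -> 1.0 + 0.5 * sin(2π * x), 999)
    u = A \ f    # the distributed version solves this system with Hypre's AMG instead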
Equation (4) is solved using the algebraic multigrid (AMG) method in Hypre (Falgout and Yang 2002). During the gradient back-propagation, we need to solve a linear system with the coefficient matrix A^T. This matrix is obtained using the transposition technique described in the last section.

In the strong scaling experiments, we consider a fixed problem size of 1,800 × 1,800 (mesh size), which implies that the matrix size is around 3.2 million × 3.2 million. In the weak scaling experiments, each MPI processor owns a 300 × 300 block; for example, a run with 3,600 processors has a total problem size of 90,000 × 3,600 ≈ 0.3 billion unknowns.

We first consider the weak scaling case with two configurations: each MPI rank has either 1 core or 4 cores. In the latter case, the TensorFlow backend enjoys the benefit of inter-op parallelism, where independent operators in the computational graph can be executed simultaneously. However, 4 cores do not guarantee a 4x acceleration; the performance depends on the availability of independent tasks, scheduling conflicts, resource contention, etc. Fig. 5 shows the runtime for the forward computation as well as the gradient back-propagation. There are two important observations:

1. By using more cores per processor, the runtime is reduced significantly. For example, the runtime for the backward pass is reduced from around 30 seconds to around 10 seconds by switching from 1 core to 4 cores per processor.

2. The runtime for the backward pass is typically less than twice that of the forward computation. Although the backward pass requires solving two linear systems (one of which also appears in the forward computation), the AMG linear solver in the back-propagation may converge faster, and therefore may cost less than it does during the forward pass.

Figure 5: Weak scaling runtime for the forward computation and the gradient back-propagation; each MPI processor solves a local problem of the same size. Top: each MPI processor has 1 core; bottom: each MPI processor has 4 cores. We see a jump near 64 cores because this is where inter-node network communication starts to take place (recall that each of our CPUs has 32 cores and each node has two CPUs).

Additionally, Fig. 6 shows the overhead, defined as the difference between the total runtime and the Hypre linear solver time, for both the forward and the backward calculation. We see that the overhead is quite small compared to the total time, especially when the problem size is large. This indicates that the ADCME MPI implementation is very effective.

Figure 6: Overhead of ADCME relative to the Hypre linear solve. The top and bottom bar plots correspond to the top and bottom plots in Fig. 5, respectively.
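The speedup and efficiency reported for the scaling study below (Fig. 8) are not defined explicitly in the text. Assuming the usual conventions, with T(p) the runtime on p processors and p_0 the smallest processor count in the study, they are

    S(p) = \frac{T(p_0)}{T(p)}, \qquad E(p) = \frac{p_0\, S(p)}{p},

so that perfect strong scaling corresponds to S(p) = p/p_0 and E(p) = 1. For the weak scaling runs in Fig. 5, where the work per processor is fixed, the corresponding efficiency is simply T(p_0)/T(p).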
In Fig. 7, we consider strong scaling: we fix the total problem size and split the mesh onto different numbers of MPI processors. Fig. 7 shows the runtime for the forward computation and the gradient back-propagation. We can reduce the runtime of the expensive gradient back-propagation by more than a factor of 20 by using more than 100 MPI processors. Fig. 8 shows the speedup and efficiency. We can see that runs with 4 cores per processor have smaller runtimes than runs with 1 core per processor. We note that finding a scaling sweet spot for a mixed programming model (MPI and OpenMP) of Hypre AMG solvers on multicore clusters is challenging (Baker, Schulz, and Yang 2010), and the intra- and inter-op parallelism of the TensorFlow backend adds further difficulty to finding a scaling strategy.

Figure 7: Strong scaling runtime for the forward computation and the gradient back-propagation: the total problem size is fixed and we increase the number of MPI processors (each processor has 1 or 4 cores). The plots show the total time and the Hypre linear solver time for both configurations.

Figure 8: Speedup and efficiency for the parallel solution of Poisson's equation (forward and backward passes, with 1 or 4 cores per processor).

Conclusion

We presented the functionalities of ADCME MPI. Our benchmark results show that the overhead introduced by ADCME for distributed computing programs is very small compared with the computing time. The ADCME MPI distributed computing solution is quite flexible, allowing users to plug in custom parallel algorithms or libraries at their discretion. With the advent of experimental techniques that enable gathering large amounts of data, deep-neural-network-based data-driven modeling will become an essential tool for scientific discovery. The growing dataset and problem sizes add another level of challenges. Therefore, ongoing work on distributed computing in machine learning for computational engineering remains an important and promising direction for the foreseeable future.

Acknowledgements

This work is supported by the Applied Mathematics Program within the Department of Energy (DOE) Office of Advanced Scientific Computing Research (ASCR), through the Collaboratory on Mathematics and Physics-Informed Learning Machines for Multiscale and Multiphysics Problems Research Center (DE-SC0019453).

References

Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Baker, A. H.; Schulz, M.; and Yang, U. M. 2010. On the performance of an algebraic multigrid solver on multicore clusters. In International Conference on High Performance Computing for Computational Science, 102–115. Springer.

Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2017. Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research 18(1): 5595–5637.

Bekkerman, R.; Bilenko, M.; and Langford, J. 2011. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press.

Bezanson, J.; Edelman, A.; Karpinski, S.; and Shah, V. B. 2017. Julia: A fresh approach to numerical computing. SIAM Review 59(1): 65–98.

Bianco, M. 2014. An interface for halo exchange pattern. www.prace-ri.eu/IMG/pdf/wp86.pdf.

Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, 177–186. Springer.

Chen, C.-C.; Yang, C.-L.; and Cheng, H.-Y. 2018. Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. arXiv preprint arXiv:1809.02839.

Cheng, B. N. 2006. A duality between forward and adjoint MPI communication routines.

Douglas, C. C.; Haase, G.; and Langer, U. 2003. A tutorial on elliptic PDE solvers and their parallelization. SIAM.

Falgout, R. D.; and Yang, U. M. 2002. hypre: A library of high performance preconditioners. In International Conference on Computational Science, 632–641. Springer.

Fan, T.; Xu, K.; Pathak, J.; and Darve, E. 2020. Solving Inverse Problems in Steady State Navier-Stokes Equations using Deep Neural Networks. arXiv preprint arXiv:2008.13074.

Gabriel, E.; Fagg, G. E.; Bosilca, G.; Angskun, T.; Dongarra, J. J.; Squyres, J. M.; Sahay, V.; Kambadur, P.; Barrett, B.; Lumsdaine, A.; et al. 2004. Open MPI: Goals, concept, and design of a next generation MPI implementation. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, 97–104. Springer.

Griebel, M.; and Zumbusch, G. 1999. Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves. Parallel Computing 25(7): 827–843.
Gropp, W.; Thakur, R.; and Lusk, E. 1999. Using MPI-2: Advanced features of the message passing interface. MIT Press.

Hewett, R. J.; and Grady II, T. J. 2020. A Linear Algebraic Approach to Model Parallelism in Deep Learning. arXiv preprint arXiv:2006.03108.

Huang, D. Z.; Xu, K.; Farhat, C.; and Darve, E. 2020. Learning constitutive relations from indirect observations using deep neural networks. Journal of Computational Physics 109491.

Jordan, M. I.; and Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science 349(6245): 255–260.

Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 583–598.

Li, M.; Zhou, L.; Yang, Z.; Li, A.; Xia, F.; Andersen, D. G.; and Smola, A. 2013. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, volume 6, 2.

Meng, X.; Li, Z.; Zhang, D.; and Karniadakis, G. E. 2020. PPINN: Parareal physics-informed neural network for time-dependent PDEs. Computer Methods in Applied Mechanics and Engineering 370: 113250.

Notay, Y. 1995. An efficient parallel discrete PDE solver. Parallel Computing 21(11): 1725–1748.

Pakravan, S.; Mistani, P. A.; Aragon-Calvo, M. A.; and Gibou, F. 2020. Solving inverse-PDE problems with physics-aware neural networks. arXiv preprint arXiv:2001.03608.

Raissi, M.; Perdikaris, P.; and Karniadakis, G. E. 2019. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378: 686–707.

Towara, M.; Schanen, M.; and Naumann, U. 2015. MPI-Parallel Discrete Adjoint OpenFOAM. In ICCS, 19–28.

Utke, J.; Hascoet, L.; Heimbach, P.; Hill, C.; Hovland, P.; and Naumann, U. 2009. Toward adjoinable MPI. In 2009 IEEE International Symposium on Parallel & Distributed Processing, 1–8. IEEE.

Wang, M.; and Pothen, A. 2015. High Order Automatic Differentiation with MPI.
Xu, K.; and Darve, E. 2019. The neural network approach to inverse problems in differential equations. arXiv preprint arXiv:1901.07758.

Xu, K.; and Darve, E. 2020a. ADCME: Learning Spatially-varying Physical Fields using Deep Neural Networks.

Xu, K.; and Darve, E. 2020b. Physics constrained learning for data-driven inverse modeling from sparse observations. arXiv preprint arXiv:2002.10521.

Xu, K.; Huang, D. Z.; and Darve, E. 2020. Learning constitutive relations using symmetric positive definite neural networks. arXiv preprint arXiv:2004.00265.