<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Parallel Numerical Methods for Ordinary Differential Equations: a Survey</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Svyatoslav I. Solodushkin</string-name>
          <email>s@mail.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina F. Iumanova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Krasovskii Institute of Mathematics and Mechanics</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ural Federal University</institution>
          ,
          <addr-line>Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tremendous efforts have been made in the field of parallel numerical methods for ODEs. Special volumes are dedicated to the theme we are going to discuss [1]. The aim of this paper is a brief survey of some classical results and recent developments. We particularly focus on small-scale parallelism across the method, not on large-scale parallelism across the system.</p>
      </abstract>
      <kwd-group>
        <kwd>parallel numerical methods</kwd>
        <kwd>parallelism across the method</kwd>
        <kwd>small-scale parallelism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Modern science poses many computationally intensive problems, among them
weather forecasting and heart modeling. It is widely believed that the only
appropriate means to solve these problems is to use parallel computers. This fact
dictates the necessity to elaborate special classes of numerical algorithms which
are inherently parallel and designed for use on parallel computers.</p>
      <p>
        In this paper we consider parallel numerical methods for the initial value problem
(IVP) for ordinary differential equations
y'(x) = f(x, y(x)), y(x_0) = y_0.
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
Such methods are traditionally divided into two groups:
1. parallelism across the system or, which is the same, parallelism across the
space;
2. parallelism across the method or, which is the same, parallelism across the
time.
      </p>
      <p>Methods falling into the first group are usually based on more or less obvious
techniques such as exploiting parallelism in the evaluation of the function f or in
solving the linear equations which arise at each step of an otherwise standard IVP
solver. For traditional forward-step methods, e.g. explicit Runge–Kutta methods, this
approach amounts to simply dividing the original system into a number of subsystems
which are processed concurrently by separate computing nodes, with
interprocessor communications at the end of each step of the IVP solver. Clearly this
approach is effective for high-dimensional systems only. In practice such systems
appear in molecular dynamics modelling, where thousands of particles should be
taken into account, or after the discretization of time-dependent PDEs.</p>
      <p>The second group includes numerical algorithms which allow several
simultaneous function evaluations within each step of the IVP solver. These methods are
suitable for multi-core processors or computers with a few processors and fast
interprocessor communication which use shared memory.</p>
      <p>The second group also includes methods which seek the solution at many steps
simultaneously; they are also known as time parallel methods and can be
classified into four different groups: methods based on multiple shooting, methods
based on domain decomposition and waveform relaxation, space-time multigrid
methods, and direct time parallel methods [3].</p>
      <p>A dedicated survey should be devoted to parallel across the space numerical
methods, and this is the reason why they are beyond the scope of the present article,
which is focused on parallel across the time numerical methods.</p>
    </sec>
    <sec id="sec-1b">
      <title>Predictor-Corrector, Runge–Kutta and Block Type Methods</title>
      <p>
        Miranker and Liniger [5] were probably the first to propose a small-scale
parallel method of predictor-corrector type. They mentioned that at first sight the
sequential nature of numerical methods like (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) does not permit parallel
computation on all of the processors to be performed, in other words the front
of computation is too narrow to take advantage of more than one processor.
      </p>
      <p>
        y_{n+1}^p = y_n^c + (h/2) [3 f(x_n, y_n^c) - f(x_{n-1}, y_{n-1}^c)],
y_{n+1}^c = y_n^c + (h/2) [f(x_n, y_n^c) + f(x_{n+1}, y_{n+1}^p)].
Indeed, y_n^c must be computed before y_{n+1}^p, which in turn must be computed
before y_{n+1}^c, and so on.</p>
      <p>
        However, if the predictor is taken in another form, as represented in (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), then
y_n^c and y_{n+1}^p can be evaluated in parallel by two processors.
      </p>
      <p>
        y_{n+1}^p = y_{n-1}^c + 2h f(x_n, y_n^p),
y_n^c = y_{n-1}^c + (h/2) [f(x_{n-1}, y_{n-1}^c) + f(x_n, y_n^p)].
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
      </p>
      <p>
        Method (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) is based on a pair of Adams–Bashforth and Adams–Moulton
methods of second order, where the first one is used to estimate y_{n+1}, which
is substituted into the second one to avoid the implicitness and, consequently, the
necessity of solving a nonlinear equation. In (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) the predictor outruns the corrector
and stands one step ahead; this makes the parallel implementation possible.
      </p>
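      <p>To make the parallel structure of (3) concrete, the following sketch (illustrative code, not from the sources surveyed; the name parallel_pc and the explicit Euler bootstrap for the first predicted value are our own choices) runs the predictor and the corrector of each step as two independent tasks on two workers.</p>
      <preformat>
```python
from concurrent.futures import ThreadPoolExecutor

def parallel_pc(f, x0, y0, h, n_steps):
    """Predictor-corrector scheme (3): within each step the predictor
    (which runs one step ahead) and the corrector use only values from
    earlier steps, so the two updates are independent tasks."""
    xs = [x0 + i * h for i in range(n_steps + 2)]
    yc = [y0]                      # corrected values y_0^c, y_1^c, ...
    yp = [y0, y0 + h * f(x0, y0)]  # predicted values; y_1^p from an Euler step
    with ThreadPoolExecutor(max_workers=2) as pool:
        for n in range(1, n_steps + 1):
            # both tasks read only data from earlier steps -> independent
            fut_p = pool.submit(lambda: yc[n - 1] + 2 * h * f(xs[n], yp[n]))
            fut_c = pool.submit(lambda: yc[n - 1] + (h / 2)
                                * (f(xs[n - 1], yc[n - 1]) + f(xs[n], yp[n])))
            yp.append(fut_p.result())
            yc.append(fut_c.result())
    return yc
```
      </preformat>
      <p>For y' = -y the corrected values agree with e^{-x} to roughly second-order accuracy; the two submitted tasks share no data of the current step, which is exactly the widened front of computation described above.</p>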
      <p>As a development of this idea Miranker and Liniger showed how to
systematically derive a general class of numerical integration methods which can be
executed on multiple processors in parallel, and presented a stability and convergence
analysis for those methods. More precisely, they considered predictor-corrector
methods in a PECE mode (one predicted derivative evaluation and one
corrected derivative evaluation) where calculation advances s steps at a time. So,
if 2s processors are available it is possible to arrange tasks in such a way that
each processor performs either a predictor or a corrector calculation.</p>
      <p>Continuing the research started by Miranker and Liniger, Katz, Franklin and Sen [6]
studied the stability properties of predictor-corrector parallel methods. Specifically,
they considered the case where there are two processors, i.e. s = 1. They showed
that for a fixed Adams–Moulton corrector the optimally stable parallel predictor
(in the sense that the parallel scheme has a maximum stability interval on the
negative real axis) is the Adams–Bashforth predictor shifted to the right by one
integration step. The methods constructed in [6] are not A-stable.</p>
      <p>
        Another type of numerical method which possesses internal parallelism is the
block method. A block method which finds the solution at r points in one step is
called an r-point block method. By analogy with the classification into one-step and
multistep methods in the classical case, block methods can also be divided into these
two groups. Shampine and Watts studied a family of one-step block methods,
and gave one specific example (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ):
y_{n+1} = y_n + (h/12) [5 f(x_n, y_n) + 8 f(x_{n+1}, y_{n+1}) - f(x_{n+2}, y_{n+2})],
y_{n+2} = y_n + (h/3) [f(x_n, y_n) + 4 f(x_{n+1}, y_{n+1}) + f(x_{n+2}, y_{n+2})].
Shampine and Watts took advantage of the fact that this formula is more accurate at
the end of the block than in the middle and proved that (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) has global convergence
of order O(h^4) instead of the O(h^3) one might expect. They also proved the A-stability
of (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), so, in a sense, they overcame the Dahlquist barrier, which states that among
linear multistep methods the trapezoidal rule is the most accurate A-stable
method.
      </p>
      <p>To avoid the necessity of solving the nonlinear equations which arise at each step,
Shampine and Watts supplied a predictor-corrector form of one-step r-point block
methods. A good prediction allows them to compute the corrector only two times
rather than iterate until convergence; furthermore, a suitable predictor-corrector
scheme asymptotically gives the same result as iterating to completion.
Unfortunately, block implicit methods used in a predictor-corrector combination
are not even A(α)-stable.</p>
      <p>To expand the stability region, Lee and Song [8] proposed a family of explicit
and implicit multistep block methods, as well as block predictor-corrector schemes
based on pairs of these methods, where the explicit one is used as a predictor.</p>
      <p>Feldman used a collocation approach to derive implicit multistep block
methods of general form [9]. He presented formulas for an A-stable one-step 4-point
block implicit method and mentioned that the collocation approach allows one to
construct A-stable one-step block methods of higher order as well. At the same time,
multistep methods constructed in this way are not A-stable.</p>
      <p>A deeper analysis of predictor-corrector methods suggests a simple general
paradigm for achieving parallelism across the time in traditional forward-step
methods: group the stages of the method into blocks such that all evaluations
associated with each block can be performed simultaneously. This idea is
illustrated by parallel explicit and diagonally implicit Runge–Kutta (ERK and
DIRK, respectively) methods in a very natural way.</p>
      <p>An s-stage RK method has the following form</p>
      <p>
        y_{n+1} = y_n + h Σ_{i=1}^{s} b_i k_{n,i},
k_{n,i} = f(x_n + c_i h, y_n + h Σ_{j=1}^{s} a_{ij} k_{n,j}), i = 1, ..., s,
(
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
where b_i, c_i, a_{ij} are coefficients; a_{ij} = 0 for i ≤ j for explicit RK formulas, and
a_{ij} = 0 for i &lt; j with at least one a_{ii} ≠ 0 for diagonally implicit RK formulas. Let
us denote by A = {a_{ij}} the s × s RK matrix.
      </p>
      <p>Suppose, for example, that the coefficients of an explicit RK method have the
zero-pattern presented in Table 1.</p>
      <p>Each arrow in the corresponding "production graph" depicted in Fig. 1,
pointing from vertex i to vertex j, stands for a non-zero a_{ji}. Here the vertices 2
and 3 are independent and can be processed in parallel. The number of vertices
in the longest chain of successive arrows (in this case 3) is called the number of
effective stages [13] or the number of sequential function evaluations [12].</p>
      <p>In general, let us consider an RK matrix A which can be partitioned (possibly
after a permutation of the stages) as a block lower triangular matrix with diagonal
blocks D_1, ..., D_σ, where, for ERK methods, D_l, l = 1, ..., σ, are zero matrices and,
for DIRK methods, D_l, l = 1, ..., σ, are (possibly different) diagonal matrices. For any
l = 1, ..., σ, all k_{n,i} in the l-th block of an ERK method can be computed
concurrently as soon as all k_{n,j} in all previous blocks are available. Similarly, for any
l = 1, ..., σ, each k_{n,i} in the l-th block of a DIRK method depends on itself and the
previously computed k_{n,j} in blocks 1, ..., l - 1. Thus, all k_{n,i} in the l-th block can
be computed in parallel by solving a system of uncoupled equations.</p>
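      <p>A minimal sketch of this block-wise stage evaluation (our own illustrative code; the helper erk_step_blockwise and the explicit block list argument are assumptions, with numpy used for the tableau) is:</p>
      <preformat>
```python
import numpy as np

def erk_step_blockwise(f, x, y, h, A, b, c, blocks):
    """One ERK step (5) with stages grouped into blocks: every stage in a
    block depends only on stages of earlier blocks, so the inner loop over
    a block could run with one processor per stage."""
    s = len(b)
    k = np.zeros(s)
    for block in blocks:          # blocks are processed sequentially
        for i in block:           # stages inside a block are independent
            k[i] = f(x + c[i] * h, y + h * (A[i] @ k))
    return y + h * np.dot(b, k)
```
      </preformat>
      <p>For the classical fourth-order method the blocks are the singletons [[0], [1], [2], [3]]; for a tableau with the zero-pattern of Table 1 one could pass [[0], [1, 2], [3]], letting the two middle stages run concurrently.</p>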
      <p>Unfortunately, these ERK schemes do not have a high potential for
parallelism. The following theorem gives a severe restriction on parallel methods [12].
Theorem 1. The order p of an ERK method with σ sequential stages satisfies
the inequality p ≤ σ for any number of available processors.</p>
      <p>Let us remark that the stability function of these ERK methods is a polynomial
and therefore they cannot be A-stable.</p>
      <p>
        On the other hand, DIRK methods have better stability properties and allow
one to achieve a higher order of convergence. For example, method (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) is L-stable
(but not B-stable) and is of fourth order.
      </p>
      <p>
        (The Butcher tableau of this four-stage DIRK method is omitted here.)
(
        <xref ref-type="bibr" rid="ref6">6</xref>
        )
This method consists of two uncoupled blocks, i.e. σ = 2, of two equations
each; hence k_{n,1} and k_{n,2} can be found in parallel, and after that k_{n,3} and k_{n,4}
can be found in parallel too.
      </p>
      <p>Theorem 1 motivates the question of whether it is always possible
to find an ERK method of order p using no more than p effective stages, assuming
that a sufficient number of processors is available; such methods are called
P-optimal. For a sequential ERK method of order p ≤ 4 it is possible to choose the
coefficients of the method in such a way that the number of stages s is equal to p, so
these sequential ERK methods are P-optimal. At the same time, for p ≥ 5 the
number of stages is greater than p and increases rapidly, see Table 2. Also
the minimal number of stages s_min, the number of stages s for which these RK
methods have actually been constructed, and the number of effective stages s_eff
are reported there. Special techniques allow one to construct methods which are
P-optimal because some stages are performed in parallel (Table 2), so these
methods allow one to decrease the computation time with the help of parallelism.
One such technique is parallel iteration: the implicit method (5) is solved by
fixed-point iteration, which gives form (7), where m sequential iterations are
performed and s processors are assumed to be available. The following theorem
discovers the connection between the order and the number of iterations.</p>
      <p>
        Theorem 2. The parallel iterated Runge–Kutta method (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) in form (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) is of
order p = min(p_0, m), where p_0 denotes the order of the basic method.
This theorem shows that the choice m = p_0 yields P-optimal ERK methods.
      </p>
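      <p>The iteration behind Theorem 2 can be sketched as a fixed-point sweep over the stage equations of (5); in the fragment below (our own illustrative code, using the trivial predictor k^(0)_i = f(x_n, y_n)) each of the m sweeps consists of s mutually independent evaluations.</p>
      <preformat>
```python
import numpy as np

def pirk_step(f, x, y, h, A, b, c, m):
    """Parallel iterated RK: m fixed-point sweeps over the implicit stage
    equations of (5); inside a sweep all s evaluations are independent and
    could run on s processors."""
    s = len(b)
    k = np.full(s, f(x, y), dtype=float)  # trivial predictor k^(0)
    for _ in range(m):                    # m sequential sweeps
        k = np.array([f(x + c[i] * h, y + h * np.dot(A[i], k))
                      for i in range(s)])
    return y + h * np.dot(b, k)
```
      </preformat>
      <p>Taking the 2-stage Gauss–Legendre method as the basic method (order p_0 = 4) and m ≥ 4 sweeps reproduces the fourth-order behaviour predicted by Theorem 2.</p>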
      <p>The next question is the least number of processors needed to implement a
P-optimal ERK method. Van der Houwen and Sommeijer [13] took as the basic method
the s-stage Gauss–Legendre type RK method, which has the smallest number of
stages relative to its order. This allowed them to construct a method
of order p = 2s which is P-optimal on s processors.</p>
      <p>In this scheme one may use linear multistep (LM) predictors, reducing the
number of effective stages. The first results based on LM predictors were reported by
Lie [14], using a 4th-order, 2-stage Gauss–Legendre corrector and a 3rd-order
Hermite extrapolation predictor. Further investigation of LM correctors in this
scheme was made in [13].</p>
      <p>In conclusion it should be said that predictor-corrector and block methods, as
well as Runge–Kutta methods, are ideally suited to be used on the few cores of a
multicore processor, but they do not have the potential for large-scale parallelism.</p>
    </sec>
    <sec id="sec-2">
      <title>Extrapolation Methods</title>
      <p>Many authors note that extrapolation methods possess a high degree of
parallelism and offer an extremely simple technique for generating high-order
methods [15-18].</p>
      <p>
        Let us consider a basic method of order p for integrating (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) from x_0 until
x_1 = x_0 + h. Usually a simple method of low order, e.g. the explicit Euler method or
the Crank–Nicolson method, is taken as the basic one. Denote the numerical
approximation to the exact solution value y(x_1) by y(x_1; h). Let the error of
approximation allow a series expansion in powers of h^q, where q = 2 for a symmetric
basic method and q = 1 otherwise. Let us consider a sequence of integration steps
h_i = h/i, i = 1, ..., r; then the corresponding r-point extrapolation method is
defined as follows
      </p>
      <p>
        y_1 = Σ_{i=1}^{r} α_i y(x_0 + h; h_i),
Σ_{i=1}^{r} α_i = 1,
Σ_{i=1}^{r} α_i h_i^j = 0, j = p, p + q, ..., p + (r - 2)q.
(
        <xref ref-type="bibr" rid="ref8">8</xref>
        )
Theorem 3. Let the basic method providing the values y(x_1; h_i) be of order p;
then the extrapolation method (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) has order p + q(r - 1).
      </p>
      <p>
        As soon as y_1 is computed one can perform the second step using y_1 as the
new initial value at x_1, etc. Obviously the different terms y(x_1; h_i) in (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) can
be computed in parallel. It is clear that the time needed to compute y(x_1; h_i)
is proportional to i; hence, to balance the load, it is recommended to compute
y(x_1; h_1) and y(x_1; h_r) on the first processor, y(x_1; h_2) and y(x_1; h_{r-1}) on the
second processor, and so on. In this way ⌊(r + 1)/2⌋ processors are used.
      </p>
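      <p>One extrapolation step (8) can be sketched as follows (our own illustrative code, assuming the explicit Euler method as the basic method, so p = q = 1; the weights α_i are obtained by solving the order conditions directly):</p>
      <preformat>
```python
import numpy as np

def extrapolated_step(f, x0, y0, h, r):
    """r-point extrapolation step (8) built on explicit Euler (p = q = 1).
    The r inner integrations with steps h/i are mutually independent and
    could be distributed over processors as described above."""
    approx = []
    for i in range(1, r + 1):          # y(x0 + h; h_i): i Euler substeps
        y, hi = y0, h / i
        for k in range(i):
            y = y + hi * f(x0 + k * hi, y)
        approx.append(y)
    # weights: sum alpha_i = 1 and sum alpha_i * h_i^j = 0, j = 1..r-1
    hs = np.array([h / i for i in range(1, r + 1)])
    M = np.vstack([np.ones(r)] + [hs ** j for j in range(1, r)])
    rhs = np.zeros(r)
    rhs[0] = 1.0
    alpha = np.linalg.solve(M, rhs)
    return float(np.dot(alpha, approx))
```
      </preformat>
      <p>By Theorem 3 the step has order p + q(r - 1) = r; with r = 4 and the rather coarse step h = 0.5 on y' = y a single step already matches e^{0.5} to about three decimal places.</p>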
      <p>Based on the Richardson extrapolation technique, van der Houwen [15] constructed
a method which requires fewer processors to be optimal than the
Gauss–Legendre-based parallel iterated RK method. For example, an optimal RK method of
order ten requires only three processors when using Richardson extrapolation
and five processors when using predictor-corrector iteration of the tenth-order
Gauss–Legendre method.</p>
      <p>Korch et al. considered in [17] parallel explicit extrapolation methods for
high-dimensional ODEs, with dimension up to 10^8. They analyze and compare the
scalability of several implementations for distributed-memory architectures which use
different load balancing strategies and different loop structures. By exploiting the
special structure of a large class of ODE systems, the communication costs can be
reduced significantly. Moreover, by processing the micro-steps using a pipeline-like
structure, the locality of memory references can be increased and better
utilization of the cache memory can be achieved.
</p>
    </sec>
    <sec id="sec-3">
      <title>Multiple Shooting and Parareal Algorithm</title>
      <p>According to Gander [3], time parallel time integration methods have received
renewed interest over the last decade because of the advent of massively parallel
computers, which is mainly due to the clock speed limit reached on today's
processors. When solving time dependent partial differential equations, the time
direction is usually not used for parallelization. But when parallelization in space
saturates, the time direction offers itself as a further direction for parallelization.
The time direction is, however, special: for evolution problems there is a
causality principle, i.e. the solution later in time is affected (it is even determined)
by the solution earlier in time, but not the other way round. Algorithms trying
to use the time direction for parallelization must therefore be special and take
this very different property of the time dimension into account.</p>
      <p>Below we overview only two parallel in time methods.</p>
      <p>
        The basic idea of the multiple shooting method proposed in [19] is to apply a
shooting method, which was originally developed for boundary value problems,
to the initial value problem (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ). To do this one splits the time interval into
subintervals [x_0, x_1], [x_1, x_2], ..., [x_{n-1}, x_n], and then solves on each subinterval
the underlying initial value problem
y_0'(x) = f(x, y_0(x)), y_0(x_0) = y_0;
y_1'(x) = f(x, y_1(x)), y_1(x_1) = U_1; ...;
y_{n-1}'(x) = f(x, y_{n-1}(x)), y_{n-1}(x_{n-1}) = U_{n-1}
(
        <xref ref-type="bibr" rid="ref9">9</xref>
        )
together with the matching conditions U_1 = y_0(x_1), ..., U_{n-1} = y_{n-2}(x_{n-1}),
where U_i, i = 1, ..., n - 1, are so-called shooting parameters which are unknown.
This leads to the system of non-linear equations
U_1 - y_0(x_1) = 0,
U_2 - y_1(x_2) = 0,
...,
U_{n-1} - y_{n-2}(x_{n-1}) = 0.
(
        <xref ref-type="bibr" rid="ref10">10</xref>
        )
      </p>
      <p>To determine the shooting parameters, Newton's method can be applied
to this system. Fortunately, the Jacobian matrix has a very special band form
and can be inverted easily. The corresponding iteration process has the following
form</p>
      <p>
        U_0^{k+1} = y_0,
U_{i+1}^{k+1} = y_i(x_{i+1}; U_i^k) + (∂y_i/∂U_i)(x_{i+1}; U_i^k) (U_i^{k+1} - U_i^k), i = 0, 1, ..., n - 2,
(
        <xref ref-type="bibr" rid="ref11">11</xref>
        )
and allows a parallel implementation. Chartier and Philippe proved that (
        <xref ref-type="bibr" rid="ref11">11</xref>
        )
converges locally quadratically. Note that (
        <xref ref-type="bibr" rid="ref11">11</xref>
        ) has the form of the Bellen and Zennaro
method [20], who used the multiple shooting technique to parallelize discrete
recurrence relations y_{n+1} = F_{n+1}(y_n) and reported a speedup of 53 for a
problem with 400 time steps.
      </p>
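      <p>For the scalar linear problem y' = a y the subinterval solutions in (11) and their derivatives are available in closed form, which makes the iteration easy to sketch (our own illustrative code; for a nonlinear f the propagator and its derivative would come from integrating the ODE and its variational equation):</p>
      <preformat>
```python
import math

def multiple_shooting_linear(a, y0, x_end, n, iters):
    """Iteration (11) for y' = a y on n equal subintervals, where the
    subinterval solution y_i(x_{i+1}; U_i) = exp(a dx) U_i and its
    derivative with respect to U_i are both the constant factor prop."""
    dx = x_end / n
    prop = math.exp(a * dx)     # exact propagator over one subinterval
    U = [y0] * n                # initial guess for the shooting parameters
    for _ in range(iters):
        U_new = [y0]            # U_0^{k+1} = y_0
        for i in range(n - 1):
            # (11): propagated old value plus derivative times the update
            U_new.append(prop * U[i] + prop * (U_new[i] - U[i]))
        U = U_new
    return U
```
      </preformat>
      <p>Because the problem is linear, a single iteration already yields U_i = y_0 e^{a x_i} exactly, consistent with the local quadratic convergence of the Newton iteration.</p>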
      <p>Following Lions, Maday and Turinici [21] let us explain the parareal
algorithm on the simple scalar model
y'(x) = -a y(x), y(0) = y_0.
The solution is first approximated using backward Euler on the time grid x_i
with coarse time step ΔT,</p>
      <p>
        (Y_{i+1}^1 - Y_i^1)/ΔT = -a Y_{i+1}^1, Y_0^1 = y_0.
The approximate solution values Y_i^1 are then used to compute on each time
interval [x_i, x_{i+1}], exactly and in parallel, the solution of
y_i'(x) = -a y_i(x), y_i(x_i) = Y_i^1.
Theorem 4. [21] The parareal scheme is of order k, i.e. there exists C_k such
that
|Y_i^k - y(x_i)| + max_{x in [x_i, x_{i+1}]} |y_i^k(x) - y(x)| ≤ C_k ΔT^k.
      </p>
      <p>In particular, this result means that with each iteration of the parareal
algorithm, one obtains a numerical time stepping scheme which has a truncation
error that is one order higher than before. So for a fixed iteration number k,
one can obtain high order time integration methods that are naturally parallel.
The same proposition also holds for Forward Euler. Unfortunately in both
discretization schemes the stability of the higher order methods obtained with the
parareal correction scheme degrades with iterations.</p>
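      <p>The parareal iteration for the scalar model can be sketched as follows (our own illustrative code: backward Euler with step ΔT as the coarse propagator and the exact exponential as the fine propagator; all fine solves within one iteration are independent):</p>
      <preformat>
```python
import math

def parareal(a, y0, x_end, n, iters):
    """Parareal for y' = -a y on n coarse intervals of size dT: coarse
    sweep with backward Euler, then corrections using exact fine solves
    which are independent across subintervals."""
    dT = x_end / n
    coarse = lambda u: u / (1 + a * dT)       # one backward Euler step
    fine = lambda u: u * math.exp(-a * dT)    # exact solve on a subinterval
    Y = [y0]                                  # initial coarse sweep: Y_i^1
    for i in range(n):
        Y.append(coarse(Y[i]))
    for _ in range(iters - 1):
        F = [fine(Y[i]) for i in range(n)]    # the parallel fine solves
        Y_new = [y0]
        for i in range(n):
            # parareal correction: new coarse + (fine - old coarse)
            Y_new.append(coarse(Y_new[i]) + F[i] - coarse(Y[i]))
        Y = Y_new
    return Y
```
      </preformat>
      <p>Each correction makes one more grid value exact, so after n corrections the scheme reproduces the exact solution, while the early iterations show the order-per-iteration gain of Theorem 4.</p>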
      <p>Lions et al. gave two numerical examples: one for a heat equation, where they
obtain a simulated speedup of a factor of 8 with 500 processors, and one for a
semi-linear advection-diffusion problem, where the obtained speedup was 18.</p>
      <p>Acknowledgements. This research is supported by RFBR 17-01-00033 and
17-01-00392, and by the Russian Science Foundation (RSF) 14-35-00005. We acknowledge
the support by the program 02.A03.21.0006 on 27.08.2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <source>Advances in Computational Mathematics Volume 7, Issue</source>
          <volume>1</volume>
          {
          <fpage>2</fpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gear</surname>
            ,
            <given-names>C.W.:</given-names>
          </string-name>
          <article-title>The parallel methods for ordinary differential equations</article-title>
          ,
          <source>Tech. Rep</source>
          . UIUCDCS R{
          <volume>87</volume>
          {
          <fpage>1369</fpage>
          ,
          <string-name>
            <surname>Comp</surname>
          </string-name>
          . Sci. Dept.,
          <source>Univ. of Illinois</source>
          , Urbana, IL,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gander</surname>
            ,
            <given-names>M.J.:</given-names>
          </string-name>
          <article-title>50 Years of Time Parallel Time Integration</article-title>
          .
          <source>In: Multiple Shooting and Time Domain Decomposition</source>
          , Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>K.R.:</given-names>
          </string-name>
          <article-title>A Survey of Parallel Numerical Methods for Initial Value Problems for Ordinary Differential Equations</article-title>
          .
          <source>In: IEEE Transactions on Magnetics</source>
          ,
          <year>1991</year>
          . Vol.
          <volume>27</volume>
          . No. 5, P.
          <volume>3792</volume>
          {
          <fpage>3797</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Miranker</surname>
            ,
            <given-names>W.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liniger</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Parallel methods for the numerical integration of ordinary differential equations</article-title>
          . In: Math. Comp.,
          <year>1967</year>
          . Vol.
          <volume>21</volume>
          . P.
          <volume>303</volume>
          {
          <fpage>320</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>I.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Optimally stable parallel predictors for Adams–Moulton correctors</article-title>
          .
          <source>In: Comp. and Math</source>
          .
          <source>with Appl.</source>
          ,
          <year>1977</year>
          . Vol 3. pp 217{
          <fpage>233</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shampine</surname>
            ,
            <given-names>L.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Watts</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          :
          <article-title>Block Implicit One-Step Methods</article-title>
          . In: Math. Comp.,
          <year>1969</year>
          . V. 23. P.
          <volume>731</volume>
          {
          <fpage>740</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
          </string-name>
          , R.W.:
          <article-title>Computation of Stability Region of some Block Methods and Their Application to Numerical Solutions of ODEs</article-title>
          . In: Proceedings of XIV Baikal International Conference, Baikal, Siberia, Russia,
          <year>July 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Feldman</surname>
            ,
            <given-names>L.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nazarova</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          :
          <article-title>Parallel algorithms for the numerical solution of Cauchy's problem for ordinary differential equation systems</article-title>
          .
          <source>In: Mathematical modelling</source>
          ,
          <year>2006</year>
          . Vol.
          <volume>18</volume>
          . No. 9. P.
          <volume>17</volume>
          {
          <fpage>31</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Iserles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Nørsett, S.P.:
          <article-title>On the theory of parallel Runge{Kutta methods</article-title>
          .
          <source>IMA J. Numer. Anal.</source>
          ,
          <year>1990</year>
          . Vol.
          <volume>10</volume>
          . P.
          <volume>463</volume>
          {
          <fpage>488</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          , Nørsett,
          <string-name>
            <surname>S.P.:</surname>
          </string-name>
          <article-title>The potential of parallelism in Runge{Kutta methods. Part 1: RK formulas in standard form</article-title>
          .
          <source>Report 239/90</source>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hairer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wanner</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <article-title>Nørsett</article-title>
          , S.P.:
          <string-name>
            <surname>Solving Ordinary Differential Equations</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>van Der Houwen</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sommeijer</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          :
          <article-title>Parallel iteration of high-order RungeKutta methods with stepsize control</article-title>
          .
          <source>In: Journal of Computational and Applied Mathematics</source>
          ,
          <year>1990</year>
          . Volume
          <volume>29</volume>
          . Issue 1. P.
          <volume>111</volume>
          {
          <fpage>127</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lie</surname>
            ,
            <given-names>I.:</given-names>
          </string-name>
          <article-title>Some aspects of parallel Runge-Kutta methods</article-title>
          ,
          <source>Report No. 3/87</source>
          , University of Trondheim,
          <source>Division Numerical Mathematics</source>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>van Der Houwen</surname>
          </string-name>
          , P.J.:
          <article-title>Parallel step-by-step methods</article-title>
          .
          <source>In: Applied Numerical Mathematics</source>
          ,
          <year>1993</year>
          . Volume
          <volume>29</volume>
          . Issue 1. P.
          <fpage>69</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rünger</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Load balancing schemes for extrapolation methods</article-title>
          .
          <source>In: Concurrency: Practice and Experience</source>
          ,
          <year>1997</year>
          . Volume
          <volume>9</volume>
          . Issue 3. P.
          <fpage>181</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Korch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scholtes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Scalability and locality of extrapolation methods on large parallel systems</article-title>
          .
          <source>In: Euro-Par 2010, Part II</source>
          , LNCS
          <volume>6272</volume>
          ,
          <year>2010</year>
          . P.
          <fpage>65</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ketcheson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>bin Waheed</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>A comparison of high-order explicit Runge-Kutta, extrapolation, and deferred correction methods in serial and parallel</article-title>
          .
          <source>In: Communications in Applied Mathematics and Computational Science</source>
          ,
          <year>2014</year>
          . Volume
          <volume>9</volume>
          . Issue 2. P.
          <fpage>175</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Chartier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philippe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A parallel shooting technique for solving dissipative ODEs</article-title>
          .
          <source>In: Computing</source>
          ,
          <year>1993</year>
          . Volume
          <volume>51</volume>
          . Issue 3. P.
          <fpage>209</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Bellen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zennaro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Parallel algorithms for initial-value problems for difference and differential equations</article-title>
          .
          <source>In: J. Comput. Appl. Math.</source>
          ,
          <year>1989</year>
          . Volume
          <volume>25</volume>
          . Issue 3. P.
          <fpage>341</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Lions</surname>
            ,
            <given-names>J.-L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maday</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turinici</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A parareal in time discretization of PDEs</article-title>
          .
          <source>In: C.R. Acad. Sci. Paris, Série I</source>
          ,
          <volume>332</volume>
          ,
          <year>2001</year>
          . P.
          <fpage>661</fpage>
          -
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>