On the Efficient Parallel Computing of Long Term Reliable Trajectories for the Lorenz System

Ivan Hristov1, Radoslava Hristova1, Stefka Dimova1, Peter Armyanov1, Nikolay Shegunov1, Igor Puzynin2, Taisia Puzynina2, Zarif Sharipov2, Zafar Tukhliev2

1 Sofia University, Faculty of Mathematics and Informatics, Bulgaria
2 JINR, Laboratory of Information Technologies, Dubna, Russia
ivanh@fmi.uni-sofia.bg, zarif@jinr.ru

Abstract. In this work we propose an efficient parallelization of the multiple-precision Taylor series method with variable stepsize and fixed order. For a given level of accuracy, the optimal variable stepsize leads to a higher order of the method than the optimal fixed stepsize does. Although the order of the method is therefore greater than in the fixed-stepsize case, and hence the computational work per step is greater, the reduced number of steps gives less overall work. In addition, the greater order of the method is beneficial in the sense that it increases the parallel efficiency. As a model problem we use the paradigmatic Lorenz system. With 256 CPU cores of the Nestum cluster, Sofia, Bulgaria, we succeeded in obtaining a correct reference solution in the rather long time interval [0,11000]. To get this solution we performed two large computations: one with 4566 decimal digits of precision and a 5240-th order method, and a second one for verification, with 4778 decimal digits of precision and a 5490-th order method.

Keywords: Parallel Computing, Multiple Precision, Variable Stepsize Taylor Series Method, Lorenz System.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The multiple-precision Taylor series method is an affordable and very efficient numerical method for the integration of some classes of low-dimensional dynamical systems when high precision is required [1], [2]. The method gives a new powerful tool for the theoretical investigation of such systems.

A numerical procedure for computing reliable trajectories of chaotic systems, called Clean Numerical Simulation (CNS), was proposed by Shijun Liao in [3] and applied to different systems [4], [5], [6]. The procedure is based on the multiple-precision Taylor series method. The main concept of CNS is the critical predictable time Tc, which is a kind of practical Lyapunov time. Tc is defined as the time for decoupling of two trajectories computed by two different numerical schemes. The CNS works as follows. An optimal fixed stepsize is chosen. Then estimates of the required order of the method N and of the required precision (the number of exact decimal digits K of the floating-point numbers) are obtained. The optimal order N is estimated by computing the Tc – N dependence by means of numerical solutions for a fixed, large enough K. The estimate of K is obtained by computing the Tc – K dependence by means of numerical solutions for a fixed, large enough N. This estimate of K is in fact an estimate for the Lyapunov exponent [7]. For a given Tc, the solution is then computed with the estimated N and K, and after that one more computation with higher N and K is performed for verification. The choice of N and K ensures that the round-off error and the truncation error are of the same order. When very high precision and a very long integration interval are needed, the computational problem can become large.
In this case, the parallelization of the Taylor series method is an important task and needs to be carefully developed. The first parallelization of CNS is reported in [8] and later improved in [9]. A rather long reference solution for the paradigmatic Lorenz system, namely in the time interval [0,10000], obtained in about 9 days and 5 hours by using the computational resource of 1200 CPU cores, is given in [10]. However, no details of the parallelization process are given in [8], [9], [10]. In our recent work [11] we reported in detail a simple and efficient hybrid MPI + OpenMP parallelization of CNS for the Lorenz system and tested it for the same parameters as those in [10]. The results show very good efficiency and very good parallel performance scalability of our program.

This work can be regarded as a continuation of our previous work [11], where a fixed stepsize is used. Here we make a modification of CNS with a variable stepsize and fixed order, following the simple approach given in [12]. Although the order of the method is greater than in the fixed-stepsize case, and hence the computational work per step is greater, the reduced number of steps gives less overall work. In addition, the greater order of the method is beneficial in the sense that it increases the parallel efficiency. With 256 CPU cores of the Nestum cluster, Sofia, Bulgaria, we succeeded in obtaining a correct reference solution in [0,11000] and in this way we improve the results from [10]. To obtain this solution we performed two large computations: one with 4566 decimal digits of precision and a 5240-th order method, and a second one for verification, with 4778 decimal digits of precision and a 5490-th order method. The computations lasted ≈ 9 days and 18 hours and ≈ 11 days and 7 hours, respectively. Let us note that the improvement of the numerical algorithm does not change the parallelization strategy from our previous work [11], where the parallelization process is explained in more detail. The only difference from the previous parallel program is one additional OpenMP single section with negligible computational work, which computes the optimal stepsize.

It is important to mention that although our test model is the classical Lorenz system, the proposed parallelization strategy is rather general: it could be applied as well to a large class of chaotic dynamical systems.

2 Taylor series method and CNS for the Lorenz system

We consider as a model problem the classical Lorenz system [13]:

dx/dt = σ(y − x),
dy/dt = Rx − y − xz,      (1)
dz/dt = xy − bz,

where R = 28, σ = 10, b = 8/3 are the standard Saltzman parameters. For these parameters the system is chaotic. Let us denote with x_i, y_i, z_i, i = 0, ..., N the normalized derivatives (the derivatives divided by i!) of the approximate solution at the current time t. Then the N-th order Taylor series method for (1) with stepsize τ is:

x(t + τ) ≈ Σ_{i=0}^{N} x_i τ^i,   y(t + τ) ≈ Σ_{i=0}^{N} y_i τ^i,   z(t + τ) ≈ Σ_{i=0}^{N} z_i τ^i.      (2)

The i-th Taylor coefficients (the normalized derivatives) are computed as follows. From system (1) we have

x^(i+1) = σ(y^(i) − x^(i)),   y^(i+1) = R x^(i) − y^(i) − (xz)^(i),   z^(i+1) = (xy)^(i) − b z^(i).

By applying the Leibniz rule for the derivatives of the product of two functions, we obtain the following recursive procedure for computing x_{i+1}, y_{i+1}, z_{i+1} for i = 0, ..., N−1:

x_{i+1} = σ(y_i − x_i) / (i + 1),
y_{i+1} = (R x_i − y_i − Σ_{j=0}^{i} x_j z_{i−j}) / (i + 1),      (3)
z_{i+1} = (Σ_{j=0}^{i} x_j y_{i−j} − b z_i) / (i + 1).

To compute the (i+1)-st coefficients in the Taylor series we need all previous coefficients from 0 to i. This algorithm for computing the coefficients of the Taylor series is in fact called automatic differentiation, or also algorithmic differentiation [14]. It is obvious that we need O(N²) floating-point operations for computing all the coefficients. The subsequent evaluation of the Taylor series with Horner's rule needs only O(N) operations.
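To make the recursion (3) and the Horner evaluation of (2) concrete, the following serial, double-precision C sketch performs one step of the method. It is only an illustration under assumed names (taylor_step and its parameter list are ours); the actual program works with GMP multiple-precision numbers and parallelizes the two convolution sums, as described in Section 3.

#define SIGMA 10.0
#define R     28.0
#define B     (8.0 / 3.0)

/* One step of the N-th order Taylor series method for the Lorenz system,
   in plain double precision. On entry x[0], y[0], z[0] hold the current
   solution; on exit they hold the solution at t + tau. x[i], y[i], z[i]
   are the normalized derivatives (Taylor coefficients). */
static void taylor_step(double *x, double *y, double *z, int N, double tau)
{
    /* Automatic differentiation: recursion (3), O(N^2) operations */
    for (int i = 0; i < N; i++) {
        double sxz = 0.0, sxy = 0.0;          /* the two sums in (3) */
        for (int j = 0; j <= i; j++) {
            sxz += x[j] * z[i - j];
            sxy += x[j] * y[i - j];
        }
        x[i + 1] = SIGMA * (y[i] - x[i]) / (i + 1);
        y[i + 1] = (R * x[i] - y[i] - sxz) / (i + 1);
        z[i + 1] = (sxy - B * z[i]) / (i + 1);
    }

    /* Evaluation of the Taylor polynomials (2) by Horner's rule, O(N) */
    double xn = x[N], yn = y[N], zn = z[N];
    for (int i = N - 1; i >= 0; i--) {
        xn = x[i] + tau * xn;
        yn = y[i] + tau * yn;
        zn = z[i] + tau * zn;
    }
    x[0] = xn;  y[0] = yn;  z[0] = zn;
}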
Let us now explain how we choose the stepsize τ. We use a variable stepsize strategy, which makes the method much more robust than in the fixed-stepsize case. We use a simple strategy taken from [12], which ensures both the convergence of the Taylor series and the minimization of the computational work per unit time. If we denote the vector of the normalized derivatives of the solution by X_i = (x_i, y_i, z_i) and take a safety factor 0.993, then the stepsize τ is determined by the last two terms of the Taylor expansions [12]:

τ = 0.993 min{ (1 / ‖X_{N−1}‖_∞)^{1/(N−1)}, (1 / ‖X_N‖_∞)^{1/N} }.      (4)

Fig. 1. Tc – N dependencies for fixed and variable stepsize.

In [12] the order of the method is determined by the local error tolerance. However, we do not work explicitly with a local error tolerance, nor do we use any explicit dependence between the local and the global error. Instead, as in [3], we compute an a priori estimate of the order of the method needed for a reliable solution. As said before, the critical predictable time Tc is defined as the time for decoupling of two trajectories computed by two different numerical schemes (in this case, with different N). The solutions are computed with large enough precision to ensure that the truncation error is the leading one. As a criterion for the decoupling time, we take the time up to which only 30 correct digits are preserved. The obtained Tc – N dependencies for fixed stepsize τ = 0.01 and for variable stepsize are shown in Figure 1. As seen from this figure, the computational work per step in the variable-stepsize case is ≈ 80% greater than in the fixed-stepsize case: (2.98/2.22)² ≈ 1.80. However, the reduced number of steps gives less overall work. In addition, the greater order of the method is beneficial in the sense that it increases the parallel efficiency. The reason is that, as the order N of the method increases, the parallelizable part of the work grows relative to the serial part and to the parallel overhead. Similarly, we compute an a priori estimate of the needed precision by means of the Tc – K dependence. In this case we compare the solutions for different K and a large enough N. We obtain the dependence Tc = 2.55K − 81, which, as expected, is the same for fixed and for variable stepsize.

3 Parallelization of the algorithm

The improvement of the numerical algorithm does not change the parallelization strategy from our previous work [11], where the parallelization process is explained in more detail. However, as we will see, the variable stepsize not only decreases the computational work for a given accuracy, but also gives a higher parallel efficiency.

Let us store the Taylor coefficients in the arrays x, y, z of length N+1. The values of x_i are stored in x[i], those of y_i in y[i] and those of z_i in z[i]. As explained in [8], [9], the crucial decision for the parallelization is to make a parallel reduction for the two sums in (3). However, in order to reduce the remaining serial part of the code, and hence to improve the parallel speedup predicted by Amdahl's law, we should also exploit some limited but important parallelism. We compute x[i+1], y[i+1], z[i+1] in parallel. Moreover, we compute x[i+1] in advance, before the computation of the sums in (3) is finished, while during the reduction process some of the computational resource is free. In the same way we compute in advance Rx[i] − y[i] from the formula for y[i+1] and bz[i] from the formula for z[i+1].
These computations are taken in advance because multiplication is much more expensive than the other operations used; for example, the division by an integer number is comparatively cheap. The three evaluations by Horner's rule for the new x[0], y[0], z[0] are also done in parallel.

In this work we consider a hybrid MPI + OpenMP strategy [15], [16], i.e. every MPI process creates a team of OpenMP threads. For the multiple-precision floating-point arithmetic we use the GMP library (GNU Multiple Precision Arithmetic Library) [17]. The main reason to consider a hybrid strategy, rather than a pure MPI one, is that OpenMP performs slightly better than MPI within one computational node. For packing and unpacking of the GMP multiple-precision types for the MPI messages we rely on the tiny MPIGMP library of Tomonori Kouya [18], [19], [20], [21]. It is important to note that for our problem the pure OpenMP parallelization has its own importance. First, programming with OpenMP is easier, because it avoids the usage of libraries like MPIGMP. Second, since the algorithm does not allow domain decomposition, the memory needed on one computational node is multiplied by the number of MPI processes on that node, while OpenMP needs only one copy of the computational domain and thus some memory is saved.

The sketch of our parallel program is given in Figure 2. Every thread gets its id and stores it in tid, and then the loop with index i is performed. Every MPI process takes its portion: the first and the last index controlled by the process. After that the directive #pragma omp for shares the work of the loop between the threads. Although OpenMP has a built-in reduction clause, we cannot use it, because we use user-defined types for the multiple-precision numbers and user-defined operations. Instead, a manual reduction is done by applying a standard tree-based parallel reduction. We use containers for the partial sums of every thread and these containers are shared. The containers are stored in the array sum. We have in addition an array of temporary variables tempv for storing the intermediate results of the multiplications. To avoid false sharing, a padding strategy is applied [16]. At the point where each process has computed its partial sums, we perform MPI_ALLREDUCE between the master threads [15]. It is useful to regard MPI_ALLREDUCE as a continuation of the tree-based reduction process, which starts with the OpenMP reduction. The communications between the master threads are overlapped with some computations for x[i+1], y[i+1], z[i+1] that can be done before the computation of the sums in (3) is finished. When the MPI_ALLREDUCE is finished, we compute in parallel the remaining operations for x[i+1], y[i+1], z[i+1].

Between the block which computes the Taylor coefficients and the block which computes the new values of x[0], y[0], z[0] in parallel, we compute the new optimal stepsize within an omp single section. While the block for computing the Taylor coefficients is O(N²) and the block for the evaluation of the polynomials is O(N), this block is only O(1) and hence its work is negligible. Let us note that the GMP library does not offer a power function for the computations in formula (4). The good thing is that we do not need to compute the stepsize with multiple precision: double precision is enough. Therefore we use the C standard library function pow in double precision. We perform a normalization of the large GMP floating-point numbers in order to work within the range of the standard double-precision numbers.
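As an illustration of this stepsize computation, the following double-precision sketch evaluates formula (4). It assumes that the infinity norms of the last two coefficient vectors are available as GMP floats and splits them into mantissa and exponent with mpf_get_d_2exp, so that the powers can be taken safely with pow; the helper name compute_stepsize and this particular normalization are our own illustration, not necessarily the exact code from [22].

#include <math.h>
#include <gmp.h>

/* Stepsize from formula (4) in double precision. XNm1 and XN are the
   infinity norms of the coefficient vectors X_{N-1} and X_N, stored as
   GMP floats. Each norm is represented as m * 2^e with m in [0.5, 1),
   so that (m * 2^e)^(-1/k) = m^(-1/k) * 2^(-e/k) stays in the double range. */
static double compute_stepsize(const mpf_t XNm1, const mpf_t XN, long N)
{
    long e1, e2;
    double m1 = mpf_get_d_2exp(&e1, XNm1);    /* XNm1 = m1 * 2^e1 */
    double m2 = mpf_get_d_2exp(&e2, XN);      /* XN   = m2 * 2^e2 */

    double t1 = pow(m1, -1.0 / (N - 1)) * pow(2.0, -(double)e1 / (N - 1));
    double t2 = pow(m2, -1.0 / N)       * pow(2.0, -(double)e2 / N);

    return 0.993 * fmin(t1, t2);              /* safety factor 0.993, see (4) */
}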
The C code, in terms of the GMP library, of our hybrid MPI + OpenMP program can be downloaded from [22].

Let us mention that if one half of the OpenMP threads computes one of the sums in (3) and the other half computes the other sum, one could also expect some small performance benefit, because for the small indexes i fewer threads would be unused and the deviation from the perfect load balance between the threads would be smaller. However, this approach is not general, because it strongly depends on the number of sums to be reduced (two in the particular case of the Lorenz system) and on the number of available threads.

Fig. 2. Sketch of the hybrid MPI + OpenMP parallel program.
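Since the full listing is given in Figure 2 and in [22], the following OpenMP-only fragment sketches just the reduction structure described above for the two sums in (3), in double precision. The GMP arithmetic, the MPI_ALLREDUCE between master threads, the padding against false sharing and the overlapped advance computations are all omitted, and the helper name convolve_step is an assumption of ours rather than a name taken from the actual program.

#include <stdlib.h>
#include <omp.h>

/* Compute the two convolution sums in (3) for a given index i:
   sxz = sum_{j=0}^{i} x[j]*z[i-j],  sxy = sum_{j=0}^{i} x[j]*y[i-j]. */
void convolve_step(const double *x, const double *y, const double *z,
                   int i, double *sxz, double *sxy)
{
    int nthreads = omp_get_max_threads();
    double *sum_xz = calloc(nthreads, sizeof *sum_xz);  /* shared containers */
    double *sum_xy = calloc(nthreads, sizeof *sum_xy);  /* for partial sums  */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        double local_xz = 0.0, local_xy = 0.0;

        /* Each thread accumulates its portion of the two sums */
        #pragma omp for
        for (int j = 0; j <= i; j++) {
            local_xz += x[j] * z[i - j];
            local_xy += x[j] * y[i - j];
        }
        sum_xz[tid] = local_xz;
        sum_xy[tid] = local_xy;
        #pragma omp barrier

        /* Combine the per-thread partial sums (the real program uses a
           tree-based scheme and continues the reduction across the MPI
           processes with MPI_ALLREDUCE between the master threads) */
        #pragma omp single
        {
            double total_xz = 0.0, total_xy = 0.0;
            for (int t = 0; t < omp_get_num_threads(); t++) {
                total_xz += sum_xz[t];
                total_xy += sum_xy[t];
            }
            *sxz = total_xz;
            *sxy = total_xy;
        }
    }

    free(sum_xz);
    free(sum_xy);
}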