-

Computing ψ-Caputo Fractional Derivative Values Using CUDA 10

0 V.M. Glushkov Institute of cybernetics of NAS of Ukraine , Kyiv , Ukraine

The paper addresses the issues of efficient GPU-implementation of ψ-Caputo fractional derivative values computation on NVIDIA GPU's with compute capability 7.5 using CUDA 10 SDK on both CUDA and OpenCL languages. We consider a three-dimensional time-fractional diffusion equation solved by a locally one-dimensional finite difference scheme. To compute nonlocal part of the derivative a rectangle rule quadrature is used and a summation algorithm of linear computational complexity is considered along with a constant complexity order approximating algorithm based on integral kernel expansion into series. For the approximating algorithm we present a computational scheme that uses NVidia GPU's tensor cores. For both algorithms, we study the influence of the used scalar and vector data types on performance and accuracy. Studying the summation algorithm, comparing to the usage of 64-bit doubleprecision floating-point data type, the computations were ~2 times faster for 32bit single-precision data type and ~3 times faster for 16-bit half-precision data type without significant loss of accuracy. For the approximated algorithm that was up to 5-times faster than the summation algorithm, the usage of lowprecision data types slightly influence the performance reducing the accuracy during long-term simulations. The usage of vectorized operations in the approximation algorithm allowed up to 6-19% speed-up compared with non-vectorized implementations for a single-precision data type. The usage of tensor cores that operate with a half-precision data type allowed performing calculations 12% faster compared to the case when the same data type was used.

GPU algorithms finite-difference method diffusion equation ψ-Caputo fractional derivative tensor cores data types CUDA OpenCL

Memory effects in diffusion processes can be efficiently simulated using timefractional differential equations [1-3].

Such equations contain the so-called fractional derivatives that are integraldifferential operators.

The need to numerically calculate integrals while solving time-fractional differential equations increase the computational complexity order compared to the traditional differential equations.

Approaches to lower computational complexity include parallel computing techniques [4,5,6], particularly using GPUs [5,6], and approximation of integral kernels [7,8,9].

In this paper we consider a three-dimensional diffusion equation with the generalized ψ-Caputo derivative with functional parameter solved by a locally onedimensional finite-difference scheme similarly to [10].

We focus on efficient GPU implementation of ψ-Caputo derivative’s values computation, which is one the most time-consuming operation while performing simulations, on NVidia GPUs using CUDA 10 SDK.

We study GPU algorithms’ performance when scalar and vector data types of different length and precision are used in CUDA and OpenCL algorithms’ implementations along with the possibility to speed up computations making use of 16x16 matrix-matrix multiplication operations that are hardware-implemented in the so-called tensor cores of the NVidia GPUs with compute capabilities 7.x. ( 1 ) ( 2 ) 2

Problem statement and finite-difference scheme

We consider the following three-dimensional time-fractional diffusion equation [10]:   2C( x, y, z,t)  2C( x, y, z,t) σDt(,gβ)C( x, y, z,t) = D x 2 + y 2 + F ( x, y, z,t), 0  x  1, 0  y  1, 0  z  1, t > 0, 0 < β  1 +  2C(xz,2y, z,t)  + where C ( x, y, z,t) is the diffusive substance concentration, σ is the porosity, D is the diffusion coefficient, F is the given source-term function, and Dt(,g ) is the generalized ψ-Caputo fractional derivative with respect to the time variable t functional parameter g (t) that has the following form [11]: with the Dt(,gβ)C(x, y, z,t) =

1 t C(x, y, z,t) (1 − β) 0 t

(g(t) − g(τ))− β dτ , g(t)  C1 0, +) , g (t)  0 (t  0), g(0) = 0 .

For the equation ( 1 ) we pose first-type initial and boundary conditions. The initial boundary value problem for the equation ( 1 ) is discretized using locally onedimensional finite-difference scheme [10,12] on the uniform grid

 = {(xi , y j , zk , tl ) : xi = ih1 (i = 0, n1 +1), y j = jh2 ( j = 0, n2 +1), zk = kh3 ( j = 0, n3 +1), tl = l where h1 , h2 , h3 ,τ are the steps with respect to the space and the time variables.

The finite-difference scheme, similarly to [10,12], is as follows:

Ci(jlk+2 / 3) −  D (Ci(,lj+−21,/k3) − 2Ci(jkl+2 / 3) + Ci(,lj++21,/k3) ) = C (l+1/ 3) +  ijk

Fij(kl) , Ci(jlk+1) − (Ci(,lj+,1k)−1 − 2Ci(jlk+1) + Ci(,lj+,1k)+1 ) = C(l+2 / 3) + ijk

Fij(kl) , (Ci(−l1+,1j/,3k) − 2Ci(jkl+1/ 3) + Ci(+l1+,1j/,3k) ) = Ci(jkl) +

Fij(kl) ,  3k1  3k1 3k1 Ci(jlk+1/ 3) −  D

k1h12 k1h22  D k1h32 ( 3 ) ( 4 ) (5) (6) (7)

, Fij(kl) = Fijk +

1 l−2 b(l) Ci(jks+1) − C(s) (1 −  ) s=0 s 

ijk , ts+1 bs(i) =  (g(ti ) − g( ))− d .

The equations ( 3 )-(5) can be with the addition of discretized forms of boundary conditions represented as three-diagonal linear equation systems, which are solved by the Thomas algorithm [12].

The summation in the right-hand side of (6) is the first-order discretization of a non-local part of the fractional derivative obtained applying the rectangle quadrature to the integral in ( 2 ). 3

Approximating algorithm for computing ψ-Caputo fractional derivative values

Computation of non-local part of fractional derivative according to (6) has O(l) computational complexity and requires storing all previously obtained solutions making it time and memory consuming while performing simulations over large time intervals. Hence, we consider the constant complexity order algorithm based on series expansion of the integrals (7) [13, 14] that requires storing solutions on only the two last time steps.

When g (t) has an infinitely differentiated inverse function f (t) , b (i) can be reps resented in the form [13,14]

   b(j) =   (−1)n  −n g(t j )− β −n Sn , s

n=0   Sn (ts ,ts+1 ) =    Bm m=0 f (m+1) ( g (ts+1)) 

, m!  g(ts+1) g(ts+1) Bm =  xn (x − g(ts+1))m dx, B0 =  xndx = g(ts ) g(ts )

1 n +1 Bi+1 = − n + i + 2 

 Bi − g(ts+1)(i +1)  g(ts )n+1(g(ts ) − g(ts+1))i+1 

 , g(ts+1)(i +1)  (g (ts+1)n+1 − g (ts )n+1),

Truncating the infinite series, the summation in (6) using such representation can be organized [14] as follows: l−2 b(l) Ci(jks+1) − Ci(jks)  1 K   s=0 s   n=0  (−1)n  −n  g(tl )− −n Sn,l−1 , Sn,l = Sn,l−1 + (C(l) − C(l−1) )Sn (tl−1 ,tl ), Sn,0 = 0. (8) where K is the given number of terms in series.

So, to compute in a specific node (i, j, k ) the value of the term Fij(kl) , which is the most time-consuming part when computing right-hand sides of three-diagonal linear equations systems ( 3 )-(5), we need to store the values of Sn,• and on the time step l • update Sn,l−1 into Sn,l according to the second equation in (8); • calculate the value of the sum according to the first equation in (8); • substitute the obtained value of the sum into (6) getting the value of Fij(kl) . 4

Parallel implementation

We perform the solution of linear systems ( 3 )-(5) using multithreaded OpenMP implementation with the calculation of a non-local part of the ψ-Caputo fractional derivative upon (6) or (8) performed for every node of the grid using GPU.

The calculations upon the series approximating algorithm (8) are organized following the scheme described in [15]: • the values of Sn(tl −1 ,tl ) are calculated on CPU on the step l −1 in parallel with GPU computations upon (8) and uploaded on the step l into GPU memory along with the solution C (l−1) ; • each GPU thread calculates the value of Fij(kl) for a specific node (i, j, k ) and, after GPU performed the computations, values of Fij(kl) are copied back into CPU memory.

To use GPUs local memory [15], we form thread groups of a controllable size  with computations organized in l /   substeps. On the substep s , the values of Sn (tl−1, tl ), sl /    n  (s + 1)l /   are loaded into local memory and a corresponding part of Sn,l is updated by each thread. Further, the resulting K -terms sum in (8) is computed with (−1)n  −n g(tl−2 )− −n in advance calculated in parallel and loaded into local memory.

To make use of vector data types present in CUDA and OpenCL languages, the summations in (8) are split into m-terms vector operations obtaining l−2 b(l) Ci(jks+1) − Ci(jks)  1 K / m s=0 s   n=0 (vn(1,m) ( Snm,l−1,..., S(n+1)m,l−1 )), vn(1,m) =  (−1)nm  n−m  g(tl )− −nm ,..., (−1)(n+1)m  (n +−1)  m  g(tl )− −(n+1)m  ( Snm,l ,..., S(n+1)m,l ) = ( Snm,l−1,..., S(n+1)m,l−1 )+ +(C(l) − C(l−1) ) ( Snm (tl−1 ,tl ),..., S(n+1)m (tl−1 ,tl )). (9)

The performance of the algorithm (9) was studied for 16-component doubleprecision vectors in [15].

Continuing this work, we consider the calculation upon (9) and the saving of Sn,l using floating-point data types of different precision (16, 32, and 64 bits) and different vector sizes ( 2, 4, 8, 16 ).

In all cases, we save the values of C and S n in a double-precision floating-point format in both CPU and GPU memories. The summation upon (6) can also be represented in vectorized form, and we consider it with the same data types as the algorithm (9).

As the newest NVidia GPUs implement 16x16 matrix-matrix (tensor) multiplication operations in hardware in the so-called tensor cores, we transform the summation in (8) to make use of them.

The considered tensor operation Tma ( A, B, C) multiplies two 16x16 half-precision matrices A , B and adds a 16x16 single-precision matrix C to the result of multiplication. The result of this operation also has a single-precision data type. Considering the vector of sums S (16) for 16 grid cells C( 1 ) ,...,C(16) , the summation part of (9) for K = 16 can be performed as follows: Sn,l coefficient value for the grid cell i , and ()1 denotes the first column of the matrix.

The algorithm (10) makes use only of one matrix-vector multiplication of 16 performed in Tma operation. 5

Numerical experiments

We consider the test problem for the equation ( 1 ) with g(t) = t γ and

F (x, y,z,t) =

Γ (1+ 2 / γ) Γ (1 − β + 2 / γ)

t 2−βγ − 2D(x 2 y 2 + y 2 z 2 + x 2 z 2 ) .

Then, the solution of ( 1 ) has the form C(x, y, z,t) = C0 (x, y, z,t) = x 2 y 2 z 2 + t 2 for the case of σ = 1 and the first-type initial and boundary conditions derived from C(x, y, z,t) .

The system ( 3 )-(5) was solved on 200 time steps of size τ = 0.0025 , γ = 0.8, β = 0.8 , grid size of N  N  N, N = 40,...,80 , and K = 16,...,64 on a single node of SCIT-4 cluster of VM Glushkov Institute of Cybernetics of NAS of Ukraine. The used node contains two Intel(R) Xeon(R) Bronze 3104 CPUs with total of 12 cores and NVidia RTX 2080 Ti GPU.

The algorithms were implemented in both CUDA and OpenCL languages and the corresponding source code can be accessed through https://github.com/sevaboh/FPDE_HPC.

5.1. Performance for a fixed N

The available set of data types for CUDA implementation was double, float, half, double[2,4], float[2,4], and half2.

In the case of OpenCL we considered the implemented in CUDA10 double, float, double[2,4,8,16], and float[2,4,8,16] data types. The times spent on computation upon (6), (9), and (10) on 200 iterations for N = 40, K = 16 and all the considered combinations of languages and data types are given in Table.1. In all experiments thread block size  was equal to K .

The times spent on a single iteration are near to equal for the approximating algorithm (9) and tend to increase linearly for the summation upon (6). This linear increase is non-monotone (Fig.1) due to incomplete GPU resources usage when t%K  0 where t is the time step number.

Doing calculations upon (6) in CUDA implementation, comparing to the usage of a double-precision data type, the algorithms that were implemented using data types of lower precision performed the computations ~2 times faster for 32-bit single-precision data type and ~3 times faster for 16-bit half-precision data type.

The usage of the approximating algorithm (9) up to ~5 times decreased the computation time compared to the summation upon (6).

Here the usage of a single-precision data type decreased the time only by <12% when the algorithm was implemented in CUDA.

This can be explained by a high amount of data type conversion operations in the implementations of (9) due to the storage of the solutions C and the coefficients S n using 64-bit double-precision data type.

The usage of a half-precision data type in CUDA implementation slowed the computations compared to the usage of a single-precision data type.

However, the half-precision implementation that uses tensor cores was the fastest of all the considered variants. The usage of a single-precision data type in OpenCL implementation slowed down the computations compared to the usage of a doubleprecision data type.

For K = 16 , vectorization accelerated computations only for the case of a singleprecision floating-point data type in CUDA implementation.

To explain the observed behavior, we compare PTX assembly sources generated by CUDA and OpenCL compilers with the generation of FMA (floating-point multiply-add) instruction enabled.

The following conclusions can be made from the performed comparison: • both compilers perform automatic unwinding of 16 loop iterations; • both compilers do not generate vector arithmetic instructions when vector data types are used; • both compilers generate vectorized 128-bit load/store instructions [ld,st].[shared,global].v2.f64 and [ld,st].[shared,global].v4.f32 and this is their usage that accelerated computations when vector data types are used; • when both compilers perform loop unwinding, memory is addressed through fixed offsets from the base for both shared and global memory (e.g. ld.global.f64 %fd77, [%rd23+40]). For the vectorized data types, offsets to global memory are computed for every memory access to the corresponding vector (e.g. mul.wide.s32 %rd28, %r67, 128; add.s64 %rd29, %rd3, %rd28; ld.global.v2.f64 {%fd172, %fd173}, [%rd29+112]; ld.global.v2.f64 {%fd176, %fd177}, [%rd29+96];...). For both algorithms (6) and (8), such additional operations slowed down the computations. However, the increase of K leads to the decrease of the number of auxiliary operations allowing obtaining up to 15% speed-up for double16 data type usage in OpenCL; • CUDA compiler automatically vectorizes the reading of 32-bit floating-point data stored in the local memory using 64-bit instruction ld.shared.v2.f32 making the calculation upon (9) faster than in the case of double-precision data type usage. The absence of such automatic vectorization in OpenCL makes the single-precision OpenCL implementation to perform slower than in the case of a double-precision data type. The same applies to half-precision data type usage in CUDA that yields slower code than for the case of a single-precision data type; •

While performing calculation upon (6), CUDA compiler implements division operation in div.rn instruction, while OpenCL compiler uses approximated div.full instruction making OpenCL code faster.

To check how the tendencies of vectorization efficiency observed for K = 16 changes for its bigger values, the series of computational experiments were conducted for N = 40 and K = 16,...,64 for the case of the approximating algorithm (9).

Here, automatic loop unwinding leads to the decrease of the number of auxiliary operations with the increase of K . The computation time also decreases by up to 15% for double16 data type usage in OpenCL, up to 5.3% for double8 data type usage in OpenCL, up to 2.6% for float16 data type usage in OpenCL, and up to 19% for float4 data type usage in CUDA. The usage of double[2,4] and float[2,8] data types in OpenCL slowed down the computations.

Thus, while OpenCL language supports larger vector data types than CUDA language (16 elements compared to 4 elements), its inefficient compilation by NVidia compiler leads to the situation when OpenCL vectorized implementations of the algorithm (9) allow accelerating computations only for large K and the largest vector data types.

5.2 Performance in the case of variable grid size

The influence of the time of fractional derivative values calculation on the total time spent on simulation along with the speed-ups of different GPU implementations are given in Table 2 for the summation upon (6) and in Table 3 for the approximating algorithm (9) for N = 40, ... , 80 .

CUDA float CUDA float4 to CUDA float

OpenCL float to CUDA float

N Percentage of summation to total time Total time Summation time Percentage of summation to total time Percentage of auxiliary hostGPU operations Summation time Summation time

40 31,93% 40,74% 60

80 36,64%

36,41% 43,58%

51,15% 939,46%

1374,70% 1568,14% 4,32% 3,57%

3,30% 28,62% 22,25%

22,22% 7,52% 5,40%

8,80%

In the algorithm (6), time spent on fractional derivative values computation (summation time) comprises up to 36% of total computation time on CPU. This value lowers to ~4% on GPU when this operation became 9-30 times faster and the total computation time became ~50% less. As can be seen from the Table 2, GPU algorithms’ efficiency increases with the increase of N while the efficiency of vectorization remains on the same level. For the approximating algorithm (9) and K = 16 , fractional derivative values computation comprises up to 40% of the total computation time while executing on CPU with slight decrease of this percentage with the increase of N . GPUs efficiency here is close to the case of the summation upon (6): 15-27 times acceleration of the calculations upon the algorithm (9), 51-63% decrease of the total computation time compared to CPU, ~3% of time spent by GPU algorithms for fractional derivative values computation. Comparing to the slower algorithm (6), the time spent on auxiliary CPU-GPU memory copying operations while calling GPU kernels is here bigger: ~60% compared to ~25%. Efficiency of vectorization do not depend on N with the usage of tensor cores yielding the highest 12% increase compared with the usage of a half-precision data type and 8% increase compared to the usage of a single-precision data type. Efficiency of the usage of tensor operations compared to the fastest case of vectorized float4 implementation however is only 2%. CCUUDDAA ffllooaatt4 to Kernel execution time COUpeDnACLfloflaotat to Kernel execution time fCloUaDtA half to CUDA Kernel execution time CCUUDDAA fhlaolaft2 to Kernel execution time cCoUreDsA,tohaClUf, DteAnsfolroat Kernel execution time cCoUreDsAtohCaUlf,DteAnshoarlf Kernel execution time 40 60

80 39,85% 38,04%

36,85% Changes in maximal square absolute error on the iterations while using summation upon (6) and series expansion upon (9) for different data types are given in Fig. 2. Lowering of data type precision has a little influence on the quality of solution for the summation upon (6) due to monotone increase of b (i) when i → s that minimizes s summation errors. The error for the algorithm (6) decreases with the increase of time step number.

0,0018 0,0016 0,0014 0,0012 0,001 0,0008 0,0006 0,0004 0,0002 0 1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 129 137 145 153 161 169 177 185 193 Full summation (all data types) Series expansion (half, tensor cores)

Series expansion (float,double)

Series expansion (half)

Close values of errors between the summation upon (6) and the approximating algorithm (9) with K = 16 on the initial time steps show the adequacy of a chosen value of K . However, for the algorithm (9), the recurrently changing values of Sn,l overflow the data type values range starting from some time step leading to linear increase of error. In the conducted experiments such behavior was observed for a half- and single-precision data types. 6

Conclusions

Calculation of time-fractional ψ-Caputo derivatives’ values significantly influence time needed to solve fractional derivative equations that contain them. For the considered case of the diffusion equation solved using a finite-difference scheme, this operation comprises up to 40% of time spent while performing calculation on CPU. Its parallelization on GPU is efficient because the computations on the cells of the grid are independent.

We studied additional possibilities to speedup GPU implementations considering the summation algorithm of linear computational complexity and the approximating recurrent algorithm of constant complexity. Considering the usage of different precision floating-point data types, their vectors, and matrix-matrix multiplication operations implemented in tensor cores of the latest NVidia GPUs the following conclusion can be made: • for the summation algorithm (6), comparing to the usage of a double-precision data type, the computations were ~2 times faster using 32-bit single-precision data type and ~3 times faster using 16-bit half-precision data type without significant loss of accuracy; • for the approximating algorithm (9) that was up to 5-times faster than the summation algorithm, the usage of different data types slightly influence the performance due to the high amount of data type conversion operations. However, the usage of low-precision data types here leads to linear increase of solution error starting from some time step due to data type overflow; • Inefficient compilation of vectorized memory access in OpenCL code leads to the situation when OpenCL vectorized implementations of the algorithm (9) allow accelerating computations only for large number K of terms in truncated series and the largest vector data types. Thus, the usage of CUDA is preferable for the approximating algorithm (9). However, for the summation algorithm (6), OpenCL implementations were faster due to the differences in compiling the division arithmetic operation; • GPU implementations 9-30 times accelerate fractional derivative values computation with the efficiency that increases with the increase of grid size; • the usage of tensor cores that operate with 16-bit half-precision floating-point data type allowed performing calculations 12% faster than the usage of a half-precision data type without vectorization. However, comparing to the efficiency of the fastest vectorized implementation that uses float4 data type, only 2% acceleration was obtained. 5. Liu, J., Gong, C., Bao, W., Tang, G., Jiang, Y.: Solving the Caputo Fractional ReactionDiffusion Equation on GPU. Discrete Dynamics in Nature and Society 2014, 820162 (2014). doi: 10.1155/2014/820162 6. Golev, A., Penev, A., Stefanova, K., Hristova, S.: Using GPU to speed up calculation of some approximate methods for fractional differential equations. International Journal of Pure and Applied Mathematics 119( 3 ), 391-401 (2018). doi: 10.12732/ijpam.v119i3.1 7. Gong, C., Bao, W., Liu, J.: A piecewise memory principle for fractional derivatives. Fract.

Calc. Appl. Anal. 20( 4 ), 1010–1022 (2017). doi: 10.1515/fca-2017-0052 8. Ford, N.J., Simpson, A.C.: The numerical solution of fractional differential equations: Speed versus accuracy. Numerical Algorithms, 26( 4 ), 333–346 (2001). doi: 10.1023/A:1016601312158 9. Baffet, D., Hesthaven, J.S.: A kernel compression scheme for fractional differential equations. Siam J. Numer. Anal 55( 2 ), 496–520 (2017). doi: 10.1137/15M1043960 10. Bogaenko, V.A., Bulavatsky V.М.: Computer modeling of the dynamics of migration processes of soluble substances in the case of groundwater filtration with free surface on the base of the fractional derivative approach (in Russian). Dopov. Nac. Akad. Nauk Ukr. 12, 21–29 (2018). doi: 10.15407/dopovidi2018.12.021 11. Almeida, R.: A Caputo fractional derivative of a function with respect to another function.

Communications in Nonlinear Science and Numerical Simulation 44, 460–481 (2017). doi: 10.1016/j.cnsns.2016.09.006 12. Samarskii A.: The Theory of Difference Schemes. CRC Press, New York (2001). 13. Bohaienko, V.O.: A fast finite-difference algorithm for solving space-fractional filtration equation with a generalised Caputo derivative. Computational and Applied Mathematics 38( 3 ), 105 (2019). doi: 10.1007/s40314-019-0878-5 14. Bohaienko V.O.: Efficient computation schemes for generalized two-dimensional timefractional diffusion equation. In: Papers of the International scientific and practical conference ІТСМ – 2019, Ivano-Frankivsk, Ukraine, pp. 238–241 (2019). 15. Bohaienko, V.O.: Performance of vectorized GPU-algorithm for computing ψ-Caputo derivative values. In: Hu Z., Petoukhov S., Dychka I., He M. (eds) Advances in Computer Science for Engineering and Education III. ICCSEEA 2020. Advances in Intelligent Systems and Computing, vol 1247. Springer, Cham (2021). https://doi.org/10.1007/978-3030-55506-1_24

1. Gorenflo , R. , Mainardi , F. : Fractional calculus: integral and differential equations of fractional order . In: Fractals and Fractional Calculus in Continuum Mechanics , pp. 223 - 276 , Springer Verlag, Wien, Austria ( 1997 )

2. Podlubny , I. : Fractional differential equations , Academic Press, New York ( 1999 )

3. Bulavatsky , V.M. : Mathematical modeling of dynamics of the process of filtration convection diffusion under the condition of time nonlocality . Journal of Automation and Information Science 44 ( 4 ), 13 - 22 ( 2012 ). doi: 10 .1615/JAutomatInfScien.v44. i4 . 20

4. Bonchis , C. , Kaslik , E. , Rosu , F. : HPC optimal parallel communication algorithm for the simulation of fractional-order systems . J Supercomput 75 , 1014 - 1025 ( 2019 ). doi: 10 .1007/s11227-018-2267-z