Computing ψ-Caputo Fractional Derivative Values Using CUDA 10

Vsevolod Bohaienko [0000-0002-3317-9022]

V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, Kyiv, Ukraine
sevab@ukr.net

Abstract. The paper addresses the issues of efficient GPU implementation of ψ-Caputo fractional derivative values computation on NVidia GPUs with compute capability 7.5 using the CUDA 10 SDK in both the CUDA and OpenCL languages. We consider a three-dimensional time-fractional diffusion equation solved by a locally one-dimensional finite-difference scheme. To compute the non-local part of the derivative, a rectangle rule quadrature is used, and a summation algorithm of linear computational complexity is considered along with a constant complexity order approximating algorithm based on the expansion of the integral kernel into series. For the approximating algorithm we present a computational scheme that uses NVidia GPU tensor cores. For both algorithms, we study the influence of the used scalar and vector data types on performance and accuracy. For the summation algorithm, compared with the usage of the 64-bit double-precision floating-point data type, the computations were ~2 times faster for the 32-bit single-precision data type and ~3 times faster for the 16-bit half-precision data type without significant loss of accuracy. For the approximating algorithm, which was up to 5 times faster than the summation algorithm, the usage of low-precision data types only slightly influenced the performance while reducing the accuracy during long-term simulations. The usage of vectorized operations in the approximating algorithm allowed a 6-19% speed-up compared with non-vectorized implementations for a single-precision data type. The usage of tensor cores that operate with a half-precision data type allowed performing calculations 12% faster compared to the case when the same data type was used without tensor cores.

Keywords: GPU algorithms, finite-difference method, diffusion equation, ψ-Caputo fractional derivative, tensor cores, data types, CUDA, OpenCL.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). ICST-2020

1 Introduction

Memory effects in diffusion processes can be efficiently simulated using time-fractional differential equations [1-3]. Such equations contain the so-called fractional derivatives that are integro-differential operators. The need to numerically calculate integrals while solving time-fractional differential equations increases the computational complexity compared to traditional differential equations. Approaches to lowering the computational complexity include parallel computing techniques [4,5,6], particularly the usage of GPUs [5,6], and the approximation of integral kernels [7,8,9].

In this paper we consider a three-dimensional diffusion equation with the generalized ψ-Caputo derivative with a functional parameter, solved by a locally one-dimensional finite-difference scheme similarly to [10]. We focus on the efficient GPU implementation of ψ-Caputo derivative values computation, which is one of the most time-consuming operations while performing simulations, on NVidia GPUs using the CUDA 10 SDK.
We study GPU algorithms' performance when scalar and vector data types of different length and precision are used in the CUDA and OpenCL implementations, along with the possibility to speed up computations by making use of the 16x16 matrix-matrix multiplication operations that are hardware-implemented in the so-called tensor cores of NVidia GPUs with compute capability 7.x.

2 Problem statement and finite-difference scheme

We consider the following three-dimensional time-fractional diffusion equation [10]:

\sigma D_{t,g}^{(\beta)} C(x,y,z,t) = D \left( \frac{\partial^2 C(x,y,z,t)}{\partial x^2} + \frac{\partial^2 C(x,y,z,t)}{\partial y^2} + \frac{\partial^2 C(x,y,z,t)}{\partial z^2} \right) + F(x,y,z,t),   (1)

0 < x < 1, \quad 0 < y < 1, \quad 0 < z < 1, \quad t > 0, \quad 0 < \beta \le 1,

where C(x,y,z,t) is the diffusive substance concentration, σ is the porosity, D is the diffusion coefficient, F is the given source-term function, and D_{t,g}^{(\beta)} is the generalized ψ-Caputo fractional derivative with respect to the time variable t with the functional parameter g(t) that has the following form [11]:

D_{t,g}^{(\beta)} C(x,y,z,t) = \frac{1}{\Gamma(1-\beta)} \int_0^t (g(t) - g(\tau))^{-\beta} \frac{\partial C(x,y,z,\tau)}{\partial \tau} \, d\tau,   (2)

g(t) \in C^1[0,+\infty), \quad g'(t) > 0 \ (t \ge 0), \quad g(0) = 0.

For the equation (1) we pose first-type initial and boundary conditions. The initial-boundary value problem for the equation (1) is discretized using a locally one-dimensional finite-difference scheme [10,12] on the uniform grid

\omega = \{ (x_i, y_j, z_k, t_l) : \ x_i = i h_1 \ (i = 0,...,n_1+1), \ y_j = j h_2 \ (j = 0,...,n_2+1), \ z_k = k h_3 \ (k = 0,...,n_3+1), \ t_l = l \tau \},

where h_1, h_2, h_3, \tau are the steps with respect to the space and the time variables. The finite-difference scheme, similarly to [10,12], is as follows:

C_{ijk}^{(l+1/3)} - \frac{D}{k_1 h_1^2} \left( C_{i-1,j,k}^{(l+1/3)} - 2 C_{ijk}^{(l+1/3)} + C_{i+1,j,k}^{(l+1/3)} \right) = C_{ijk}^{(l)} + \frac{1}{3 k_1} F_{ijk}^{(l)},   (3)

C_{ijk}^{(l+2/3)} - \frac{D}{k_1 h_2^2} \left( C_{i,j-1,k}^{(l+2/3)} - 2 C_{ijk}^{(l+2/3)} + C_{i,j+1,k}^{(l+2/3)} \right) = C_{ijk}^{(l+1/3)} + \frac{1}{3 k_1} F_{ijk}^{(l)},   (4)

C_{ijk}^{(l+1)} - \frac{D}{k_1 h_3^2} \left( C_{i,j,k-1}^{(l+1)} - 2 C_{ijk}^{(l+1)} + C_{i,j,k+1}^{(l+1)} \right) = C_{ijk}^{(l+2/3)} + \frac{1}{3 k_1} F_{ijk}^{(l)},   (5)

k_1 = \frac{b_l^{(l+1)}}{\Gamma(1-\beta) \, \tau},

F_{ijk}^{(l)} = F_{ijk} + \frac{1}{\Gamma(1-\beta)} \sum_{s=0}^{l-2} b_s^{(l)} \frac{C_{ijk}^{(s+1)} - C_{ijk}^{(s)}}{\tau},   (6)

b_s^{(i)} = \int_{t_s}^{t_{s+1}} (g(t_i) - g(\tau))^{-\beta} \, d\tau.   (7)

The equations (3)-(5), with the addition of the discretized forms of the boundary conditions, can be represented as three-diagonal linear equation systems, which are solved by the Thomas algorithm [12]. The summation in the right-hand side of (6) is the first-order discretization of the non-local part of the fractional derivative obtained by applying the rectangle quadrature to the integral in (2).
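For illustration, the per-node evaluation of the memory term in (6) can be sketched as follows. This is a minimal sketch with illustrative names, not the authors' implementation: the weights b_s^{(l)} from (7) are assumed to be precomputed, and the full solution history of the node is assumed to be stored.

// Sketch: direct O(l) evaluation of the memory term in (6) for one grid node.
// b[s] holds the precomputed weights b_s^{(l)} from (7), c_hist[s] the stored
// values C_{ijk}^{(s)} in this node, l is the current time step number,
// tau the time step size, gamma1mb = Gamma(1-beta). Names are illustrative.
__host__ __device__
double memory_term(const double *b, const double *c_hist, int l,
                   double tau, double gamma1mb)
{
    double sum = 0.0;
    for (int s = 0; s <= l - 2; ++s)          // summation over the history
        sum += b[s] * (c_hist[s + 1] - c_hist[s]) / tau;
    return sum / gamma1mb;                    // 1/Gamma(1-beta) factor in (6)
}

The value returned by such a routine is added to the source term F_{ijk} to obtain F_{ijk}^{(l)} in (6).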
3 Approximating algorithm for computing ψ-Caputo fractional derivative values

Computation of the non-local part of the fractional derivative according to (6) has O(l) computational complexity and requires storing all previously obtained solutions, making it time- and memory-consuming while performing simulations over large time intervals. Hence, we consider the constant complexity order algorithm based on series expansion of the integrals (7) [13,14] that requires storing solutions on only the two last time steps.

When g(t) has an infinitely differentiable inverse function f(t), b_s^{(j)} can be represented in the form [13,14]

b_s^{(j)} = \sum_{n=0}^{\infty} (-1)^n \binom{-\beta}{n} g(t_j)^{-\beta-n} S_n,

S_n(t_s, t_{s+1}) = \sum_{m=0}^{\infty} B_m \frac{f^{(m+1)}(g(t_{s+1}))}{m!},

B_m = \int_{g(t_s)}^{g(t_{s+1})} x^n (x - g(t_{s+1}))^m \, dx, \quad B_0 = \int_{g(t_s)}^{g(t_{s+1})} x^n \, dx = \frac{1}{n+1} \left( g(t_{s+1})^{n+1} - g(t_s)^{n+1} \right),

B_{i+1} = -\frac{(i+1) \, g(t_{s+1})}{n+i+2} B_i - \frac{g(t_s)^{n+1} (g(t_s) - g(t_{s+1}))^{i+1}}{n+i+2}.

Truncating the infinite series, the summation in (6) using such a representation can be organized [14] as follows:

\sum_{s=0}^{l-2} b_s^{(l)} \frac{C_{ijk}^{(s+1)} - C_{ijk}^{(s)}}{\tau} \approx \frac{1}{\tau} \sum_{n=0}^{K} (-1)^n \binom{-\beta}{n} g(t_l)^{-\beta-n} S_{n,l-1},   (8)

S_{n,l} = S_{n,l-1} + (C^{(l)} - C^{(l-1)}) S_n(t_{l-1}, t_l), \quad S_{n,0} = 0,

where K is the given number of terms in the series.

So, to compute in a specific node (i, j, k) the value of the term F_{ijk}^{(l)}, which is the most time-consuming part when computing the right-hand sides of the three-diagonal linear equation systems (3)-(5), we need to store the values of S_{n,•} and, on the time step l,

• update S_{n,l-1} into S_{n,l} according to the second equation in (8);
• calculate the value of the sum according to the first equation in (8);
• substitute the obtained value of the sum into (6), getting the value of F_{ijk}^{(l)}.

4 Parallel implementation

We perform the solution of the linear systems (3)-(5) using a multithreaded OpenMP implementation, with the calculation of the non-local part of the ψ-Caputo fractional derivative upon (6) or (8) performed for every node of the grid using GPU.

The calculations upon the series approximating algorithm (8) are organized following the scheme described in [15]:

• the values of S_n(t_{l-1}, t_l) are calculated on CPU on the step l-1 in parallel with GPU computations upon (8) and uploaded on the step l into GPU memory along with the solution C^{(l-1)};
• each GPU thread calculates the value of F_{ijk}^{(l)} for a specific node (i, j, k) and, after the GPU has performed the computations, the values of F_{ijk}^{(l)} are copied back into CPU memory.

To use GPU local memory [15], we form thread groups of a controllable size Θ with computations organized in ⌈l/Θ⌉ substeps. On the substep s, the values of S_n(t_{l-1}, t_l), s⌈l/Θ⌉ ≤ n ≤ (s+1)⌈l/Θ⌉, are loaded into local memory and the corresponding part of S_{n,l} is updated by each thread. Further, the resulting K-term sum in (8) is computed with the factors (-1)^n \binom{-\beta}{n} g(t_l)^{-\beta-n} calculated in advance in parallel and loaded into local memory.

To make use of the vector data types present in the CUDA and OpenCL languages, the summations in (8) are split into m-term vector operations, obtaining

\sum_{s=0}^{l-2} b_s^{(l)} \frac{C_{ijk}^{(s+1)} - C_{ijk}^{(s)}}{\tau} \approx \frac{1}{\tau} \sum_{n=0}^{K/m} \left( v_{n,m}^{(1)}, \left( S_{nm,l-1}, ..., S_{(n+1)m,l-1} \right) \right),   (9)

v_{n,m}^{(1)} = \left( (-1)^{nm} \binom{-\beta}{nm} g(t_l)^{-\beta-nm}, ..., (-1)^{(n+1)m} \binom{-\beta}{(n+1)m} g(t_l)^{-\beta-(n+1)m} \right),

\left( S_{nm,l}, ..., S_{(n+1)m,l} \right) = \left( S_{nm,l-1}, ..., S_{(n+1)m,l-1} \right) + \left( C^{(l)} - C^{(l-1)} \right) \left( S_{nm}(t_{l-1},t_l), ..., S_{(n+1)m}(t_{l-1},t_l) \right),

where (·,·) denotes the scalar product.

The performance of the algorithm (9) was studied for 16-component double-precision vectors in [15]. Continuing this work, we consider the calculation upon (9) and the saving of S_{n,l} using floating-point data types of different precision (16, 32, and 64 bits) and different vector sizes (2, 4, 8, 16). In all cases, we save the values of C and S_n in a double-precision floating-point format in both CPU and GPU memories. The summation upon (6) can also be represented in a vectorized form, and we consider it with the same data types as the algorithm (9).
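To make the per-node work of a GPU thread concrete, a simplified, non-vectorized sketch of such a kernel is given below. It is an illustration with assumed names and data layout, not the implementation from the authors' repository: the weights w[n] = (-1)^n \binom{-\beta}{n} g(t_l)^{-\beta-n} and the values S_n(t_{l-1}, t_l) are assumed to be precomputed on the host and uploaded before the kernel launch.

// Simplified per-node kernel for the approximating algorithm (8): each thread
// first updates the K+1 coefficients S_{n,l} of its node (second equation in
// (8)) and then evaluates the truncated sum (first equation in (8)),
// following the step order listed in Section 3.
__global__ void memory_term_approx(double *S, const double *Sn_step,
                                   const double *w, const double *c_curr,
                                   const double *c_prev, double *out,
                                   int n_nodes, int K, double tau)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= n_nodes) return;

    // difference of the two last stored time layers in this node
    double dc = c_curr[node] - c_prev[node];
    double sum = 0.0;
    for (int n = 0; n <= K; ++n) {
        double s = S[node * (K + 1) + n] + dc * Sn_step[n];  // recurrence in (8)
        S[node * (K + 1) + n] = s;
        sum += w[n] * s;                                     // truncated series
    }
    out[node] = sum / tau;   // the 1/tau factor in (8)
}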
As the newest NVidia GPUs implement 16x16 matrix-matrix (tensor) multiplication operations in hardware in the so-called tensor cores, we transform the summation in (8) to make use of them.

The considered tensor operation Tma(A, B, C) multiplies two 16x16 half-precision matrices A, B and adds a 16x16 single-precision matrix C to the result of the multiplication. The result of this operation also has a single-precision data type. Considering the vector of sums S^{(16)} for 16 grid cells C^{(1)}, ..., C^{(16)}, the summation part of (9) for K = 16 can be performed as follows:

S^{(16)} \approx \frac{1}{\tau} \left( Tma(S, V, A_0) \right)_1, \quad V = \left( v_{n,16}^{(1)T}, v_0^T, ..., v_0^T \right), \quad S = \begin{pmatrix} S_{1,l-1,(1)} & \cdots & S_{16,l-1,(1)} \\ \vdots & & \vdots \\ S_{1,l-1,(16)} & \cdots & S_{16,l-1,(16)} \end{pmatrix},   (10)

where A_0 is the 16x16 zero matrix, v_0 is the 16-component zero vector, S_{n,l,(i)} is the value of the S_{n,l} coefficient for the grid cell i, and (·)_1 denotes the first column of the matrix. The algorithm (10) makes use of only one of the 16 matrix-vector multiplications performed in the Tma operation.
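A minimal sketch of how the Tma operation in (10) can be expressed with the CUDA warp matrix functions (nvcuda::wmma, compute capability 7.x) is given below. The memory layout and names are illustrative assumptions, not the code from the authors' repository: the matrix S is stored row-wise with one row per grid cell, and the first column of V holds the weights (-1)^n \binom{-\beta}{n} g(t_l)^{-\beta-n} while the remaining columns are zero.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes Tma(S, V, A0) for a 16x16 tile: the first column of the
// single-precision result holds the 16 truncated sums of (10) for 16 grid
// cells (to be multiplied by 1/tau afterwards).
__global__ void tensor_sums(const half *S, const half *V, float *out)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);          // the zero matrix A0
    wmma::load_matrix_sync(a, S, 16);      // leading dimension 16
    wmma::load_matrix_sync(b, V, 16);
    wmma::mma_sync(c, a, b, c);            // Tma(S, V, A0)
    wmma::store_matrix_sync(out, c, 16, wmma::mem_row_major);
    // out[i*16] now holds the sum for grid cell i (first column of the result)
}

The kernel is assumed to be launched with one full warp (32 threads) per 16x16 tile and compiled for the sm_70 or newer architecture.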
5 Numerical experiments

We consider the test problem for the equation (1) with g(t) = t^γ and

F(x,y,z,t) = \frac{\Gamma(1 + 2/\gamma)}{\Gamma(1 - \beta + 2/\gamma)} t^{2-\beta\gamma} - 2 D (x^2 y^2 + y^2 z^2 + x^2 z^2).

Then the solution of (1) has the form C(x,y,z,t) = C_0(x,y,z,t) = x^2 y^2 z^2 + t^2 for the case of σ = 1 and the first-type initial and boundary conditions derived from C_0(x,y,z,t).

The system (3)-(5) was solved on 200 time steps of size τ = 0.0025 with γ = 0.8, β = 0.8, grid sizes N x N x N, N = 40,...,80, and K = 16,...,64 on a single node of the SCIT-4 cluster of the V.M. Glushkov Institute of Cybernetics of NAS of Ukraine. The used node contains two Intel(R) Xeon(R) Bronze 3104 CPUs with a total of 12 cores and an NVidia RTX 2080 Ti GPU. The algorithms were implemented in both the CUDA and OpenCL languages and the corresponding source code can be accessed at https://github.com/sevaboh/FPDE_HPC.

5.1 Performance for a fixed N

The available set of data types for the CUDA implementation was double, float, half, double[2,4], float[2,4], and half2. In the case of OpenCL, we considered the double, float, double[2,4,8,16], and float[2,4,8,16] data types implemented in CUDA 10. The times spent on computation upon (6), (9), and (10) on 200 iterations for N = 40, K = 16, and all the considered combinations of languages and data types are given in Table 1. In all experiments, the thread block size Θ was equal to K.

The times spent on a single iteration are nearly equal for the approximating algorithm (9) and tend to increase linearly for the summation upon (6). This linear increase is non-monotone (Fig. 1) due to incomplete usage of GPU resources when t mod K ≠ 0, where t is the time step number.

Doing calculations upon (6) in the CUDA implementation, compared to the usage of a double-precision data type, the algorithms implemented using data types of lower precision performed the computations ~2 times faster for the 32-bit single-precision data type and ~3 times faster for the 16-bit half-precision data type.

The usage of the approximating algorithm (9) decreased the computation time up to ~5 times compared to the summation upon (6). Here, the usage of a single-precision data type decreased the time only by <12% when the algorithm was implemented in CUDA. This can be explained by a high amount of data type conversion operations in the implementations of (9) due to the storage of the solutions C and the coefficients S_n using the 64-bit double-precision data type.

The usage of a half-precision data type in the CUDA implementation slowed the computations compared to the usage of a single-precision data type. However, the half-precision implementation that uses tensor cores was the fastest of all the considered variants. The usage of a single-precision data type in the OpenCL implementation slowed down the computations compared to the usage of a double-precision data type.

Table 1. Times spent on computation upon (6), (9), and (10) for N = 40, K = 16

Data type | CUDA, summation upon (6) | CUDA, approximating algorithm (9) | OpenCL, summation upon (6) | OpenCL, approximating algorithm (9)
double | 344,18 | 78,37 | 286,86 | 81,14
float | 204,03 | 74,86 | 88,27 | 83,97
half | 105,77 | 77,32 | - | -
double2 | 350,24 | 81,41 | 287,11 | 81,90
float2 | 191,38 | 73,02 | 91,08 | 84,97
half2 | 98,14 | 74,70 | - | -
double4 | 345,18 | 79,52 | 287,57 | 81,56
float4 | 182,00 | 70,19 | 90,22 | 84,87
double8 | - | - | 292,92 | 80,81
float8 | - | - | 90,50 | 85,18
double16 | - | - | 292,41 | 80,99
float16 | - | - | 90,76 | 84,38
Usage of tensor cores | - | 69,49 | - | -

For K = 16, vectorization accelerated computations only for the case of a single-precision floating-point data type in the CUDA implementation. To explain the observed behavior, we compare the PTX assembly sources generated by the CUDA and OpenCL compilers with the generation of FMA (floating-point multiply-add) instructions enabled.

Fig. 1. CUDA kernel execution time, ns, for the total summation upon (6), N = 40, K = 16 (legend: GPU1d, GPU1f, GPU1h, GPU2d, GPU2f, GPU2h, GPU4d, GPU4f)

The following conclusions can be made from the performed comparison:

• both compilers perform automatic unwinding of 16 loop iterations;
• neither compiler generates vector arithmetic instructions when vector data types are used;
• both compilers generate vectorized 128-bit load/store instructions [ld,st].[shared,global].v2.f64 and [ld,st].[shared,global].v4.f32, and it is their usage that accelerates computations when vector data types are used (see the fragment after this list);
• when both compilers perform loop unwinding, memory is addressed through fixed offsets from the base address for both shared and global memory (e.g. ld.global.f64 %fd77, [%rd23+40]). For the vectorized data types, offsets to global memory are computed for every memory access to the corresponding vector (e.g. mul.wide.s32 %rd28, %r67, 128; add.s64 %rd29, %rd3, %rd28; ld.global.v2.f64 {%fd172, %fd173}, [%rd29+112]; ld.global.v2.f64 {%fd176, %fd177}, [%rd29+96]; ...). For both algorithms (6) and (8), such additional operations slowed down the computations. However, the increase of K leads to a decrease of the number of auxiliary operations, allowing up to 15% speed-up for the double16 data type usage in OpenCL;
• the CUDA compiler automatically vectorizes the reading of 32-bit floating-point data stored in the local memory using the 64-bit instruction ld.shared.v2.f32, making the calculation upon (9) faster than in the case of double-precision data type usage. The absence of such automatic vectorization in OpenCL makes the single-precision OpenCL implementation perform slower than the double-precision one. The same applies to half-precision data type usage in CUDA, which yields slower code than in the case of a single-precision data type;
• while performing calculations upon (6), the CUDA compiler implements the division operation with the div.rn instruction, while the OpenCL compiler uses the approximate div.full instruction, making the OpenCL code faster.
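For illustration, the kind of memory access that the float4 data type maps to can be sketched as follows; this is an illustrative fragment (not the repository code) assuming that the arrays are 16-byte aligned and that the number of terms is a multiple of 4. The loads compile to ld.global.v4.f32 while the arithmetic remains scalar.

// Reading four consecutive single-precision S coefficients and four weights
// as float4 values produces 128-bit vectorized loads; the multiply-add part
// is still performed with scalar instructions.
__device__ float sum4(const float *S, const float *w, int K)
{
    float acc = 0.0f;
    for (int n = 0; n < K; n += 4) {
        float4 s = *reinterpret_cast<const float4 *>(&S[n]);
        float4 v = *reinterpret_cast<const float4 *>(&w[n]);
        acc += s.x * v.x + s.y * v.y + s.z * v.z + s.w * v.w;
    }
    return acc;
}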
To check how the tendencies of vectorization efficiency observed for K = 16 change for its bigger values, a series of computational experiments was conducted for N = 40 and K = 16,...,64 for the case of the approximating algorithm (9). Here, automatic loop unwinding leads to a decrease of the number of auxiliary operations with the increase of K. The computation time also decreases: by up to 15% for the double16 data type usage in OpenCL, up to 5.3% for the double8 data type usage in OpenCL, up to 2.6% for the float16 data type usage in OpenCL, and up to 19% for the float4 data type usage in CUDA. The usage of the double[2,4] and float[2,8] data types in OpenCL slowed down the computations.

Thus, while the OpenCL language supports larger vector data types than the CUDA language (16 elements compared to 4 elements), their inefficient compilation by the NVidia compiler leads to the situation when OpenCL vectorized implementations of the algorithm (9) allow accelerating computations only for large K and the largest vector data types.

5.2 Performance in the case of a variable grid size

The influence of the time of fractional derivative values calculation on the total time spent on simulation, along with the speed-ups of different GPU implementations, is given in Table 2 for the summation upon (6) and in Table 3 for the approximating algorithm (9) for N = 40,...,80.

Table 2. Algorithms' relative characteristics for the summation upon (6), K = 16, 200 iterations

 | | N = 40 | N = 60 | N = 80
CPU | Percentage of summation to total time | 31,93% | 36,64% | 36,41%
CUDA float to CPU | Total time | 40,74% | 43,58% | 51,15%
CUDA float to CPU | Summation time | 939,46% | 1374,70% | 1568,14%
CUDA float | Percentage of summation to total time | 4,32% | 3,57% | 3,30%
CUDA float | Percentage of auxiliary host-GPU operations | 28,62% | 22,25% | 22,22%
CUDA float4 to CUDA float | Summation time | 7,52% | 5,40% | 8,80%
OpenCL float to CUDA float | Summation time | 68,19% | 104,21% | 95,05%

In the algorithm (6), the time spent on fractional derivative values computation (summation time) comprises up to 36% of the total computation time on CPU. This value lowers to ~4% on GPU, where this operation becomes 9-30 times faster and the total computation time becomes ~50% lower. As can be seen from Table 2, the GPU algorithms' efficiency increases with the increase of N, while the efficiency of vectorization remains on the same level.

For the approximating algorithm (9) and K = 16, fractional derivative values computation comprises up to 40% of the total computation time while executing on CPU, with a slight decrease of this percentage with the increase of N. GPU efficiency here is close to the case of the summation upon (6): 15-27 times acceleration of the calculations upon the algorithm (9), a 51-63% decrease of the total computation time compared to CPU, and ~3% of time spent by the GPU algorithms on fractional derivative values computation. Compared to the slower algorithm (6), the time spent on auxiliary CPU-GPU memory copying operations while calling GPU kernels is here bigger: ~60% compared to ~25%.
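Since these auxiliary host-GPU operations take a noticeable share of the time, the scheme from Section 4 relies on overlapping host work with GPU work. Below is a minimal sketch of how one simulation step can be organized with asynchronous copies in a CUDA stream; it is an assumption about a possible host-side organization (using the illustrative kernel sketched in Section 4), not the code of the authors' implementation, and the host buffers are assumed to be pinned (cudaHostAlloc) so that the copies are truly asynchronous.

#include <cuda_runtime.h>

// One time step: upload S_n(t_{l-1}, t_l) and the newest solution layer,
// launch the per-node kernel, and start the download of the results, all in
// a single stream, so that the CPU thread is free to prepare the
// S_n(t_l, t_{l+1}) values for the next step before synchronizing.
void step_memory_terms(cudaStream_t stream,
                       double *d_S, double *d_Sn_step, double *d_w,
                       double *d_c_curr, double *d_c_prev, double *d_out,
                       const double *h_Sn_step, const double *h_c_curr,
                       double *h_out, int n_nodes, int K, double tau)
{
    // d_c_prev holds the layer that was uploaded on the previous step
    cudaMemcpyAsync(d_Sn_step, h_Sn_step, (K + 1) * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(d_c_curr, h_c_curr, (size_t)n_nodes * sizeof(double),
                    cudaMemcpyHostToDevice, stream);

    memory_term_approx<<<(n_nodes + 255) / 256, 256, 0, stream>>>(
        d_S, d_Sn_step, d_w, d_c_curr, d_c_prev, d_out, n_nodes, K, tau);

    cudaMemcpyAsync(h_out, d_out, (size_t)n_nodes * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
    // ...CPU work for the next time step can be done here...
    cudaStreamSynchronize(stream);
}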
Efficiency of vectorization does not depend on N, with the usage of tensor cores yielding the highest, 12%, increase compared with the usage of a half-precision data type and an 8% increase compared to the usage of a single-precision data type. The efficiency of the usage of tensor operations compared to the fastest case of the vectorized float4 implementation is, however, only 2%.

Table 3. Algorithms' relative characteristics for the approximating algorithm (9), K = 16, average on 200 iterations

 | | N = 40 | N = 60 | N = 80
CPU | Percentage of summation to total time | 39,85% | 38,04% | 36,85%
CUDA float to CPU | Total time | 61,19% | 53,05% | 55,29%
CUDA float to CPU | Summation time | 1717,88% | 2536,28% | 2735,08%
CUDA float | Percentage of summation to total time | 3,53% | 2,21% | 2,02%
CUDA float | Percentage of auxiliary host-GPU operations | 67,30% | 61,63% | 59,96%
CUDA float4 to CUDA float | Kernel execution time | 6,43% | 6,40% | 6,60%
OpenCL float to CUDA float | Kernel execution time | -10,84% | -9,83% | -9,72%
CUDA half to CUDA float | Kernel execution time | -3,27% | -3,31% | -3,44%
CUDA half2 to CUDA float | Kernel execution time | 0,21% | 0,68% | 0,69%
CUDA half, tensor cores, to CUDA float | Kernel execution time | 7,53% | 8,07% | 8,24%
CUDA half, tensor cores, to CUDA half | Kernel execution time | 11,16% | 11,77% | 12,09%

5.3 Errors of solution

Changes in the maximal square absolute error on the iterations while using the summation upon (6) and the series expansion upon (9) for different data types are given in Fig. 2. Lowering of the data type precision has little influence on the quality of the solution for the summation upon (6) due to the monotone increase of b_s^{(i)} when i → s, which minimizes summation errors. The error for the algorithm (6) decreases with the increase of the time step number.

Fig. 2. Maximal square absolute error on time steps while solving the problem using different data types (curves: full summation (all data types); series expansion (float, double); series expansion (half); series expansion (half, tensor cores))

Close values of errors between the summation upon (6) and the approximating algorithm (9) with K = 16 on the initial time steps show the adequacy of the chosen value of K. However, for the algorithm (9), the recurrently changing values of S_{n,l} overflow the data type values range starting from some time step, leading to a linear increase of the error. In the conducted experiments, such behavior was observed for the half- and single-precision data types.
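As a minimal illustration of this overflow mechanism (an illustrative example, not taken from the paper): the largest finite half-precision value is about 65504, so a recurrently updated S_{n,l} stored in half precision becomes infinite once it passes that bound, and all later updates are lost.

#include <cstdio>
#include <cmath>
#include <cuda_fp16.h>

// Host-side demonstration: accumulating a growing increment into a
// half-precision value overflows to infinity after a few dozen steps.
int main()
{
    __half s = __float2half(0.0f);
    for (int l = 1; l <= 1000; ++l) {
        // emulates S_{n,l} = S_{n,l-1} + (C^{(l)} - C^{(l-1)}) S_n(t_{l-1}, t_l)
        // with an increment that grows with the time step number
        s = __float2half(__half2float(s) + 100.0f * l);
        if (std::isinf(__half2float(s))) {
            std::printf("half-precision overflow at step %d\n", l);
            break;
        }
    }
    return 0;
}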
6 Conclusions

Calculation of time-fractional ψ-Caputo derivatives' values significantly influences the time needed to solve the fractional differential equations that contain them. For the considered case of the diffusion equation solved using a finite-difference scheme, this operation comprises up to 40% of the time spent while performing calculations on CPU. Its parallelization on GPU is efficient because the computations on the cells of the grid are independent.

We studied additional possibilities to speed up GPU implementations considering the summation algorithm of linear computational complexity and the approximating recurrent algorithm of constant complexity. Considering the usage of floating-point data types of different precision, their vectors, and the matrix-matrix multiplication operations implemented in tensor cores of the latest NVidia GPUs, the following conclusions can be made:

• for the summation algorithm (6), compared to the usage of a double-precision data type, the computations were ~2 times faster using the 32-bit single-precision data type and ~3 times faster using the 16-bit half-precision data type without significant loss of accuracy;
• for the approximating algorithm (9), which was up to 5 times faster than the summation algorithm, the usage of different data types slightly influences the performance due to the high amount of data type conversion operations. However, the usage of low-precision data types here leads to a linear increase of the solution error starting from some time step due to data type overflow;
• inefficient compilation of vectorized memory access in OpenCL code leads to the situation when OpenCL vectorized implementations of the algorithm (9) allow accelerating computations only for a large number K of terms in the truncated series and the largest vector data types. Thus, the usage of CUDA is preferable for the approximating algorithm (9). However, for the summation algorithm (6), OpenCL implementations were faster due to the differences in compiling the division arithmetic operation;
• GPU implementations accelerate fractional derivative values computation 9-30 times, with an efficiency that increases with the increase of the grid size;
• the usage of tensor cores that operate with the 16-bit half-precision floating-point data type allowed performing calculations 12% faster than the usage of a half-precision data type without vectorization. However, compared to the efficiency of the fastest vectorized implementation that uses the float4 data type, only a 2% acceleration was obtained.

References

1. Gorenflo, R., Mainardi, F.: Fractional calculus: integral and differential equations of fractional order. In: Fractals and Fractional Calculus in Continuum Mechanics, pp. 223-276. Springer Verlag, Wien, Austria (1997)
2. Podlubny, I.: Fractional Differential Equations. Academic Press, New York (1999)
3. Bulavatsky, V.M.: Mathematical modeling of dynamics of the process of filtration convection diffusion under the condition of time nonlocality. Journal of Automation and Information Science 44(4), 13-22 (2012). doi: 10.1615/JAutomatInfScien.v44.i4.20
4. Bonchis, C., Kaslik, E., Rosu, F.: HPC optimal parallel communication algorithm for the simulation of fractional-order systems. J. Supercomput. 75, 1014-1025 (2019). doi: 10.1007/s11227-018-2267-z
5. Liu, J., Gong, C., Bao, W., Tang, G., Jiang, Y.: Solving the Caputo Fractional Reaction-Diffusion Equation on GPU. Discrete Dynamics in Nature and Society 2014, 820162 (2014). doi: 10.1155/2014/820162
6. Golev, A., Penev, A., Stefanova, K., Hristova, S.: Using GPU to speed up calculation of some approximate methods for fractional differential equations. International Journal of Pure and Applied Mathematics 119(3), 391-401 (2018). doi: 10.12732/ijpam.v119i3.1
7. Gong, C., Bao, W., Liu, J.: A piecewise memory principle for fractional derivatives. Fract. Calc. Appl. Anal. 20(4), 1010-1022 (2017). doi: 10.1515/fca-2017-0052
8. Ford, N.J., Simpson, A.C.: The numerical solution of fractional differential equations: Speed versus accuracy. Numerical Algorithms 26(4), 333-346 (2001). doi: 10.1023/A:1016601312158
9. Baffet, D., Hesthaven, J.S.: A kernel compression scheme for fractional differential equations. SIAM J. Numer. Anal. 55(2), 496-520 (2017). doi: 10.1137/15M1043960
10. Bogaenko, V.A., Bulavatsky, V.M.: Computer modeling of the dynamics of migration processes of soluble substances in the case of groundwater filtration with free surface on the base of the fractional derivative approach (in Russian). Dopov. Nac. Akad. Nauk Ukr. 12, 21-29 (2018). doi: 10.15407/dopovidi2018.12.021
11. Almeida, R.: A Caputo fractional derivative of a function with respect to another function. Communications in Nonlinear Science and Numerical Simulation 44, 460-481 (2017). doi: 10.1016/j.cnsns.2016.09.006
12. Samarskii, A.: The Theory of Difference Schemes. CRC Press, New York (2001)
13. Bohaienko, V.O.: A fast finite-difference algorithm for solving space-fractional filtration equation with a generalised Caputo derivative. Computational and Applied Mathematics 38(3), 105 (2019). doi: 10.1007/s40314-019-0878-5
14. Bohaienko, V.O.: Efficient computation schemes for generalized two-dimensional time-fractional diffusion equation. In: Papers of the International Scientific and Practical Conference ІТСМ-2019, Ivano-Frankivsk, Ukraine, pp. 238-241 (2019)
15. Bohaienko, V.O.: Performance of vectorized GPU-algorithm for computing ψ-Caputo derivative values. In: Hu, Z., Petoukhov, S., Dychka, I., He, M. (eds.) Advances in Computer Science for Engineering and Education III. ICCSEEA 2020. Advances in Intelligent Systems and Computing, vol. 1247. Springer, Cham (2021). doi: 10.1007/978-3-030-55506-1_24