Computing ψ-Caputo Fractional Derivative Values Using CUDA 10

                          Vsevolod Bohaienko[0000-0002-3317-9022]

         V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, Kyiv, Ukraine
                                    sevab@ukr.net



       Abstract. The paper addresses the issues of efficient GPU implementation of ψ-Caputo fractional derivative values computation on NVIDIA GPUs with compute capability 7.5 using the CUDA 10 SDK in both the CUDA and OpenCL languages. We consider a three-dimensional time-fractional diffusion equation solved by a locally one-dimensional finite-difference scheme. To compute the non-local part of the derivative, a rectangle-rule quadrature is used, and a summation algorithm of linear computational complexity is considered along with a constant-complexity approximating algorithm based on the expansion of the integral kernel into a series. For the approximating algorithm, we present a computational scheme that uses the tensor cores of NVidia GPUs. For both algorithms, we study the influence of the used scalar and vector data types on performance and accuracy. For the summation algorithm, compared to the usage of the 64-bit double-precision floating-point data type, the computations were ~2 times faster for the 32-bit single-precision data type and ~3 times faster for the 16-bit half-precision data type without significant loss of accuracy. For the approximating algorithm, which was up to 5 times faster than the summation algorithm, the usage of low-precision data types only slightly influenced the performance while reducing the accuracy during long-term simulations. The usage of vectorized operations in the approximating algorithm allowed a 6-19% speed-up compared with non-vectorized implementations for the single-precision data type. The usage of tensor cores that operate with a half-precision data type allowed performing calculations 12% faster compared to the case when the same data type was used without them.

       Keywords: GPU algorithms, finite-difference method, diffusion equation,
       ψ-Caputo fractional derivative, tensor cores, data types, CUDA, OpenCL.


1      Introduction

Memory effects in diffusion processes can be efficiently simulated using time-fractional differential equations [1-3].
    Such equations contain the so-called fractional derivatives, which are integro-differential operators.
    The need to numerically calculate integrals while solving time-fractional differential equations increases the computational complexity compared to traditional differential equations.


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). ICST-2020
    Approaches to lowering the computational complexity include parallel computing techniques [4,5,6], particularly using GPUs [5,6], and approximation of integral kernels [7,8,9].
    In this paper, we consider a three-dimensional diffusion equation with the generalized ψ-Caputo derivative with a functional parameter, solved by a locally one-dimensional finite-difference scheme similarly to [10].
    We focus on the efficient GPU implementation of ψ-Caputo derivative values computation, which is one of the most time-consuming operations while performing simulations, on NVidia GPUs using the CUDA 10 SDK.
    We study the performance of GPU algorithms when scalar and vector data types of different lengths and precisions are used in CUDA and OpenCL implementations, along with the possibility of speeding up computations using the 16x16 matrix-matrix multiplication operations that are implemented in hardware in the so-called tensor cores of NVidia GPUs with compute capability 7.x.


2         Problem statement and finite-difference scheme

We consider the following three-dimensional time-fractional diffusion equation [10]:

    \sigma D_{t,g}^{(\beta)} C(x,y,z,t) = D \left( \frac{\partial^2 C(x,y,z,t)}{\partial x^2} + \frac{\partial^2 C(x,y,z,t)}{\partial y^2} + \frac{\partial^2 C(x,y,z,t)}{\partial z^2} \right) + F(x,y,z,t),    (1)

    0 \le x \le 1, \quad 0 \le y \le 1, \quad 0 \le z \le 1, \quad t > 0, \quad 0 < \beta \le 1,

where C(x,y,z,t) is the diffusive substance concentration, σ is the porosity, D is the diffusion coefficient, F is the given source-term function, and D_{t,g}^{(β)} is the generalized ψ-Caputo fractional derivative with respect to the time variable t with the functional parameter g(t), which has the following form [11]:

    D_{t,g}^{(\beta)} C(x,y,z,t) = \frac{1}{\Gamma(1-\beta)} \int_0^t \frac{\partial C(x,y,z,\tau)}{\partial \tau} \left( g(t) - g(\tau) \right)^{-\beta} d\tau,    (2)

    g(t) \in C^1[0,+\infty), \quad g'(t) > 0 \; (t \ge 0), \quad g(0) = 0.

   For the equation (1) we pose first-type initial and boundary conditions. The initial-boundary value problem for the equation (1) is discretized using a locally one-dimensional finite-difference scheme [10,12] on the uniform grid

    \omega = \{ (x_i, y_j, z_k, t_l) : x_i = i h_1 \; (i = \overline{0, n_1+1}), \; y_j = j h_2 \; (j = \overline{0, n_2+1}), \; z_k = k h_3 \; (k = \overline{0, n_3+1}), \; t_l = l\tau \},

where h_1, h_2, h_3, τ are the steps with respect to the space and the time variables.
  The finite-difference scheme, similarly to [10,12], is as follows:
    C_{ijk}^{(l+1/3)} - \frac{D}{k_1 h_1^2}\left( C_{i-1,j,k}^{(l+1/3)} - 2C_{ijk}^{(l+1/3)} + C_{i+1,j,k}^{(l+1/3)} \right) = C_{ijk}^{(l)} + \frac{\sigma}{3 k_1} F_{ijk}^{(l)},    (3)

    C_{ijk}^{(l+2/3)} - \frac{D}{k_1 h_2^2}\left( C_{i,j-1,k}^{(l+2/3)} - 2C_{ijk}^{(l+2/3)} + C_{i,j+1,k}^{(l+2/3)} \right) = C_{ijk}^{(l+1/3)} + \frac{\sigma}{3 k_1} F_{ijk}^{(l)},    (4)

    C_{ijk}^{(l+1)} - \frac{D}{k_1 h_3^2}\left( C_{i,j,k-1}^{(l+1)} - 2C_{ijk}^{(l+1)} + C_{i,j,k+1}^{(l+1)} \right) = C_{ijk}^{(l+2/3)} + \frac{\sigma}{3 k_1} F_{ijk}^{(l)},    (5)

    k_1 = \frac{b_l^{(l+1)}}{\tau \, \Gamma(1-\beta)},

    F_{ijk}^{(l)} = F_{ijk} + \frac{1}{\Gamma(1-\beta)} \sum_{s=0}^{l-2} b_s^{(l)} \frac{C_{ijk}^{(s+1)} - C_{ijk}^{(s)}}{\tau},    (6)

    b_s^{(i)} = \int_{t_s}^{t_{s+1}} \left( g(t_i) - g(\tau) \right)^{-\beta} d\tau.    (7)

   With the addition of the discretized forms of the boundary conditions, the equations (3)-(5) can be represented as tridiagonal systems of linear equations, which are solved by the Thomas algorithm [12].
   The summation in the right-hand side of (6) is a first-order discretization of the non-local part of the fractional derivative, obtained by applying the rectangle quadrature rule to the integral in (2).
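   To make the discretization concrete, the sketch below (not the paper's code; the names, the example choice g(t) = t^γ, and the midpoint-rule evaluation of b_s are assumptions) computes the weights (7) and the history sum from (6) for a single grid node:

```cuda
#include <cmath>

// Example weight function g(t) = t^gamma (the test problem of Section 5).
__host__ __device__ double g_fun(double t, double gamma) { return pow(t, gamma); }

// b_s^(i) from (7): integral over [t_s, t_{s+1}] of (g(t_i) - g(tau))^(-beta),
// approximated here by a midpoint (rectangle) rule on n_sub subintervals.
__host__ __device__ double b_weight(double t_s, double t_s1, double t_i,
                                    double beta, double gamma, int n_sub) {
    double h = (t_s1 - t_s) / n_sub, sum = 0.0;
    for (int q = 0; q < n_sub; ++q) {
        double tau = t_s + (q + 0.5) * h;
        sum += pow(g_fun(t_i, gamma) - g_fun(tau, gamma), -beta) * h;
    }
    return sum;
}

// O(l) history sum in (6) for one node: sum_{s=0}^{l-2} b_s^(l) (C^(s+1) - C^(s)) / tau.
// C_hist holds the solution values at this node on all previous time steps.
__host__ __device__ double history_sum(const double* C_hist, int l, double tau,
                                       double beta, double gamma) {
    double acc = 0.0;
    for (int s = 0; s <= l - 2; ++s)
        acc += b_weight(s * tau, (s + 1) * tau, l * tau, beta, gamma, 16)
               * (C_hist[s + 1] - C_hist[s]) / tau;
    return acc;   // multiplied by 1/Gamma(1-beta) and added to F_ijk in (6)
}
```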


3      Approximating algorithm for computing ψ-Caputo fractional
       derivative values

Computation of the non-local part of the fractional derivative according to (6) has O(l) computational complexity and requires storing all previously obtained solutions, which makes it time- and memory-consuming when simulations are performed over large time intervals. Hence, we consider a constant-complexity algorithm based on a series expansion of the integrals (7) [13,14] that requires storing the solutions on only the two last time steps.
   When g(t) has an infinitely differentiable inverse function f(t), b_s^{(j)} can be represented in the form [13,14]

    b_s^{(j)} = \sum_{n=0}^{\infty} (-1)^n \binom{-\beta}{n} g(t_j)^{-\beta-n} S_n,

    S_n(t_s, t_{s+1}) = \sum_{m=0}^{\infty} B_m \frac{f^{(m+1)}(g(t_{s+1}))}{m!},

    B_m = \int_{g(t_s)}^{g(t_{s+1})} x^n (x - g(t_{s+1}))^m dx, \quad B_0 = \int_{g(t_s)}^{g(t_{s+1})} x^n dx = \frac{1}{n+1}\left( g(t_{s+1})^{n+1} - g(t_s)^{n+1} \right),

    B_{i+1} = -\left( \frac{n+i+2}{g(t_{s+1})(i+1)} B_i - \frac{g(t_s)^{n+1} \left( g(t_s) - g(t_{s+1}) \right)^{i+1}}{g(t_{s+1})(i+1)} \right).

   Truncating the infinite series, the summation in (6) using such a representation can be organized [14] as follows:

    \sum_{s=0}^{l-2} b_s^{(l)} \frac{C_{ijk}^{(s+1)} - C_{ijk}^{(s)}}{\tau} \approx \frac{1}{\tau} \sum_{n=0}^{K} (-1)^n \binom{-\beta}{n} g(t_l)^{-\beta-n} S_{n,l-1},    (8)

    S_{n,l} = S_{n,l-1} + \left( C^{(l)} - C^{(l-1)} \right) S_n(t_{l-1}, t_l), \quad S_{n,0} = 0,

where K is the given number of terms in the truncated series.
   So, to compute in a specific node (i,j,k) the value of the term F_{ijk}^{(l)}, which is the most time-consuming part of computing the right-hand sides of the tridiagonal linear systems (3)-(5), we need to store the values of S_{n,•} and, on the time step l,

• update S_{n,l-1} into S_{n,l} according to the second equation in (8);
• calculate the value of the sum according to the first equation in (8);
• substitute the obtained value of the sum into (6), obtaining the value of F_{ijk}^{(l)} (a sketch of these steps is given below).
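   A minimal per-node sketch of these steps follows (the names and the in-loop computation of the generalized binomial coefficients are assumptions, and the index shift between the update and the sum is simplified relative to (8)):

```cuda
#include <cmath>

// S[0..K-1]    : the recurrent coefficients S_n kept for this node between time steps
// Sn_int[n]    : S_n(t_prev, t_cur), precomputed on the CPU
// C_cur, C_prev: the two last stored solution values at this node
__device__ double series_history(double* S, const double* Sn_int,
                                 double C_cur, double C_prev,
                                 double g_t, double beta, double tau, int K) {
    double acc = 0.0;
    double binom = 1.0;                        // generalized binomial coefficient C(-beta, 0) = 1
    for (int n = 0; n < K; ++n) {
        S[n] += (C_cur - C_prev) * Sn_int[n];  // second equation in (8)
        acc  += ((n & 1) ? -1.0 : 1.0) * binom * pow(g_t, -beta - n) * S[n];
        binom *= (-beta - n) / (n + 1.0);      // C(-beta, n+1) from C(-beta, n)
    }
    return acc / tau;                          // replaces the O(l) history sum in (6)
}
```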


4       Parallel implementation

We solve the linear systems (3)-(5) using a multithreaded OpenMP implementation, while the calculation of the non-local part of the ψ-Caputo fractional derivative upon (6) or (8) is performed for every node of the grid on the GPU.
   The calculations upon the series-based approximating algorithm (8) are organized following the scheme described in [15]:

• the values of S_n(t_{l-1}, t_l) are calculated on the CPU on the step l-1 in parallel with the GPU computations upon (8) and are uploaded on the step l into GPU memory along with the solution C^{(l-1)};
• each GPU thread calculates the value of F_{ijk}^{(l)} for a specific node (i,j,k) and, after the GPU has performed the computations, the values of F_{ijk}^{(l)} are copied back into CPU memory.
    To use the GPU's local memory [15], we form thread groups of a controllable size, with the computations organized in substeps. On the substep s, the corresponding portion of the values S_n(t_{l-1}, t_l) is loaded into local memory and the corresponding part of S_{n,l} is updated by each thread. Further, the resulting K-term sum in (8) is computed, with the factors (-1)^n \binom{-\beta}{n} g(t_l)^{-\beta-n} calculated in advance in parallel and loaded into local memory.
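    One possible shape of such a kernel is sketched below (this is a sketch, not the repository's code: the kernel name, the argument list, the per-node layout of S, and the ordering of the update and the sum are assumptions):

```cuda
// One thread per grid node; the step-dependent factors are staged in shared memory.
__global__ void update_F(double* F, double* S,       // S: K coefficients per node, node-major
                         const double* C_cur, const double* C_prev,
                         const double* Sn_int,       // S_n(t_{l-1}, t_l), length K
                         const double* kfac,         // (-1)^n C(-beta,n) g(t_l)^(-beta-n), length K
                         double inv_gamma1mb,        // 1/Gamma(1-beta)
                         double tau, int K, int nodes) {
    extern __shared__ double shm[];                  // 2*K doubles reserved at launch
    double* sh_int = shm;
    double* sh_fac = shm + K;
    for (int n = threadIdx.x; n < K; n += blockDim.x) {   // cooperative staging
        sh_int[n] = Sn_int[n];
        sh_fac[n] = kfac[n];
    }
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nodes) return;
    double dC = C_cur[idx] - C_prev[idx], acc = 0.0;
    for (int n = 0; n < K; ++n) {
        double s = S[idx * K + n] + dC * sh_int[n];  // recurrence in (8)
        S[idx * K + n] = s;
        acc += sh_fac[n] * s;
    }
    F[idx] += inv_gamma1mb * acc / tau;              // contribution to (6)
}
```

    With the thread block size equal to K, as in the experiments of Section 5, a launch of this sketch would reserve 2*K*sizeof(double) bytes of shared memory for the staged factors.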
   To make use of the vector data types present in the CUDA and OpenCL languages, the summations in (8) are split into m-term vector operations, obtaining

    \sum_{s=0}^{l-2} b_s^{(l)} \frac{C_{ijk}^{(s+1)} - C_{ijk}^{(s)}}{\tau} \approx \frac{1}{\tau} \sum_{n=0}^{\lceil K/m \rceil} \left( v_{n,m}^{(1)}, \left( S_{n \cdot m, l-1}, \ldots, S_{(n+1) \cdot m, l-1} \right) \right),

    v_{n,m}^{(1)} = \left( (-1)^{n \cdot m} \binom{-\beta}{n \cdot m} g(t_l)^{-\beta - n \cdot m}, \ldots, (-1)^{(n+1) \cdot m} \binom{-\beta}{(n+1) \cdot m} g(t_l)^{-\beta - (n+1) \cdot m} \right),

    \left( S_{n \cdot m, l}, \ldots, S_{(n+1) \cdot m, l} \right) = \left( S_{n \cdot m, l-1}, \ldots, S_{(n+1) \cdot m, l-1} \right) + \left( C^{(l)} - C^{(l-1)} \right) \left( S_{n \cdot m}(t_{l-1}, t_l), \ldots, S_{(n+1) \cdot m}(t_{l-1}, t_l) \right).    (9)



   The performance of the algorithm (9) was studied for 16-component double-
precision vectors in [15].
   Continuing this work, we consider the calculation upon (9) and the saving of S n,l
using floating-point data types of different precision (16, 32, and 64 bits) and different
vector sizes (2, 4, 8, 16).
    In all cases, we save the values of C and S n in a double-precision floating-point
format in both CPU and GPU memories. The summation upon (6) can also be repre-
sented in vectorized form, and we consider it with the same data types as the algo-
rithm (9).
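    A float4 sketch of the vectorized inner loop of (9) is given below (the names, the single-precision storage of the per-node S values, and the assumption that K is divisible by 4 and the arrays are 16-byte aligned are ours):

```cuda
// S       : per-node coefficients S_n stored in single precision, length K
// Sn_int4 : S_n(t_{l-1}, t_l) packed into float4, length K/4
// kfac4   : (-1)^n C(-beta,n) g(t_l)^(-beta-n) packed into float4, length K/4
__device__ double series_history_f4(float* S, const float4* Sn_int4, const float4* kfac4,
                                    double C_cur, double C_prev, double tau, int K) {
    float dC = (float)(C_cur - C_prev);
    float acc = 0.0f;
    float4* S4 = reinterpret_cast<float4*>(S);        // S must be 16-byte aligned
    for (int n4 = 0; n4 < K / 4; ++n4) {
        float4 si = Sn_int4[n4], kf = kfac4[n4], s = S4[n4];
        s.x += dC * si.x; s.y += dC * si.y; s.z += dC * si.z; s.w += dC * si.w;   // recurrence in (9)
        S4[n4] = s;                                    // 128-bit vectorized store
        acc += kf.x * s.x + kf.y * s.y + kf.z * s.z + kf.w * s.w;                 // 4-term part of the sum
    }
    return (double)acc / tau;
}
```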
    As the newest NVidia GPUs implement 16x16 matrix-matrix (tensor) multiplica-
tion operations in hardware in the so-called tensor cores, we transform the summation
in (8) to make use of them.
    The considered tensor operation Tma ( A, B, C ) multiplies two 16x16 half-precision
matrices A , B and adds a 16x16 single-precision matrix C to the result of multipli-
cation. The result of this operation also has a single-precision data type. Considering
the vector of sums S (16) for 16 grid cells C (1) ,...,C (16) , the summation part of (9) for
K = 16 can be performed as follows:
    S^{(16)} \approx \frac{1}{\tau} \left( Tma\left( S, V, A_0 \right) \right)_1, \quad V = \left( v_{n,16}^{(1)T}, v_0^T, \ldots, v_0^T \right), \quad S = \begin{pmatrix} S_{1,l-1,(1)} & \ldots & S_{16,l-1,(1)} \\ \ldots & \ldots & \ldots \\ S_{1,l-1,(16)} & \ldots & S_{16,l-1,(16)} \end{pmatrix},    (10)

where A_0 is the 16x16 zero matrix, v_0 is the 16-component zero vector, S_{n,l,(i)} is the value of the S_{n,l} coefficient for the grid cell i, and (·)_1 denotes the first column of the matrix.
    The algorithm (10) makes use of only one of the 16 matrix-vector multiplications performed in the Tma operation.
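    A sketch of how the Tma step in (10) could map onto the CUDA WMMA API is shown below (the helper name, the row-major layouts, and placing v^(1) in the first column of V are assumptions; all threads of one warp must execute it together):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// S   : 16x16 half matrix, row i holds S_{1..16, l-1, (i)} for grid cell i
// V   : 16x16 half matrix whose first column is v^(1) and whose other columns are zero
// out : 16x16 float result; its first column holds the 16 per-cell sums of (10)
__device__ void tma_step(const half* S, const half* V, float* out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::fill_fragment(c, 0.0f);                   // the zero matrix A0
    wmma::load_matrix_sync(a, S, 16);
    wmma::load_matrix_sync(b, V, 16);
    wmma::mma_sync(c, a, b, c);                     // Tma(S, V, A0) = S * V + A0
    wmma::store_matrix_sync(out, c, 16, wmma::mem_row_major);
}
```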


5              Numerical experiments

We consider the test problem for the equation (1) with g(t) = t^γ and

    F(x,y,z,t) = \frac{\Gamma(1+2/\gamma)}{\Gamma(1-\beta+2/\gamma)} t^{2-\beta\gamma} - 2D\left( x^2 y^2 + y^2 z^2 + x^2 z^2 \right).

    Then, the solution of (1) has the form C(x,y,z,t) = C_0(x,y,z,t) = x^2 y^2 z^2 + t^2 for the case of σ = 1 and the first-type initial and boundary conditions derived from this exact solution.
    The system (3)-(5) was solved on 200 time steps of size τ = 0.0025, γ = 0.8, β = 0.8, a grid size of N × N × N, N = 40,...,80, and K = 16,...,64 on a single node of the SCIT-4 cluster of the V.M. Glushkov Institute of Cybernetics of NAS of Ukraine. The used node contains two Intel(R) Xeon(R) Bronze 3104 CPUs with a total of 12 cores and an NVidia RTX 2080 Ti GPU.
    The algorithms were implemented in both the CUDA and OpenCL languages and the corresponding source code can be accessed at https://github.com/sevaboh/FPDE_HPC.


5.1 Performance for a fixed N

The available set of data types for the CUDA implementation was double, float, half, double[2,4], float[2,4], and half2.
    In the case of OpenCL, we considered the double, float, double[2,4,8,16], and float[2,4,8,16] data types implemented in CUDA 10. The times spent on computation upon (6), (9), and (10) over 200 iterations for N = 40, K = 16 and all the considered combinations of languages and data types are given in Table 1. In all experiments, the thread block size was equal to K.
    The times spent on a single iteration are nearly equal for the approximating algorithm (9) and tend to increase linearly for the summation upon (6). This linear increase is non-monotone (Fig. 1) due to incomplete usage of GPU resources when t mod K ≠ 0, where t is the time step number.
    For the calculations upon (6) in the CUDA implementation, compared to the usage of a double-precision data type, the implementations that use data types of lower precision performed the computations ~2 times faster for the 32-bit single-precision data type and ~3 times faster for the 16-bit half-precision data type.
    The usage of the approximating algorithm (9) decreased the computation time by a factor of up to ~5 compared to the summation upon (6).
    Here, the usage of a single-precision data type decreased the time by less than 12% when the algorithm was implemented in CUDA.
    This can be explained by the high number of data type conversion operations in the implementations of (9) due to the storage of the solutions C and the coefficients S_n in the 64-bit double-precision data type.
    The usage of a half-precision data type in the CUDA implementation slowed the computations compared to the usage of a single-precision data type.
    However, the half-precision implementation that uses tensor cores was the fastest of all the considered variants. The usage of a single-precision data type in the OpenCL implementation slowed down the computations compared to the usage of a double-precision data type.

      Table 1. Times spent on computation upon (6), (9), and (10) for N = 40, K = 16

Data type      CUDA,         CUDA,           OpenCL,       OpenCL,
               summation     approximating   summation     approximating
               upon (6)      algorithm (9)   upon (6)      algorithm (9)
double           344.18          78.37         286.86          81.14
float            204.03          74.86          88.27          83.97
half             105.77          77.32             –              –
double2          350.24          81.41         287.11          81.90
float2           191.38          73.02          91.08          84.97
half2             98.14          74.70             –              –
double4          345.18          79.52         287.57          81.56
float4           182.00          70.19          90.22          84.87
double8             –               –          292.92          80.81
float8              –               –           90.50          85.18
double16            –               –          292.41          80.99
float16             –               –           90.76          84.38
Usage of tensor cores            69.49             –              –

   For K = 16, vectorization accelerated computations only for the case of a single-precision floating-point data type in the CUDA implementation.
   To explain the observed behavior, we compare the PTX assembly sources generated by the CUDA and OpenCL compilers with the generation of FMA (floating-point multiply-add) instructions enabled.
[Figure: per-iteration kernel execution time curves for the series GPU1d, GPU1f, GPU1h, GPU4d, GPU4f, GPU2d, GPU2f, and GPU2h over the 200 time steps.]
 Fig. 1. CUDA kernel execution time, ns, for the total summation upon (6), N = 40 , K = 16

  The following conclusions can be made from the performed comparison:

• both compilers perform automatic unwinding of 16 loop iterations;
• neither compiler generates vector arithmetic instructions when vector data types are used;
• both compilers generate vectorized 128-bit load/store instructions [ld,st].[shared,global].v2.f64 and [ld,st].[shared,global].v4.f32, and it is their usage that accelerates computations when vector data types are used (an explicit way to request such loads in CUDA source is sketched after this list);
• when both compilers perform loop unwinding, memory is addressed through fixed offsets from the base address for both shared and global memory (e.g. ld.global.f64 %fd77, [%rd23+40]). For the vectorized data types, offsets to global memory are computed for every memory access to the corresponding vector (e.g. mul.wide.s32 %rd28, %r67, 128; add.s64 %rd29, %rd3, %rd28; ld.global.v2.f64 {%fd172, %fd173}, [%rd29+112]; ld.global.v2.f64 {%fd176, %fd177}, [%rd29+96];...). For both algorithms (6) and (8), such additional operations slowed down the computations. However, an increase of K decreases the number of these auxiliary operations, allowing up to a 15% speed-up for the double16 data type in OpenCL;
• the CUDA compiler automatically vectorizes the reading of 32-bit floating-point data stored in local memory using the 64-bit instruction ld.shared.v2.f32, making the calculation upon (9) faster than in the case of double-precision data type usage. The absence of such automatic vectorization in OpenCL makes the single-precision OpenCL implementation perform slower than the double-precision one. The same applies to half-precision data type usage in CUDA, which yields slower code than for the single-precision data type;
• when performing the calculation upon (6), the CUDA compiler implements the division operation with the div.rn instruction, while the OpenCL compiler uses the approximate div.full instruction, making the OpenCL code faster.
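   For reference, the sketch below (illustrative names; the pointers are assumed to be 16-byte aligned) shows how such a 128-bit load can be requested explicitly in CUDA source by reading through a vector type, which compiles to the ld.global.v2.f64 instruction discussed above:

```cuda
// Two-wide dot product over K double values, e.g. between stored S coefficients
// and precomputed kernel factors; each iteration issues two ld.global.v2.f64 loads.
__device__ double dot_v2(const double* __restrict__ s, const double* __restrict__ k, int K) {
    double acc = 0.0;
    for (int n = 0; n < K; n += 2) {
        double2 sv = *reinterpret_cast<const double2*>(s + n);
        double2 kv = *reinterpret_cast<const double2*>(k + n);
        acc += sv.x * kv.x + sv.y * kv.y;
    }
    return acc;
}
```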
    To check how the tendencies of vectorization efficiency observed for K = 16 change for its bigger values, a series of computational experiments was conducted for N = 40 and K = 16,...,64 for the case of the approximating algorithm (9).
    Here, automatic loop unwinding leads to a decrease of the number of auxiliary operations with the increase of K. The computation time also decreases by up to 15% for double16 data type usage in OpenCL, up to 5.3% for double8 data type usage in OpenCL, up to 2.6% for float16 data type usage in OpenCL, and up to 19% for float4 data type usage in CUDA. The usage of the double[2,4] and float[2,8] data types in OpenCL slowed down the computations.
    Thus, while the OpenCL language supports larger vector data types than the CUDA language (16 elements compared to 4 elements), its inefficient compilation by the NVidia compiler leads to a situation in which OpenCL vectorized implementations of the algorithm (9) accelerate computations only for large K and the largest vector data types.


5.2 Performance in the case of variable grid size

The influence of the time of fractional derivative values calculation on the total simulation time, along with the speed-ups of the different GPU implementations, is given in Table 2 for the summation upon (6) and in Table 3 for the approximating algorithm (9) for N = 40,...,80.

 Table 2. Algorithms' relative characteristics for the summation upon (6), K = 16, 200 iterations

                             N                                         40          60          80
CPU                          Percentage of summation to total time   31.93%      36.64%      36.41%
CUDA float to CPU            Total time                              40.74%      43.58%      51.15%
                             Summation time                         939.46%    1374.70%    1568.14%
CUDA float                   Percentage of summation to total time    4.32%       3.57%       3.30%
                             Percentage of auxiliary host-GPU
                             operations                              28.62%      22.25%      22.22%
CUDA float4 to CUDA float    Summation time                           7.52%       5.40%       8.80%
OpenCL float to CUDA float   Summation time                          68.19%     104.21%      95.05%
   In the algorithm (6), the time spent on fractional derivative values computation (the summation time) comprises up to 36% of the total computation time on the CPU. This share lowers to ~4% on the GPU, where this operation becomes 9-30 times faster and the total computation time becomes ~50% smaller. As can be seen from Table 2, the GPU algorithms' efficiency increases with the increase of N, while the efficiency of vectorization remains on the same level.
   For the approximating algorithm (9) and K = 16, fractional derivative values computation comprises up to 40% of the total computation time while executing on the CPU, with a slight decrease of this percentage as N increases. The GPU efficiency here is close to the case of the summation upon (6): a 15-27 times acceleration of the calculations upon the algorithm (9), a 51-63% decrease of the total computation time compared to the CPU, and ~3% of time spent by the GPU algorithms on fractional derivative values computation. Compared to the slower algorithm (6), the share of time spent on auxiliary CPU-GPU memory copying operations while calling GPU kernels is larger here: ~60% compared to ~25%. The efficiency of vectorization does not depend on N, with the usage of tensor cores yielding the highest, 12%, increase compared with the usage of a half-precision data type and an 8% increase compared to the usage of a single-precision data type. The gain from tensor operations compared to the fastest case of the vectorized float4 implementation is, however, only 2%.

  Table 3. Algorithms' relative characteristics for the approximating algorithm (9), K = 16, average on 200 iterations

                                       N                                        40          60          80
CPU                                    Percentage of summation to total time  39.85%      38.04%      36.85%
CUDA float to CPU                      Total time                             61.19%      53.05%      55.29%
                                       Summation time                       1717.88%    2536.28%    2735.08%
CUDA float                             Percentage of summation to total time   3.53%       2.21%       2.02%
                                       Percentage of auxiliary host-GPU
                                       operations                             67.30%      61.63%      59.96%
CUDA float4 to CUDA float              Kernel execution time                   6.43%       6.40%       6.60%
OpenCL float to CUDA float             Kernel execution time                 -10.84%      -9.83%      -9.72%
CUDA half to CUDA float                Kernel execution time                  -3.27%      -3.31%      -3.44%
CUDA half2 to CUDA float               Kernel execution time                   0.21%       0.68%       0.69%
CUDA half, tensor cores, to CUDA float Kernel execution time                   7.53%       8.07%       8.24%
CUDA half, tensor cores, to CUDA half  Kernel execution time                  11.16%      11.77%      12.09%
5.3 Errors of solution

Changes in the maximal square absolute error over the iterations for the summation upon (6) and the series expansion upon (9) with different data types are given in Fig. 2. Lowering the data type precision has little influence on the quality of the solution for the summation upon (6) due to the monotone increase of b_s^{(i)} when i → s, which minimizes summation errors. The error for the algorithm (6) decreases with the increase of the time step number.
[Figure: error curves over the 200 time steps for the series: full summation (all data types), series expansion (float, double), series expansion (half), and series expansion (half, tensor cores).]



 Fig. 2. Maximal square absolute error on time steps while solving the problem using different
                                          data types

   Close values of the errors of the summation upon (6) and the approximating algorithm (9) with K = 16 on the initial time steps show the adequacy of the chosen value of K. However, for the algorithm (9), the recurrently updated values of S_{n,l} overflow the data type's value range starting from some time step, leading to a linear increase of the error. In the conducted experiments, such behavior was observed for the half- and single-precision data types.


6      Conclusions

Calculation of time-fractional ψ-Caputo derivatives' values significantly influences the time needed to solve the fractional differential equations that contain them. For the considered case of the diffusion equation solved using a finite-difference scheme, this operation comprises up to 40% of the time spent while performing the calculation on a CPU. Its parallelization on a GPU is efficient because the computations in the cells of the grid are independent.
    We studied additional possibilities to speed up GPU implementations, considering the summation algorithm of linear computational complexity and the approximating recurrent algorithm of constant complexity. Considering the usage of floating-point data types of different precision, their vector forms, and the matrix-matrix multiplication operations implemented in the tensor cores of the latest NVidia GPUs, the following conclusions can be made:

• for the summation algorithm (6), compared to the usage of a double-precision data type, the computations were ~2 times faster using the 32-bit single-precision data type and ~3 times faster using the 16-bit half-precision data type without significant loss of accuracy;
• for the approximating algorithm (9), which was up to 5 times faster than the summation algorithm, the usage of different data types only slightly influences the performance due to the high number of data type conversion operations. However, the usage of low-precision data types here leads to a linear increase of the solution error starting from some time step due to data type overflow;
• inefficient compilation of vectorized memory accesses in OpenCL code leads to a situation in which OpenCL vectorized implementations of the algorithm (9) accelerate computations only for a large number K of terms in the truncated series and the largest vector data types. Thus, the usage of CUDA is preferable for the approximating algorithm (9). However, for the summation algorithm (6), the OpenCL implementations were faster due to the differences in compiling the division arithmetic operation;
• GPU implementations accelerate fractional derivative values computation 9-30 times, with an efficiency that increases with the grid size;
• the usage of tensor cores that operate with the 16-bit half-precision floating-point data type allowed performing calculations 12% faster than the usage of a half-precision data type without vectorization. However, compared to the efficiency of the fastest vectorized implementation, which uses the float4 data type, only a 2% acceleration was obtained.


References
 1. Gorenflo, R., Mainardi, F.: Fractional calculus: integral and differential equations of frac-
    tional order. In: Fractals and Fractional Calculus in Continuum Mechanics, pp.223–276,
    Springer Verlag, Wien, Austria (1997)
 2. Podlubny, I.: Fractional differential equations, Academic Press, New York (1999)
 3. Bulavatsky, V.M.: Mathematical modeling of dynamics of the process of filtration convec-
    tion diffusion under the condition of time nonlocality. Journal of Automation and Infor-
    mation Science 44(4), 13–22 (2012). doi: 10.1615/JAutomatInfScien.v44.i4.20
 4. Bonchis, C., Kaslik, E., Rosu, F.: HPC optimal parallel communication algorithm for the
    simulation of fractional-order systems. J Supercomput 75, 1014–1025 (2019). doi:
    10.1007/s11227-018-2267-z
 5. Liu, J., Gong, C., Bao, W., Tang, G., Jiang, Y.: Solving the Caputo Fractional Reaction-
    Diffusion Equation on GPU. Discrete Dynamics in Nature and Society 2014, 820162
    (2014). doi: 10.1155/2014/820162
 6. Golev, A., Penev, A., Stefanova, K., Hristova, S.: Using GPU to speed up calculation of
    some approximate methods for fractional differential equations. International Journal of
    Pure and Applied Mathematics 119(3), 391-401 (2018). doi: 10.12732/ijpam.v119i3.1
 7. Gong, C., Bao, W., Liu, J.: A piecewise memory principle for fractional derivatives. Fract.
    Calc. Appl. Anal. 20(4), 1010–1022 (2017). doi: 10.1515/fca-2017-0052
 8. Ford, N.J., Simpson, A.C.: The numerical solution of fractional differential equations:
    Speed versus accuracy. Numerical Algorithms, 26(4), 333–346 (2001). doi:
    10.1023/A:1016601312158
 9. Baffet, D., Hesthaven, J.S.: A kernel compression scheme for fractional differential equa-
    tions. Siam J. Numer. Anal 55(2), 496–520 (2017). doi: 10.1137/15M1043960
10. Bogaenko, V.A., Bulavatsky, V.M.: Computer modeling of the dynamics of migration pro-
    cesses of soluble substances in the case of groundwater filtration with free surface on the
    base of the fractional derivative approach (in Russian). Dopov. Nac. Akad. Nauk Ukr. 12,
    21–29 (2018). doi: 10.15407/dopovidi2018.12.021
11. Almeida, R.: A Caputo fractional derivative of a function with respect to another function.
    Communications in Nonlinear Science and Numerical Simulation 44, 460–481 (2017). doi:
    10.1016/j.cnsns.2016.09.006
12. Samarskii A.: The Theory of Difference Schemes. CRC Press, New York (2001).
13. Bohaienko, V.O.: A fast finite-difference algorithm for solving space-fractional filtration
    equation with a generalised Caputo derivative. Computational and Applied Mathematics
    38(3), 105 (2019). doi: 10.1007/s40314-019-0878-5
14. Bohaienko V.O.: Efficient computation schemes for generalized two-dimensional time-
    fractional diffusion equation. In: Papers of the International scientific and practical confer-
    ence ІТСМ – 2019, Ivano-Frankivsk, Ukraine, pp. 238–241 (2019).
15. Bohaienko, V.O.: Performance of vectorized GPU-algorithm for computing ψ-Caputo de-
    rivative values. In: Hu Z., Petoukhov S., Dychka I., He M. (eds) Advances in Computer
    Science for Engineering and Education III. ICCSEEA 2020. Advances in Intelligent Sys-
    tems and Computing, vol 1247. Springer, Cham (2021). https://doi.org/10.1007/978-3-
    030-55506-1_24