<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computing ψ-Caputo Fractional Derivative Values Using CUDA 10</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>V.M. Glushkov Institute of cybernetics of NAS of Ukraine</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper addresses the issues of efficient GPU-implementation of ψ-Caputo fractional derivative values computation on NVIDIA GPU's with compute capability 7.5 using CUDA 10 SDK on both CUDA and OpenCL languages. We consider a three-dimensional time-fractional diffusion equation solved by a locally one-dimensional finite difference scheme. To compute nonlocal part of the derivative a rectangle rule quadrature is used and a summation algorithm of linear computational complexity is considered along with a constant complexity order approximating algorithm based on integral kernel expansion into series. For the approximating algorithm we present a computational scheme that uses NVidia GPU's tensor cores. For both algorithms, we study the influence of the used scalar and vector data types on performance and accuracy. Studying the summation algorithm, comparing to the usage of 64-bit doubleprecision floating-point data type, the computations were ~2 times faster for 32bit single-precision data type and ~3 times faster for 16-bit half-precision data type without significant loss of accuracy. For the approximated algorithm that was up to 5-times faster than the summation algorithm, the usage of lowprecision data types slightly influence the performance reducing the accuracy during long-term simulations. The usage of vectorized operations in the approximation algorithm allowed up to 6-19% speed-up compared with non-vectorized implementations for a single-precision data type. The usage of tensor cores that operate with a half-precision data type allowed performing calculations 12% faster compared to the case when the same data type was used.</p>
      </abstract>
      <kwd-group>
        <kwd>GPU algorithms</kwd>
        <kwd>finite-difference method</kwd>
        <kwd>diffusion equation</kwd>
        <kwd>ψ-Caputo fractional derivative</kwd>
        <kwd>tensor cores</kwd>
        <kwd>data types</kwd>
        <kwd>CUDA</kwd>
        <kwd>OpenCL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Memory effects in diffusion processes can be efficiently simulated using
timefractional differential equations [1-3].</p>
      <p>Such equations contain the so-called fractional derivatives that are
integraldifferential operators.</p>
      <p>The need to numerically calculate integrals while solving time-fractional
differential equations increase the computational complexity order compared to the traditional
differential equations.</p>
      <p>Approaches to lower computational complexity include parallel computing
techniques [4,5,6], particularly using GPUs [5,6], and approximation of integral kernels
[7,8,9].</p>
      <p>In this paper we consider a three-dimensional diffusion equation with the
generalized ψ-Caputo derivative with functional parameter solved by a locally
onedimensional finite-difference scheme similarly to [10].</p>
      <p>We focus on efficient GPU implementation of ψ-Caputo derivative’s values
computation, which is one the most time-consuming operation while performing
simulations, on NVidia GPUs using CUDA 10 SDK.</p>
      <p>
        We study GPU algorithms’ performance when scalar and vector data types of
different length and precision are used in CUDA and OpenCL algorithms’
implementations along with the possibility to speed up computations making use of 16x16
matrix-matrix multiplication operations that are hardware-implemented in the so-called
tensor cores of the NVidia GPUs with compute capabilities 7.x.
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Problem statement and finite-difference scheme</title>
      <p>We consider the following three-dimensional time-fractional diffusion equation [10]:
  2C( x, y, z,t)  2C( x, y, z,t)
σDt(,gβ)C( x, y, z,t) = D x 2 + y 2
+ F ( x, y, z,t), 0  x  1, 0  y  1, 0  z  1, t &gt; 0, 0 &lt; β  1
+  2C(xz,2y, z,t)  +
where C ( x, y, z,t) is the diffusive substance concentration, σ is the porosity, D is
the diffusion coefficient, F is the given source-term function, and Dt(,g ) is the
generalized ψ-Caputo fractional derivative with respect to the time variable t
functional parameter g (t) that has the following form [11]:
with the
Dt(,gβ)C(x, y, z,t) =</p>
      <p>1 t C(x, y, z,t)
(1 − β) 0 t</p>
      <p>(g(t) − g(τ))− β dτ ,
g(t)  C1 0, +) , g (t)  0 (t  0), g(0) = 0
.</p>
      <p>
        For the equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) we pose first-type initial and boundary conditions. The initial
boundary value problem for the equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) is discretized using locally
onedimensional finite-difference scheme [10,12] on the uniform grid
      </p>
      <p> = {(xi , y j , zk , tl ) : xi = ih1 (i = 0, n1 +1), y j = jh2 ( j = 0, n2 +1), zk = kh3 ( j = 0, n3 +1), tl = l
where h1 , h2 , h3 ,τ are the steps with respect to the space and the time variables.</p>
      <p>The finite-difference scheme, similarly to [10,12], is as follows:</p>
      <p>Ci(jlk+2 / 3) −  D
(Ci(,lj+−21,/k3) − 2Ci(jkl+2 / 3) + Ci(,lj++21,/k3) ) = C (l+1/ 3) + 
ijk</p>
      <p>Fij(kl) ,
Ci(jlk+1) −
(Ci(,lj+,1k)−1 − 2Ci(jlk+1) + Ci(,lj+,1k)+1 ) = C(l+2 / 3) +
ijk</p>
      <p>Fij(kl) ,
(Ci(−l1+,1j/,3k) − 2Ci(jkl+1/ 3) + Ci(+l1+,1j/,3k) ) = Ci(jkl) +</p>
      <p>Fij(kl) ,

3k1

3k1
3k1
Ci(jlk+1/ 3) −  D</p>
      <p>
        k1h12
k1h22
 D
k1h32
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
(5)
(6)
(7)
      </p>
      <p>,
Fij(kl) = Fijk +</p>
      <p>1 l−2 b(l) Ci(jks+1) − C(s)
(1 −  ) s=0 s </p>
      <p>ijk ,
ts+1
bs(i) =  (g(ti ) − g( ))− d .</p>
      <p>ts</p>
      <p>
        The equations (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )-(5) can be with the addition of discretized forms of boundary
conditions represented as three-diagonal linear equation systems, which are solved by
the Thomas algorithm [12].
      </p>
      <p>
        The summation in the right-hand side of (6) is the first-order discretization of a
non-local part of the fractional derivative obtained applying the rectangle quadrature
to the integral in (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ).
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Approximating algorithm for computing ψ-Caputo fractional derivative values</title>
      <p>Computation of non-local part of fractional derivative according to (6) has O(l)
computational complexity and requires storing all previously obtained solutions making it
time and memory consuming while performing simulations over large time intervals.
Hence, we consider the constant complexity order algorithm based on series
expansion of the integrals (7) [13, 14] that requires storing solutions on only the two last
time steps.</p>
      <p>When g (t) has an infinitely differentiated inverse function f (t) , b (i) can be
reps
resented in the form [13,14]</p>
      <p>  
b(j) =   (−1)n  −n g(t j )− β −n Sn ,
s</p>
      <p>n=0  
Sn (ts ,ts+1 ) =
 
 Bm
m=0
f (m+1) ( g (ts+1)) </p>
      <p>,
m! 
g(ts+1) g(ts+1)
Bm =  xn (x − g(ts+1))m dx, B0 =  xndx =
g(ts ) g(ts )</p>
      <p>1
n +1
Bi+1 = −
n + i + 2 </p>
      <p> Bi −
g(ts+1)(i +1) 
g(ts )n+1(g(ts ) − g(ts+1))i+1 </p>
      <p> ,
g(ts+1)(i +1) 
(g (ts+1)n+1 − g (ts )n+1),</p>
      <p>Truncating the infinite series, the summation in (6) using such representation can
be organized [14] as follows:
l−2 b(l) Ci(jks+1) − Ci(jks)  1 K  
s=0 s   n=0  (−1)n  −n  g(tl )− −n Sn,l−1 ,
Sn,l = Sn,l−1 + (C(l) − C(l−1) )Sn (tl−1 ,tl ), Sn,0 = 0.
(8)
where K is the given number of terms in series.</p>
      <p>
        So, to compute in a specific node (i, j, k ) the value of the term Fij(kl) , which is the
most time-consuming part when computing right-hand sides of three-diagonal linear
equations systems (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )-(5), we need to store the values of Sn,• and on the time step l
• update Sn,l−1 into Sn,l according to the second equation in (8);
• calculate the value of the sum according to the first equation in (8);
• substitute the obtained value of the sum into (6) getting the value of Fij(kl) .
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Parallel implementation</title>
      <p>
        We perform the solution of linear systems (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )-(5) using multithreaded OpenMP
implementation with the calculation of a non-local part of the ψ-Caputo fractional
derivative upon (6) or (8) performed for every node of the grid using GPU.
      </p>
      <p>The calculations upon the series approximating algorithm (8) are organized
following the scheme described in [15]:
• the values of Sn(tl −1 ,tl ) are calculated on CPU on the step l −1 in parallel with
GPU computations upon (8) and uploaded on the step l into GPU memory along
with the solution C (l−1) ;
• each GPU thread calculates the value of Fij(kl) for a specific node (i, j, k ) and, after
GPU performed the computations, values of Fij(kl) are copied back into CPU
memory.</p>
      <p>To use GPUs local memory [15], we form thread groups of a controllable size 
with computations organized in l /   substeps. On the substep s , the values of
Sn (tl−1, tl ), sl /    n  (s + 1)l /   are loaded into local memory and a
corresponding part of Sn,l is updated by each thread. Further, the resulting K -terms sum in (8)
is computed with (−1)n  −n g(tl−2 )− −n in advance calculated in parallel and loaded
into local memory.</p>
      <p>To make use of vector data types present in CUDA and OpenCL languages, the
summations in (8) are split into m-terms vector operations obtaining
l−2 b(l) Ci(jks+1) − Ci(jks)  1 K / m
s=0 s   n=0 (vn(1,m) ( Snm,l−1,..., S(n+1)m,l−1 )),
vn(1,m) =  (−1)nm  n−m  g(tl )− −nm ,..., (−1)(n+1)m  (n +−1)  m  g(tl )− −(n+1)m 
( Snm,l ,..., S(n+1)m,l ) = ( Snm,l−1,..., S(n+1)m,l−1 )+
+(C(l) − C(l−1) ) ( Snm (tl−1 ,tl ),..., S(n+1)m (tl−1 ,tl )).
(9)</p>
      <p>The performance of the algorithm (9) was studied for 16-component
doubleprecision vectors in [15].</p>
      <p>
        Continuing this work, we consider the calculation upon (9) and the saving of Sn,l
using floating-point data types of different precision (16, 32, and 64 bits) and different
vector sizes (
        <xref ref-type="bibr" rid="ref2 ref4">2, 4, 8, 16</xref>
        ).
      </p>
      <p>In all cases, we save the values of C and S n in a double-precision floating-point
format in both CPU and GPU memories. The summation upon (6) can also be
represented in vectorized form, and we consider it with the same data types as the
algorithm (9).</p>
      <p>As the newest NVidia GPUs implement 16x16 matrix-matrix (tensor)
multiplication operations in hardware in the so-called tensor cores, we transform the summation
in (8) to make use of them.</p>
      <p>
        The considered tensor operation Tma ( A, B, C) multiplies two 16x16 half-precision
matrices A , B and adds a 16x16 single-precision matrix C to the result of
multiplication. The result of this operation also has a single-precision data type. Considering
the vector of sums S (16) for 16 grid cells C(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) ,...,C(16) , the summation part of (9) for
K = 16 can be performed as follows:
Sn,l coefficient value for the grid cell i , and ()1 denotes the first column of the
matrix.
      </p>
      <p>The algorithm (10) makes use only of one matrix-vector multiplication of 16
performed in Tma operation.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Numerical experiments</title>
      <p>
        We consider the test problem for the equation (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) with g(t) = t γ and
      </p>
      <p>F (x, y,z,t) =</p>
      <p>Γ (1+ 2 / γ)
Γ (1 − β + 2 / γ)</p>
      <p>t 2−βγ − 2D(x 2 y 2 + y 2 z 2 + x 2 z 2 ) .</p>
      <p>
        Then, the solution of (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) has the form C(x, y, z,t) = C0 (x, y, z,t) = x 2 y 2 z 2 + t 2 for
the case of σ = 1 and the first-type initial and boundary conditions derived from
C(x, y, z,t) .
      </p>
      <p>
        The system (
        <xref ref-type="bibr" rid="ref3">3</xref>
        )-(5) was solved on 200 time steps of size τ = 0.0025 ,
γ = 0.8, β = 0.8 , grid size of N  N  N, N = 40,...,80 , and K = 16,...,64 on a single
node of SCIT-4 cluster of VM Glushkov Institute of Cybernetics of NAS of Ukraine.
The used node contains two Intel(R) Xeon(R) Bronze 3104 CPUs with total of 12
cores and NVidia RTX 2080 Ti GPU.
      </p>
      <p>The algorithms were implemented in both CUDA and OpenCL languages and the
corresponding source code can be accessed through
https://github.com/sevaboh/FPDE_HPC.</p>
      <sec id="sec-5-1">
        <title>5.1. Performance for a fixed N</title>
        <p>The available set of data types for CUDA implementation was double, float, half,
double[2,4], float[2,4], and half2.</p>
        <p>In the case of OpenCL we considered the implemented in CUDA10 double, float,
double[2,4,8,16], and float[2,4,8,16] data types. The times spent on computation
upon (6), (9), and (10) on 200 iterations for N = 40, K = 16 and all the considered
combinations of languages and data types are given in Table.1. In all experiments thread
block size  was equal to K .</p>
        <p>The times spent on a single iteration are near to equal for the approximating
algorithm (9) and tend to increase linearly for the summation upon (6). This linear
increase is non-monotone (Fig.1) due to incomplete GPU resources usage when
t%K  0 where t is the time step number.</p>
        <p>Doing calculations upon (6) in CUDA implementation, comparing to the usage of a
double-precision data type, the algorithms that were implemented using data types of
lower precision performed the computations ~2 times faster for 32-bit single-precision
data type and ~3 times faster for 16-bit half-precision data type.</p>
        <p>The usage of the approximating algorithm (9) up to ~5 times decreased the
computation time compared to the summation upon (6).</p>
        <p>Here the usage of a single-precision data type decreased the time only by &lt;12%
when the algorithm was implemented in CUDA.</p>
        <p>This can be explained by a high amount of data type conversion operations in the
implementations of (9) due to the storage of the solutions C and the coefficients S n
using 64-bit double-precision data type.</p>
        <p>The usage of a half-precision data type in CUDA implementation slowed the
computations compared to the usage of a single-precision data type.</p>
        <p>However, the half-precision implementation that uses tensor cores was the fastest
of all the considered variants. The usage of a single-precision data type in OpenCL
implementation slowed down the computations compared to the usage of a
doubleprecision data type.</p>
        <p>For K = 16 , vectorization accelerated computations only for the case of a
singleprecision floating-point data type in CUDA implementation.</p>
        <p>To explain the observed behavior, we compare PTX assembly sources generated
by CUDA and OpenCL compilers with the generation of FMA (floating-point
multiply-add) instruction enabled.</p>
        <p>The following conclusions can be made from the performed comparison:
• both compilers perform automatic unwinding of 16 loop iterations;
• both compilers do not generate vector arithmetic instructions when vector data
types are used;
• both compilers generate vectorized 128-bit load/store instructions
[ld,st].[shared,global].v2.f64 and [ld,st].[shared,global].v4.f32 and this is their
usage that accelerated computations when vector data types are used;
• when both compilers perform loop unwinding, memory is addressed through fixed
offsets from the base for both shared and global memory (e.g. ld.global.f64 %fd77,
[%rd23+40]). For the vectorized data types, offsets to global memory are
computed for every memory access to the corresponding vector (e.g. mul.wide.s32 %rd28,
%r67, 128; add.s64 %rd29, %rd3, %rd28; ld.global.v2.f64 {%fd172, %fd173},
[%rd29+112]; ld.global.v2.f64 {%fd176, %fd177}, [%rd29+96];...). For both
algorithms (6) and (8), such additional operations slowed down the computations.
However, the increase of K leads to the decrease of the number of auxiliary
operations allowing obtaining up to 15% speed-up for double16 data type usage in
OpenCL;
• CUDA compiler automatically vectorizes the reading of 32-bit floating-point data
stored in the local memory using 64-bit instruction ld.shared.v2.f32 making the
calculation upon (9) faster than in the case of double-precision data type usage.
The absence of such automatic vectorization in OpenCL makes the single-precision
OpenCL implementation to perform slower than in the case of a double-precision
data type. The same applies to half-precision data type usage in CUDA that yields
slower code than for the case of a single-precision data type;
•</p>
        <p>While performing calculation upon (6), CUDA compiler implements division
operation in div.rn instruction, while OpenCL compiler uses approximated div.full
instruction making OpenCL code faster.</p>
        <p>To check how the tendencies of vectorization efficiency observed for K = 16
changes for its bigger values, the series of computational experiments were conducted
for N = 40 and K = 16,...,64 for the case of the approximating algorithm (9).</p>
        <p>Here, automatic loop unwinding leads to the decrease of the number of auxiliary
operations with the increase of K . The computation time also decreases by up to
15% for double16 data type usage in OpenCL, up to 5.3% for double8 data type usage
in OpenCL, up to 2.6% for float16 data type usage in OpenCL, and up to 19% for
float4 data type usage in CUDA. The usage of double[2,4] and float[2,8] data types
in OpenCL slowed down the computations.</p>
        <p>Thus, while OpenCL language supports larger vector data types than CUDA
language (16 elements compared to 4 elements), its inefficient compilation by NVidia
compiler leads to the situation when OpenCL vectorized implementations of the
algorithm (9) allow accelerating computations only for large K and the largest vector
data types.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Performance in the case of variable grid size</title>
        <p>The influence of the time of fractional derivative values calculation on the total time
spent on simulation along with the speed-ups of different GPU implementations are
given in Table 2 for the summation upon (6) and in Table 3 for the approximating
algorithm (9) for N = 40, ... , 80 .</p>
        <sec id="sec-5-2-1">
          <title>CUDA float</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>CUDA float4 to CUDA float</title>
          <p>OpenCL float to
CUDA float</p>
          <p>N
Percentage of summation to
total time
Total time
Summation time
Percentage of summation to
total time
Percentage of auxiliary
hostGPU operations
Summation time
Summation time</p>
          <p>40
31,93%
40,74%
60</p>
          <p>80
36,64%</p>
          <p>36,41%
43,58%</p>
          <p>51,15%
939,46%</p>
          <p>1374,70% 1568,14%
4,32%
3,57%</p>
          <p>3,30%
28,62%
22,25%</p>
          <p>22,22%
7,52%
5,40%</p>
          <p>8,80%</p>
          <p>In the algorithm (6), time spent on fractional derivative values computation
(summation time) comprises up to 36% of total computation time on CPU. This value
lowers to ~4% on GPU when this operation became 9-30 times faster and the total
computation time became ~50% less. As can be seen from the Table 2, GPU algorithms’
efficiency increases with the increase of N while the efficiency of vectorization
remains on the same level. For the approximating algorithm (9) and K = 16 , fractional
derivative values computation comprises up to 40% of the total computation time
while executing on CPU with slight decrease of this percentage with the increase of
N . GPUs efficiency here is close to the case of the summation upon (6): 15-27 times
acceleration of the calculations upon the algorithm (9), 51-63% decrease of the total
computation time compared to CPU, ~3% of time spent by GPU algorithms for
fractional derivative values computation. Comparing to the slower algorithm (6), the time
spent on auxiliary CPU-GPU memory copying operations while calling GPU kernels
is here bigger: ~60% compared to ~25%. Efficiency of vectorization do not depend on
N with the usage of tensor cores yielding the highest 12% increase compared with
the usage of a half-precision data type and 8% increase compared to the usage of a
single-precision data type. Efficiency of the usage of tensor operations compared to
the fastest case of vectorized float4 implementation however is only 2%.
CCUUDDAA ffllooaatt4 to Kernel execution time
COUpeDnACLfloflaotat to Kernel execution time
fCloUaDtA half to CUDA Kernel execution time
CCUUDDAA fhlaolaft2 to Kernel execution time
cCoUreDsA,tohaClUf, DteAnsfolroat Kernel execution time
cCoUreDsAtohCaUlf,DteAnshoarlf Kernel execution time
40
60</p>
          <p>80
39,85%
38,04%</p>
          <p>36,85%
Changes in maximal square absolute error on the iterations while using summation
upon (6) and series expansion upon (9) for different data types are given in Fig. 2.
Lowering of data type precision has a little influence on the quality of solution for the
summation upon (6) due to monotone increase of b (i) when i → s that minimizes
s
summation errors. The error for the algorithm (6) decreases with the increase of time
step number.</p>
          <p>0,0018
0,0016
0,0014
0,0012
0,001
0,0008
0,0006
0,0004
0,0002
0
1 9 17 25 33 41 49 57 65 73 81 89 97 105 113 121 129 137 145 153 161 169 177 185 193
Full summation (all data types)
Series expansion (half, tensor cores)</p>
          <p>Series expansion (float,double)</p>
          <p>Series expansion (half)</p>
          <p>Close values of errors between the summation upon (6) and the approximating
algorithm (9) with K = 16 on the initial time steps show the adequacy of a chosen
value of K . However, for the algorithm (9), the recurrently changing values of Sn,l
overflow the data type values range starting from some time step leading to linear
increase of error. In the conducted experiments such behavior was observed for a
half- and single-precision data types.
6</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Calculation of time-fractional ψ-Caputo derivatives’ values significantly influence
time needed to solve fractional derivative equations that contain them. For the
considered case of the diffusion equation solved using a finite-difference scheme, this
operation comprises up to 40% of time spent while performing calculation on CPU. Its
parallelization on GPU is efficient because the computations on the cells of the grid are
independent.</p>
      <p>
        We studied additional possibilities to speedup GPU implementations considering
the summation algorithm of linear computational complexity and the approximating
recurrent algorithm of constant complexity. Considering the usage of different
precision floating-point data types, their vectors, and matrix-matrix multiplication
operations implemented in tensor cores of the latest NVidia GPUs the following conclusion
can be made:
• for the summation algorithm (6), comparing to the usage of a double-precision data
type, the computations were ~2 times faster using 32-bit single-precision data type
and ~3 times faster using 16-bit half-precision data type without significant loss of
accuracy;
• for the approximating algorithm (9) that was up to 5-times faster than the
summation algorithm, the usage of different data types slightly influence the performance
due to the high amount of data type conversion operations. However, the usage of
low-precision data types here leads to linear increase of solution error starting from
some time step due to data type overflow;
• Inefficient compilation of vectorized memory access in OpenCL code leads to the
situation when OpenCL vectorized implementations of the algorithm (9) allow
accelerating computations only for large number K of terms in truncated series and
the largest vector data types. Thus, the usage of CUDA is preferable for the
approximating algorithm (9). However, for the summation algorithm (6), OpenCL
implementations were faster due to the differences in compiling the division
arithmetic operation;
• GPU implementations 9-30 times accelerate fractional derivative values
computation with the efficiency that increases with the increase of grid size;
• the usage of tensor cores that operate with 16-bit half-precision floating-point data
type allowed performing calculations 12% faster than the usage of a half-precision
data type without vectorization. However, comparing to the efficiency of the
fastest vectorized implementation that uses float4 data type, only 2% acceleration was
obtained.
5. Liu, J., Gong, C., Bao, W., Tang, G., Jiang, Y.: Solving the Caputo Fractional
ReactionDiffusion Equation on GPU. Discrete Dynamics in Nature and Society 2014, 820162
(2014). doi: 10.1155/2014/820162
6. Golev, A., Penev, A., Stefanova, K., Hristova, S.: Using GPU to speed up calculation of
some approximate methods for fractional differential equations. International Journal of
Pure and Applied Mathematics 119(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 391-401 (2018). doi: 10.12732/ijpam.v119i3.1
7. Gong, C., Bao, W., Liu, J.: A piecewise memory principle for fractional derivatives. Fract.
      </p>
      <p>
        Calc. Appl. Anal. 20(
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), 1010–1022 (2017). doi: 10.1515/fca-2017-0052
8. Ford, N.J., Simpson, A.C.: The numerical solution of fractional differential equations:
Speed versus accuracy. Numerical Algorithms, 26(
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), 333–346 (2001). doi:
10.1023/A:1016601312158
9. Baffet, D., Hesthaven, J.S.: A kernel compression scheme for fractional differential
equations. Siam J. Numer. Anal 55(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), 496–520 (2017). doi: 10.1137/15M1043960
10. Bogaenko, V.A., Bulavatsky V.М.: Computer modeling of the dynamics of migration
processes of soluble substances in the case of groundwater filtration with free surface on the
base of the fractional derivative approach (in Russian). Dopov. Nac. Akad. Nauk Ukr. 12,
21–29 (2018). doi: 10.15407/dopovidi2018.12.021
11. Almeida, R.: A Caputo fractional derivative of a function with respect to another function.
      </p>
      <p>
        Communications in Nonlinear Science and Numerical Simulation 44, 460–481 (2017). doi:
10.1016/j.cnsns.2016.09.006
12. Samarskii A.: The Theory of Difference Schemes. CRC Press, New York (2001).
13. Bohaienko, V.O.: A fast finite-difference algorithm for solving space-fractional filtration
equation with a generalised Caputo derivative. Computational and Applied Mathematics
38(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 105 (2019). doi: 10.1007/s40314-019-0878-5
14. Bohaienko V.O.: Efficient computation schemes for generalized two-dimensional
timefractional diffusion equation. In: Papers of the International scientific and practical
conference ІТСМ – 2019, Ivano-Frankivsk, Ukraine, pp. 238–241 (2019).
15. Bohaienko, V.O.: Performance of vectorized GPU-algorithm for computing ψ-Caputo
derivative values. In: Hu Z., Petoukhov S., Dychka I., He M. (eds) Advances in Computer
Science for Engineering and Education III. ICCSEEA 2020. Advances in Intelligent
Systems and Computing, vol 1247. Springer, Cham (2021).
https://doi.org/10.1007/978-3030-55506-1_24
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gorenflo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mainardi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Fractional calculus: integral and differential equations of fractional order</article-title>
          .
          <source>In: Fractals and Fractional Calculus in Continuum Mechanics</source>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>276</lpage>
          , Springer Verlag, Wien, Austria (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Podlubny</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Fractional differential equations</article-title>
          , Academic Press, New York (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bulavatsky</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          :
          <article-title>Mathematical modeling of dynamics of the process of filtration convection diffusion under the condition of time nonlocality</article-title>
          .
          <source>Journal of Automation and Information Science</source>
          <volume>44</volume>
          (
          <issue>4</issue>
          ),
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2012</year>
          ). doi:
          <volume>10</volume>
          .1615/JAutomatInfScien.v44.
          <year>i4</year>
          .
          <fpage>20</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bonchis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaslik</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>HPC optimal parallel communication algorithm for the simulation of fractional-order systems</article-title>
          .
          <source>J Supercomput</source>
          <volume>75</volume>
          ,
          <fpage>1014</fpage>
          -
          <lpage>1025</lpage>
          (
          <year>2019</year>
          ). doi:
          <volume>10</volume>
          .1007/s11227-018-2267-z
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>