    GPU computations and memory access model
              based on Petri nets*

            Anna Gogolińska, Łukasz Mikulski, and Marcin Piątkowski

                    Faculty of Mathematics and Computer Science,
                    Nicolaus Copernicus University, Toruń, Poland
       {anna.gogolinska,lukasz.mikulski,marcin.piatkowski}@mat.umk.pl



        Abstract. In modern systems CPUs as well as GPUs are equipped with
        multi-level memory architectures, where different levels of the hierarchy
        vary in latency and capacity. Therefore, various memory access models
        have been studied. Such a model can be seen as an interface abstracting the
        user from the physical architecture details. In this paper we present a
        general and uniform GPU computation and memory access model based
        on bounded inhibitor Petri nets (PNs). Its effectiveness is demonstrated
        by comparing its results to practical computational experiments
        performed on an Nvidia GPU with the CUDA architecture.
        Our PN model is consistent with the workflow of multithreaded GPU
        streaming multiprocessors. It models the selection and execution of
        instructions for each warp. The three types of instructions included in the
        model are: arithmetic operations, accesses to the shared memory
        and accesses to the global memory. For a given algorithm the model
        allows one to check how efficient the parallelization is, and whether a
        different organization of threads would improve performance.
        The accuracy of our model was tested with different kernels. As
        preliminary experiments we used the matrix multiplication program and
        the stability example created by Nvidia, and as the main experiment a binary
        version of the least significant digit radix sort algorithm. We created three
        implementations of the algorithm using the CUDA architecture, differing
        in the usage of shared and global memory as well as in the organization of
        calculations. For each implementation the PN model was used, and the
        results of the experiments are presented in this paper.

        Keywords: Petri nets, CUDA architecture, GPU, memory model


1     Introduction

The inter-process communication over a common part of the memory shared by
processes is a usual performance bottleneck in multiprocessor environments. In
modern systems CPUs as well as GPUs are equipped with multi-level memory
architectures, where different levels of the hierarchy vary in latency and capacity.
*
    This research has been supported by the Polish National Science Center through
    grant No. 2013/09/D/ST6/03928.



By considering different local views of the processes on the common part of the
memory one can try to improve the processor utilization. Therefore, various
memory access models have been studied, see for instance [8, 11, 15]. Such a model
can be seen as an interface abstracting the user from the physical architecture
details. It allows one to specify, without a reference to processors, the local views
that are possible in concurrent task executions and to maintain their consistency.
     Another important issue is the distribution of tasks between threads and CPU/GPU
cores and the instruction scheduling, which can have a significant impact on
efficiency. Consider for instance running three threads on a single processor, as
depicted in Figure 1. Each of them performs a list of arithmetic operations
interleaved with memory reads/writes, each consisting of a short preprocessing and
then a longer period of waiting for the memory access (which is a usual situation
in parallel computing). In the initial part of the computation each thread realizes
its preprocessing for the memory access and then starts to wait for the access
itself. This causes a processor idle period during which all threads are waiting
(marked as I). After that the arithmetic operations of all threads are executed
simultaneously. They can be scheduled in such a way that one thread waits for
the memory access while other threads perform their computations and there is
no idle period (marked as II). Thanks to that, the waiting period of a thread can
be hidden behind the active computations of other threads.



       [Figure 1: timelines of the threads t1, t2 and t3, distinguishing memory
       access periods from arithmetic operations – see the caption below.]

Fig. 1. The example run of the threads t1 , t2 and t3 on a single processor. The area
marked with (I) represents the idle period when no computation is done, and the area
marked with (II) represents the period when the waiting for the memory access is
hidden behind arithmetic operations.



    The main contribution of this paper is a general and uniform GPU computation
and memory access model based on bounded Petri nets [14], together with an
application simulating its execution. For a given algorithm pseudocode (or
source code) one can use our model to count the number of operations performed
(taking into account their simultaneous execution). The main advantage of the
model over basic arithmetic counting and the other models described below is
the possibility of reorganizing parts of the code, the potential ability to
predict the duration of GPU calculations and to handle the aforementioned



computation scheduling. Such an approach might be used to improve the organization
of the algorithm code and to partition the computation tasks between the threads
so as to maximize efficiency.
    The effectiveness of our model is demonstrated by comparing its results to
practical computational experiments performed with the usage of an Nvidia GPU.
We study the impact of various parameters, such as the number of concurrent
processes and the level of memory used (shared memory vs. global memory), on the
complexity. The successful application of our model is discussed in detail, as a
proof of concept, on a single example: the least significant digit radix sort
algorithm. We do not discuss methods of parallelization, nor the most efficient
way of using different types of memory; those problems are very complex and
beyond the scope of this paper. More about those topics can be found for instance
in [6, 20].
    Our PN model is not the only GPU efficiency model. There are quite
a few other GPU performance models, which can be divided into two groups
(according to [12]):
    (1) Calibrated performance models that make specific predictions and include
many low-level details. They usually contain many specialized parameters,
some of which may be difficult to obtain or calculate. The first example is the
model presented in [7]. It is the first analytical model of GPU efficiency.
The most important values used there are MWP (the number of memory requests
that can be executed concurrently) and CWP (the number of warps that can be
computed while one warp is waiting for memory values). The model consists of
14 equations, which contain 21 parameters. Another example is the model presented
in [19]. Its authors created not only a GPU performance model but also a power
consumption model. In the performance part they estimate the execution time for
individual GPU architecture components (e.g. shared and global memory) separately.
This way they easily identify potential performance bottlenecks. The model has
the form of equations and contains 32 parameters.
    (2) Asymptotic models for algorithm analysis at a high level of abstraction
that capture only the essential features of the GPU architecture. One of the models
from this group is described in [13]. It uses elements of the PRAM, BSP
and QRQW approaches. The model calculates the number of cycles required for
the whole kernel, taking into account the number of blocks, warps and threads,
the maximal number of cycles required by a single thread to perform its
calculations and the number of threads which can be executed in parallel. However,
the model is quite simple and some important elements are neglected (like hiding
the memory latency behind computations). Another asymptotic model is the Thread
Multi-Core Memory (TMM) model presented in [11]. In the TMM, GPUs are
presented as abstract core groups, each containing a number of cores and
fast local memory. A large and slow global memory is shared by all cores. The
running time of an algorithm is calculated from an equation involving basic
GPU parameters. One of the most popular GPU efficiency models is the
roofline model introduced in [21]. It can be used to obtain performance estimates
of GPU computations, and requires two parameters: the number of operations



performed by a kernel and the number of bytes transferred from/to the memory,
which can be used to calculate the arithmetic intensity I. In the naive roofline
model I can be handily presented as a point in a two-dimensional space restricted
by two ceiling lines: the memory bandwidth and the processor's peak performance.
The resulting performance estimate is the bound under which the arithmetic
intensity falls: either the memory bandwidth bound or the peak performance bound.
In extended versions of the model additional ceilings can be added; they relate,
for example, to software prefetching or task-level parallelism. The roofline model
can determine the type of kernel limitation and show how close to optimal the
program is. However, such a simple arithmetic calculation cannot precisely
represent the complex processes of kernel execution, like for example hiding the
waiting periods behind calculations (see Figure 1). A similar problem occurs in
other purely arithmetical models. More complex tools are necessary for such a
purpose. Our model belongs to the asymptotic group, and to the best of our
knowledge it is the first one utilizing Petri nets.
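
For illustration, the naive roofline bound reduces to a single expression; the
sketch below uses purely hypothetical device parameters and kernel counters:

    #include <algorithm>
    #include <cstdio>

    // Naive roofline bound: attainable performance is limited either by the peak
    // compute throughput or by the memory bandwidth times the arithmetic intensity.
    double rooflineBound(double peakGflops, double bandwidthGBs, double intensity) {
        return std::min(peakGflops, bandwidthGBs * intensity);
    }

    int main() {
        double flops = 2.0e9, bytes = 1.0e9;   // hypothetical kernel counters
        double I = flops / bytes;              // arithmetic intensity [FLOP/byte]
        std::printf("bound = %.1f GFLOP/s\n", rooflineBound(4000.0, 300.0, I));
        return 0;
    }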
    The paper is organized as follows. In the next section we describe Graphical
Processing Units and CUDA Toolkit, focusing in particular on memory types
(and organization). In Section 3 we recall some standard notions and notations
related to Petri nets. Then we introduce our memory access and computation
model followed by the description of radix sort algorithm. We also present the
results of experiments conducted on the base of our implementations of the
radix sort algorithm. We conclude the paper and give some directions of further
research in the last section.


2     CUDA

An intensive development of Graphical Processing Units (GPUs in short) resulted
in the construction of high performance computational devices, which besides the
graphical display management also allow the execution of parallel general purpose
algorithms (not necessarily related to computer graphics). As opposed to a CPU,
consisting of a few cores optimized for sequential processing, a GPU contains
thousands of smaller, more efficient cores designed for handling multiple tasks
simultaneously. The most popular ones are GPUs produced by Nvidia Corporation,
supplied with the Nvidia CUDA Toolkit (see [2]).
    A program running in a heterogeneous environment equipped with a GPU can be
split into so-called host parts, which are executed by the CPU, and so-called kernel
parts, which are executed by the GPU. The host part specifies the kernel execution
context and manages the data transfer between the host and the GPU memory.
The kernel functions create a large number of threads allowing highly parallel
computation (where each thread runs the same kernel code). GPU threads, as
opposed to CPU threads, are much lighter, hence their creation and controlling
requires fewer CPU cycles.
    All threads executed on GPU are organized into several equally-sized thread
blocks, which in turn are organized into a grid.
        Gogolińska et.al.: GPU Computations and Memory Access Model . . .    109



    A thread block is the set of concurrently executed threads. Such an execution
and the thread cooperation can be coordinated by a barrier synchronization.
Moreover, the data can be exchanged between threads using a shared memory.
The size of a block is limited by the capacity of resources accessible on a single
processor core. On currently available GPUs a single block can contain up to
1024 threads. Each thread block within a grid is uniquely identified by its block
ID, and each thread within a block – by its thread ID.
    A grid is an array of thread blocks executing the same kernel. The number
of blocks in a grid is specified by the amount of data to be processed and the
number of available processors. The exchange of data between threads within a
grid requires the usage of the global memory. Moreover, while all threads within
a single block run simultaneously, different blocks can be executed in any order.
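
As a minimal illustration of this organization (a sketch, not code taken from the
kernels discussed later), the kernel below derives a global index from its block ID
and thread ID, and the host launches it on a grid whose size depends on the amount
of data to be processed:

    #include <cuda_runtime.h>

    // Each thread processes one element; its global index is derived
    // from the block ID and the thread ID within the block.
    __global__ void scaleKernel(float* data, float factor, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc((void**)&d, n * sizeof(float));
        int threadsPerBlock = 256;                            // at most 1024
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scaleKernel<<<blocks, threadsPerBlock>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }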
    The physical Nvidia GPU architecture consists of several multithreaded
streaming multiprocessors (SMs in short). Thread blocks are distributed between
SMs in such a way that all threads within a single block are concurrently executed
on the same SM (different blocks may, but do not necessarily have to, be executed
on the same SM). A streaming multiprocessor organizes threads into so-called
warps consisting of 32 threads each. The partition is done according to increasing
thread ID. Each SM works according to the SIMT (single-instruction, multiple-threads)
architecture, which means that all threads within a single warp execute one common
instruction at a time. Any divergence (e.g. caused by the data being processed)
leads to a serial execution of the individual computation paths until their possible
convergence to the same execution path.
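
For example, in the illustrative kernel below the data-dependent branch may split a
warp into two execution paths that are processed serially:

    // Threads of one warp may disagree on the condition below, so the two
    // branches are serialized within the warp (SIMT divergence).
    __global__ void divergentKernel(const int* in, int* out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;
        if (in[idx] % 2 == 0)
            out[idx] = in[idx] / 2;
        else
            out[idx] = 3 * in[idx] + 1;
    }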
    A single SM serves multiple warps. It is equipped with a number of warp
schedulers and instruction dispatch units (currently 2 or 4, depending on the
device used). A scheduler selects a warp which is ready to be executed and
issues it to the physical cores of the GPU. If the currently active warp needs to
wait for a memory read/write operation, it is replaced by another ready warp.
While the replaced warp is waiting for the memory access, other warps perform
their computations, so the SM is kept busy as much as possible. The waiting
period of a single warp is hidden behind the computations of the others (provided
there are enough active warps available) and is not seen outside the SM. The
simulation of such a behavior is the main part of our model.
    The above mentioned heterogeneous environment is equipped with the host
memory managed by the CPU and the GPU device memory. The latter is significantly
more complex than the former. Due to a necessary compromise between the data
transfer/access speed and the possible capacity, the GPU device memory consists
of various types of data storage, such as global, constant, local and shared memory.
    The global memory is the largest and at the same time the slowest type of
GPU memory (with a latency of hundreds of cycles). Together with the constant
memory it is the only type of GPU memory that can be accessed by the host. It is
available for reading and writing for all running threads; however, the data
exchange and result sharing are possible only after a kernel-wide global
synchronization.



    The content of the global memory is accessed in blocks of size 32, 64 or 128
bytes (depending on the device used). Every time an element within a block is
accessed, the whole block has to be transferred. Therefore, the concurrent (among
the threads within a single warp) global memory read and write operations are
grouped into transactions, the number of which depends on the number of cache
lines required to serve all threads within a warp. However, if different threads in
a warp refer to different memory blocks, all such blocks have to be transferred
to the cache sequentially. Such a situation causes repeated global
memory accesses.
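
The two illustrative kernels below sketch the difference (the exact block and
transaction sizes depend on the device): in the first one consecutive threads of a
warp access consecutive addresses, so the warp is served by a few transactions; in
the second one a large stride forces the threads of a warp into different memory
blocks, which have to be transferred separately:

    // Consecutive threads read consecutive addresses: the accesses of a warp
    // fall into few memory blocks and are coalesced into few transactions.
    __global__ void coalescedCopy(const float* in, float* out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) out[idx] = in[idx];
    }

    // Each thread of a warp reads from a different memory block (large stride),
    // so the blocks have to be transferred one after another.
    __global__ void stridedCopy(const float* in, float* out, int n, int stride) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) out[idx] = in[(1LL * idx * stride) % n];
    }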
    The shared memory is a fast memory physically placed inside a multiprocessor.
It consists of blocks, each of which is available to all threads within a single
thread block. Moreover, each such block is divided into several so-called memory
banks. The access of different threads to different banks is realized simultaneously,
while the access of different threads to the same bank is realized sequentially.
The latter situation is called a bank conflict. The shared memory can be used for
data exchange between threads within the same thread block after a block-wide
thread synchronization.
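
A typical usage pattern is sketched below (an illustration only, assuming that the
block size does not exceed 256 and divides the data size): the threads of a block
cooperatively fill a tile in the shared memory, synchronize, and then reuse it;
consecutive threads touching consecutive words also fall into different banks, so
no bank conflict arises:

    // Reverses the elements within each block. Assumes that the number of
    // elements is a multiple of blockDim.x and that blockDim.x <= 256.
    __global__ void reverseBlock(const float* in, float* out) {
        __shared__ float tile[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = in[idx];     // consecutive threads -> consecutive banks
        __syncthreads();                 // block-wide synchronization before reuse
        out[idx] = tile[blockDim.x - 1 - threadIdx.x];
    }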
    The local memory of a single thread consists of a number of registers, which
are the fastest type of memory available (with almost negligible access time).
It is used to store local thread variables. Due to the large number of threads the
capacity of each thread's local memory is strongly limited.
    To complete the picture we have to mention also the constant and texture
memory – dedicated parts of the GPU device memory (usually buffered). Both
of them are optimized for access speed within the device, but are available for
threads in read-only mode. Neither of them is considered in our model.


3     Petri Nets
The set of non-negative integers is denoted by IN. Given a set X, the cardinality
(number of elements) of X is denoted by |X|, and the powerset (set of all subsets)
by 2^X – the cardinality of the powerset is 2^{|X|}. Multisets over X are members
of IN^X, i.e., functions from X into IN. For convenience and readability, if the set
X is finite, multisets in IN^X will be represented by vectors of IN^{|X|} (assuming
a fixed ordering of the set X). The addition and the partial order ≤ on IN^X are
understood componentwise, while < means ≤ and ≠.
    Let us now recall basic definitions and facts concerning inhibitor Petri nets [1,
16].


Definition 1. An inhibitor1 place/transition net (p/t-net) is a quintuple S =
(P, T, W, I, M0 ), where:
 – P and T are finite disjoint sets, of places and transitions (actions), respec-
   tively;
 – W : P × T ∪ T × P → IN is an arc weight function;
 – I ⊆ P × T is an inhibition relation;
 – M0 ∈ IN^P is a multiset of places, named the initial marking.

1
    Note that in the case of bounded nets the use of inhibitors is not necessary; one can
    provide an equivalent net (with a more complex structure) without inhibitors.

For all a ∈ T we use the following denotations:

    •a = {p ∈ P | W(p, a) > 0} – the set of entries to a
    a• = {p ∈ P | W(a, p) > 0} – the set of exits from a
    ◦a = {p ∈ P | (p, a) ∈ I} – the set of inhibitor places for a.

    Petri nets admit a natural graphical representation. Nodes represent places
and transitions: places are indicated by circles, and transitions by boxes. Arcs
with classical arrowheads represent the weight function, while arcs with small
circles as arrowheads represent the inhibition relation. Markings are depicted by
tokens inside the circles; the capacity of places is unlimited. However, the Petri
nets used in our model are bounded (which means that there exists a common bound
for the number of tokens appearing in a single place during the computation).
    The set of all finite strings of transitions is denoted by T∗, the empty string
is denoted by ε, the length of w ∈ T∗ is denoted by |w|, and the number of
occurrences of a transition a in a string w is denoted by |w|_a.
    Multisets of places are called markings. In the context of p/t-nets, they are
typically represented by nonnegative integer vectors of dimension |P|, assuming
that P is totally ordered.
    A transition a ∈ T is enabled at a marking M whenever •a ≤ M (all its
entries are marked) and ∀p∈◦a M(p) = 0 (all inhibitor places are empty). If a is
enabled at M, then it can be executed. A marking M is called a dead marking if
no transition is enabled at M (which means that ∀a∈T ∃p∈P W(p, a) > M(p) ∨
((p, a) ∈ I ∧ M(p) > 0)). The execution of an enabled transition a is not forced
and changes the current marking M to the new marking M′ = (M − •a) + a•
(tokens are removed from entries, then put to exits). We shall write M a for
the predicate “a is enabled at M” and M a M′ for the predicate “a is enabled at
M and M′ is the resulting marking”.
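
The enabledness test and the firing rule can be written down directly; the sketch
below assumes a simple vector representation of markings and of the sets •a, a• and
◦a (illustrative types only, not part of our implementation):

    #include <vector>

    using Marking = std::vector<int>;   // one entry per place

    struct Transition {
        std::vector<int> pre;           // •a : arc weights W(p, a)
        std::vector<int> post;          // a• : arc weights W(a, p)
        std::vector<int> inhib;         // ◦a : 1 if (p, a) is an inhibitor arc
    };

    // a is enabled at M iff •a <= M and every inhibitor place of a is empty.
    bool enabled(const Marking& M, const Transition& a) {
        for (size_t p = 0; p < M.size(); ++p)
            if (M[p] < a.pre[p] || (a.inhib[p] && M[p] > 0)) return false;
        return true;
    }

    // Firing an enabled transition: M' = (M - •a) + a•.
    Marking fire(Marking M, const Transition& a) {
        for (size_t p = 0; p < M.size(); ++p) M[p] = M[p] - a.pre[p] + a.post[p];
        return M;
    }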
    In this paper, however, we use the maximal concurrent semantics: in every
marking we execute one of the maximal sets of enabled transitions (i.e. a step
which is maximally concurrent at this marking). Formally, a set of transitions
A ⊆ T is called a step and is enabled if ∑_{a∈A} •a ≤ M and ∀p∈(∪_{a∈A} ◦a) M(p) = 0.
The execution of a step A changes the current marking M to the new marking
M′ = M − ∑_{a∈A} •a + ∑_{a∈A} a•. We say that a step A is maximally concurrent
at marking M if A is enabled at M and for every a ∉ A the set A ∪ {a} is not
enabled at M.
    The notions of enabledness and execution extend, in a natural way, to
strings of steps (computations): the empty string ε is enabled at any marking,
and a string w = Av is enabled at a marking M whenever M A M′ and v is enabled
at M′. The predicates M A, M w, M A M′ and M w M′ are defined as for single
transitions.



4     Memory access and computations model
The Petri net model of GPU calculations is consistent with the workflow of
multithreaded streaming multiprocessors (SMs). The model represents the way
one SM operates. It models each warp assigned to the SM, the selection of the
next instruction for the SM, accesses to the global and shared memory, and
arithmetic operations. It does not represent the thread hierarchy (blocks, grid),
repeated accesses to the global memory, nor bank conflicts. The model allows
simultaneous accesses to the global memory, but the number of warps that can
use the global memory at the same time is limited by the number of SM warp
schedulers (2 or 4). Each element of the model was created based on [3, 4].
      The size of the considered Petri net depends on the number of warps. Two
elements: the place p0 – SM and the transition t0 – waiting are the constant
part of the model; the others are generated for every warp. The place p0 represents
the streaming multiprocessor and its initial marking should correspond to the
number of warp schedulers. For modern graphics cards it should be 2 or 4.
This place is connected by a loop with the transition t0. The transition t0 can
be executed only when the warp schedulers cannot schedule any instruction, i.e.
there is no instruction ready to be executed. The waiting transition is added to
the model to gain control over the number of computation steps during which the
SM is idle. The minimization of that number is crucial for the optimization of GPU
programming. Besides those two elements, the PN model contains 18 places and
17 transitions for every warp required by the analysed algorithm. That part is
called a warp part of the PN model (WPNM). The detailed description of the most
important places and transitions of the WPNM is presented in Table 1.
      The WPNM together with the SM place and the waiting transition are depicted in
Figure 2. The place p1 represents the activation of the warp. It is marked when
the warp is active. The place p2 is marked when the warp finishes the execution
of its previous instruction. The warp can be scheduled for execution only
when the places p2 and p4 are marked. A token in p4 means that the next
instruction is selected and ready. The places from p3 to p10 and the transitions
t2, t3, t4 (frame I in Figure 2) are responsible for controlling and selecting
the instructions. Three types of instructions are allowed in the model: an
arithmetic calculation, an access to the shared memory and an access to the
global memory. The initial marking of the place p5 corresponds to the number
xa of arithmetic operations required by the analyzed algorithm. Similarly, the
initial markings of the places p7 and p9 represent respectively the numbers xs
and xg of read/write operations from/to the shared and global memory. If the
place p3 is marked and at least one of the places p5, p7, p9 is not empty, the
next instruction can be selected. The selection is random. When the instruction
is chosen, according to its type (arithmetic calculation, shared memory access,
global memory access) the marking of the corresponding place is decreased and
a token is added to p4. Moreover, (according to the instruction type) one of
the places p6, p8 or p10 is marked. Now the selected instruction is ready to be
executed and the warp can be processed by the SM. The execution of the instruction



is represented by the three parts of the net marked as frame II, frame III and
frame IV (see Figure 2). They correspond to the type of the selected instruction:
frame II to an arithmetic calculation, frame III to an access to the shared memory
and frame IV to an access to the global memory. The arithmetic calculation is
simply represented by one place and two transitions. When the calculation is
done, tokens are put in the places p2 and p3 (which means that the instruction is
finished and the next one can be selected), and in the place p0 (which means
that the SM is ready to execute the next instruction). The same situation is
obtained when the access to the memory (shared or global) is finished, but those
parts of the model contain more places and transitions. Those additional elements
are used to model the memory access latencies. The shared memory latency and the
global memory latency are parameters of the model, denoted by l1 for the global
memory and by l2 for the shared memory. The transition t9 (t12, respectively)
may be executed only after l1 (l2, respectively) executions of t8 (t17,
respectively). For testing, their default values were 20 for l1 and 2 for l2, which
is consistent with [4].
     The transition t16 is connected by inhibitor arcs with the places p5, p7 and p9
(frame I in Figure 2) and can be executed only when those places are empty,
i.e. there is no instruction left for execution. The transition is also connected by
a regular arc with p2. Moreover, t16 is the only transition able to take the token
from the place p1 and its execution is equivalent to the termination of a given warp.
    As mentioned above, the WPNM (places from p1 to p18 and transitions from t1
to t17) is generated for a single warp. In the case of multiple warps a separate
WPNM should be generated for each of them. To distinguish between distinct WPNMs
one can either increase the numbering of places and transitions accordingly or
assign to them two-part labels consisting of the original place/transition number
together with the warp id. It should be noticed that each place corresponding to p2
and representing the readiness of the given warp should be connected by an
inhibitor arc with the transition t0. Moreover, each transition corresponding to t1
should be connected with the place p0 (SM). The same goes for transitions
corresponding to t5, t14 and t15. Note that in the part responsible for the access
to the global memory the control is returned by t5, not t11. Thanks to that, the SM
can process the next ready warp while the current one is waiting for the memory
access.
    Our PN model of GPU computation and memory access may be adapted
for any algorithm. Instantiations of the model for different kernels may differ
in the number of warps and the marking of the places responsible for instruction
counting (i.e. p5, p7 and p9). For the chosen number of required warps, a new
model containing a sufficient number of WPNMs can be generated. The other
possibility is the generation of one big model with the maximal possible number
of WPNMs (i.e. 64 for modern graphics cards [4]), in which the necessary number
of places representing active warps should be marked. The number of WPNMs can be
calculated based on the number of threads and blocks, which are parameters of the
kernel. Notice that the model provides no controlling mechanism for the maximum
number of WPNMs; it is up to the user to respect the limit on their number.



[Figure 2: diagram of the warp PN model (WPNM) – see the caption below and
Table 1 for the meaning of its places and transitions.]
Fig. 2. The example of a warp PN model (WPNM). In frame I the next instruction
is selected. Frames II, III and IV represent the execution of the different types of
instructions: arithmetic calculation, access to the shared memory and access to the
global memory, respectively.


       Places:
       p0   Streaming Multiprocessor (SM)
       p1   Active warp
       p2   The previous instruction is finished and the warp is ready for the next one
       p3   Check the next instruction
       p4   Next instruction is ready
       p5   Arithmetic operations
       p6   Instruction – arithmetic calculation
       p7   Access to the shared memory
       p8   Instruction – shared memory access
       p9   Access to the global memory
       p10  Instruction – global memory access

       Transitions:
       t0   Waiting
       t1   The warp is executed on SM
       t2   Selection of a calculation for the next instruction
       t3   Selection of an access to the shared memory for the next instruction
       t4   Selection of an access to the global memory for the next instruction
       t5   Execution of the access to the global memory
       t6   Execution of the access to the shared memory
       t7   Execution of arithmetic operation
       t8   Waiting for the global memory access
       t9   Access to the global memory
       t11  Access to the global memory is finished
       t12  Access to the shared memory
       t14  Access to the shared memory is finished
       t15  Calculation is finished
       t16  Termination of the warp
       t17  Waiting for the shared memory access
  Table 1. The description of the places and transitions depicted in Figure 2.




    The maximal number of WPNMs is limited to 64 in modern GPUs. The numbers
of arithmetic operations and of accesses to the shared and global memory



need to be calculated for the considered algorithm. Those numbers should be
used as the initial marking of the places p5, p7 and p9. If the model is constructed
according to the description above, it is ready to be used out of the box. The
usage of the model involves the execution of computations according to the
maximal concurrent semantics (i.e. in every step a maximal set of enabled
transitions is executed concurrently), starting from the initial marking until a
dead marking is reached. The latter is obtained only when all places corresponding
to p1 (for each WPNM) become empty, which is equivalent to the termination of all
warps. The number of steps of the computation is returned by the model and
corresponds to the GPU execution time.
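
For illustration, the number of WPNMs can be derived from the kernel configuration
as sketched below (hypothetical helper names; the instruction counts xa, xs and xg
still have to be determined from the kernel code by the user):

    // Number of warps (WPNMs) needed for a kernel launched with the given
    // configuration; modern GPUs limit the number of resident warps per SM to 64.
    int warpsNeeded(int threadsPerBlock, int numBlocks) {
        int warpsPerBlock = (threadsPerBlock + 31) / 32;   // warps of 32 threads
        return warpsPerBlock * numBlocks;
    }

    // Initial marking of one WPNM: the counted numbers of arithmetic operations
    // (place p5), shared memory accesses (p7) and global memory accesses (p9).
    struct WarpInstructionCounts { int xa, xs, xg; };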
    The maximal concurrent semantics requires the execution of all enabled
transitions. However, some of the enabled transitions may be in a conflict (i.e.
the execution of one transition disables another transition). In our implementation
of the model, in every step a permutation of the enabled transitions is randomly
generated (using the uniform distribution). Transitions are executed according to
the order determined by the generated permutation. If two (or more) transitions are
in a conflict, the transition appearing earlier in the considered order is executed.
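
A single step of this procedure could look as follows (a simplified sketch reusing
the Marking and Transition types from the sketch in Section 3; it is not our actual
implementation):

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Executes one maximally concurrent step and reports whether anything fired;
    // a return value of false means that a dead marking has been reached.
    bool executeStep(const std::vector<Transition>& net, Marking& M, std::mt19937& rng) {
        std::vector<int> order(net.size());
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), rng);   // uniformly random permutation

        Marking avail = M;                               // tokens still available in this step
        Marking produced(M.size(), 0);                   // tokens produced by this step
        bool fired = false;
        for (int t : order) {
            const Transition& a = net[t];
            bool ok = true;
            for (size_t p = 0; p < M.size(); ++p)
                if (avail[p] < a.pre[p] || (a.inhib[p] && M[p] > 0)) { ok = false; break; }
            if (!ok) continue;                           // disabled, or lost a conflict
            for (size_t p = 0; p < M.size(); ++p) {
                avail[p] -= a.pre[p];
                produced[p] += a.post[p];
            }
            fired = true;
        }
        for (size_t p = 0; p < M.size(); ++p) M[p] = avail[p] + produced[p];
        return fired;
    }

    // Driver loop: the returned number of steps corresponds to the modelled
    // GPU execution time.
    // int steps = 0; while (executeStep(net, M, rng)) ++steps;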
    The initial tests of the PN model were performed using the matrix multiplication
kernel from [4] and the stability example from [20]. The PN models were generated
for both kernels. The matrix multiplication program was executed many times with
different sizes of matrices. Similarly, the stability example was executed with
different values of the Time Step and Final Time parameters. For the same data,
the computations of the PN model were executed. The execution times of the kernels
and the numbers of steps of the PN computations were compared. The results were
consistent in both cases.


5   Experimental results

In the main experiment we used a binary version of the least significant digit
radix sort algorithm [17, 18]. The idea of this method is to sort a list of positive
n-bit integers using their binary representation. We make n runs rearranging
the list in such a manner that in the i-th run all the integers having 0 on the i-th
bit are arranged in the first part of the array, while those having 1 – in the second
part. An important requirement is to preserve the order of elements which do
not differ on the processed bit. In other words, the sorting subroutine needs to
be stable. As a side effect the whole sorting procedure is also stable.
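
For reference, a single run of the sequential variant can be sketched as follows (a
minimal illustration; it is not the CUDA code used in our experiments):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One stable run of the binary LSD radix sort: elements with bit i equal to 0
    // keep their relative order and precede those with bit i equal to 1.
    void radixRun(std::vector<uint32_t>& data, int bit) {
        std::vector<uint32_t> zeros, ones;
        for (uint32_t x : data) {
            if ((x >> bit) & 1u) ones.push_back(x);
            else                 zeros.push_back(x);
        }
        std::copy(ones.begin(), ones.end(),
                  std::copy(zeros.begin(), zeros.end(), data.begin()));
    }

    // The full sort: n runs, one per bit, least significant bit first.
    void radixSort(std::vector<uint32_t>& data, int nBits) {
        for (int i = 0; i < nBits; ++i) radixRun(data, i);
    }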
    In the parallel version we make n runs (one for every bit), each run consisting
of three phases. At the beginning of each run, we partition the dataset equally
between m nodes. During the first phase, the j-th node counts zeros[i, j] – the
number of its elements containing 0 on the i-th bit (consequently, we also know
ones[i, j] – the number of its elements with 1 on the i-th bit). In the next phase,
we need to compute the positions for the set of elements assigned to each node.
Namely, the j-th node should place all the elements having 0 on the i-th bit
between ∑_k