=Paper=
{{Paper
|id=Vol-1846/paper7
|storemode=property
|title=GPU Computations and Memory Access Model Based on Petri Nets
|pdfUrl=https://ceur-ws.org/Vol-1846/paper7.pdf
|volume=Vol-1846
|authors=Anna Gogolińska,Lukasz Mikulski,Marcin Piatkowski
|dblpUrl=https://dblp.org/rec/conf/apn/GogolinskaMP17
}}
==GPU Computations and Memory Access Model Based on Petri Nets==
Anna Gogolińska, Łukasz Mikulski, and Marcin Piątkowski

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
{anna.gogolinska,lukasz.mikulski,marcin.piatkowski}@mat.umk.pl

(This research has been supported by the Polish National Science Center through grant No. 2013/09/D/ST6/03928.)

Abstract. In modern systems CPUs as well as GPUs are equipped with multi-level memory architectures, where different levels of the hierarchy vary in latency and capacity. Therefore, various memory access models have been studied. Such a model can be seen as an interface abstracting the user from the details of the physical architecture. In this paper we present a general and uniform GPU computation and memory access model based on bounded inhibitor Petri nets (PNs). Its effectiveness is demonstrated by comparing its throughput to practical computational experiments performed on an Nvidia GPU with the CUDA architecture. Our PN model is consistent with the workflow of multithreaded GPU streaming multiprocessors. It models the selection and execution of instructions for each warp. The three types of instructions included in the model are: arithmetic operations, accesses to the shared memory and accesses to the global memory. For a given algorithm the model allows one to check how efficient the parallelization is, and whether a different organization of threads would improve performance. The accuracy of our model was tested on different kernels. As preliminary experiments we used the matrix multiplication program and the stability example created by Nvidia, and as the main experiment a binary version of the least significant digit radix sort algorithm. We created three implementations of the algorithm using the CUDA architecture, differing in the usage of shared and global memory as well as in the organization of calculations. For each implementation the PN model was used, and the results of the experiments are presented in this work.

Keywords: Petri nets, CUDA architecture, GPU, memory model

1 Introduction

The inter-process communication over a common part of the memory shared by processes is a usual performance bottleneck in multiprocessor environments. In modern systems CPUs as well as GPUs are equipped with multi-level memory architectures, where different levels of the hierarchy vary in latency and capacity. By considering different local views of the processes on the common part of the memory one can try to improve the processor utilization. Therefore, various memory access models have been studied, see for instance [8, 11, 15]. Such a model can be seen as an interface abstracting the user from the details of the physical architecture. It allows one to specify, without a reference to processors, the local views that are possible in concurrent task executions and to maintain their consistency.

Another important issue is the task distribution between threads and CPU/GPU cores and the instruction scheduling, which can have a significant impact on efficiency. Consider for instance running the three threads on a single processor depicted in Figure 1. Each of them performs a list of arithmetic operations interleaved with memory reads/writes, each consisting of a short preprocessing phase followed by a longer period of waiting for the memory access (which is a usual situation in parallel computing).
In the initial part of the computation each thread realizes its preprocessing for the memory access and then starts to wait for the access itself. This causes a processor idle period during which all threads are waiting (marked as I). After that the arithmetic operations of all threads are executed simultaneously. They can be scheduled in such a way that one thread waits for the memory access while the other threads perform their computations and there is no idle period (marked as II). Thanks to that the waiting period of a thread can be hidden behind the active computations of other threads.

Fig. 1. The example run of the threads t1, t2 and t3 on a single processor. The area marked with (I) represents the idle period when no computation is done, and the area marked with (II) represents the period when the waiting for the memory access is hidden behind arithmetic operations.

The main contribution of this paper is a general and uniform GPU computation and memory access model based on bounded Petri nets [14], together with an application simulating its execution. For a given algorithm pseudocode (or source code) one can use our model to count the number of operations performed (taking into account their simultaneous execution). The main advantage of the model over basic arithmetic counting and the other models described below is the possibility of reorganizing parts of the code, the potential ability to predict the duration of GPU calculations, and the ability to handle the aforementioned computation scheduling. Such an approach might be used to improve the organization of the algorithm code and to partition the computation tasks between the threads so as to maximize efficiency.

The effectiveness of our model is demonstrated by comparing its throughput to practical computational experiments performed on an Nvidia GPU. We study the impact on the complexity of various parameters, such as the number of concurrent processes and the level of memory used (shared memory vs. global memory). The successful application of our model is discussed in detail, as a proof of concept, on a single example: the least significant digit radix sort algorithm. We do not want to discuss methods of parallelization, nor the most efficient way of using the different types of memory; this is beyond the scope of this paper. Those problems are very complex and cannot be explained in such a short publication. More about those topics can be found for instance in [6, 20].

Our PN model is not the only GPU efficiency model. There are quite a few other GPU performance models, which can be divided into two groups (according to [12]):

(1) Calibrated performance models that make specific predictions and include many lower levels of detail. They usually contain many specialized parameters, some of which may be difficult to obtain or calculate. The first example is the model presented in [7]. It is the first analytical model of GPU efficiency. The most important values used there are MWP (the number of memory requests that can be executed concurrently) and CWP (the number of warps which can be computed while one warp is waiting for memory values). The model consists of 14 equations, which contain 21 parameters. Another example is the model presented in [19]. The authors created not only a GPU performance model but also a power consumption model.
In the performance part they estimate the execution time for individual GPU architecture components (e.g. shared and global memory) separately. This way they can easily identify potential performance bottlenecks. The model has the form of equations and contains 32 parameters.

(2) Asymptotic models for algorithm analysis at a high level of abstraction that capture only the essential features of the GPU architecture. One of the models from this group is the model described in [13]. Its author used elements of the PRAM, BSP and QRQW approaches. The model calculates the number of cycles required for the whole kernel, taking into account the number of blocks, warps and threads, the maximal number of cycles required by a single thread to perform its calculations, and the number of threads which can be executed in parallel. However, the model is quite simple and some important elements are neglected (like hiding memory latency behind computations). Another asymptotic model is the Thread Multi-Core Memory (TMM) model presented in [11]. In the TMM, GPUs are presented as abstract core groups, each containing a number of cores and fast local memory. A large and slow global memory is shared by all cores. The running time of the algorithm is calculated from a suitable equation with basic GPU parameters. One of the most popular GPU efficiency models is the roofline model introduced in [21]. It can be used to obtain performance estimates of GPU computations and requires two parameters: the number of operations performed by a kernel and the number of bytes transferred from/to the memory, which are used to calculate the arithmetic intensity I. In the naive roofline model I can be handily presented as a point in a two-dimensional space restricted by two ceiling lines: the memory bandwidth and the processor's peak efficiency. The resulting performance is the bound under which the arithmetic intensity appears: the memory bandwidth bound or the peak performance bound. In extended versions of the model additional ceilings can be added; they relate, for example, to software prefetching or task level parallelism. The roofline model can determine the type of kernel limitation and show how close to optimal the program is. However, such a simple arithmetic calculation cannot precisely represent the complex processes of kernel execution, like for example hiding the waiting period behind calculations (see Figure 1). A similar problem occurs in other purely arithmetical models. More complex tools are necessary for such a purpose. Our model belongs to the asymptotic group, and to the best of our knowledge it is the first one utilizing Petri nets.

The paper is organized as follows. In the next section we describe Graphical Processing Units and the CUDA Toolkit, focusing in particular on memory types (and their organization). In Section 3 we recall some standard notions and notations related to Petri nets. Then we introduce our memory access and computation model, followed by a description of the radix sort algorithm. We also present the results of experiments conducted on the basis of our implementations of the radix sort algorithm. We conclude the paper and give some directions of further research in the last section.
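As a point of reference for the roofline discussion above, the naive roofline bound can be written as a single formula (this is the standard formulation of the bound, not a verbatim quotation from [21]): with W the number of operations performed by a kernel, Q the number of bytes transferred from/to the memory, and I = W/Q the arithmetic intensity, the attainable performance is

P_attainable = min(P_peak, I · B),

where P_peak is the processor's peak performance and B is the memory bandwidth. Whichever term realizes the minimum identifies the kernel as compute bound or memory bound.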
2 CUDA

An intensive development of Graphical Processing Units (GPUs in short) resulted in the construction of high performance computational devices which, besides the graphical display management, also allow the execution of parallel general purpose algorithms (not necessarily related to computer graphics). As opposed to a CPU, which consists of a few cores optimized for sequential serial processing, a GPU contains thousands of smaller, more efficient cores designed for handling multiple tasks simultaneously. The most popular ones are the GPUs produced by Nvidia Corporation, supplied with the Nvidia CUDA Toolkit (see [2]).

A program running in a heterogeneous environment equipped with a GPU can be split into so-called host parts, which are executed by the CPU, and so-called kernel parts, which are executed by the GPU. The host part specifies the kernel execution context and manages the data transfer between the host and the GPU memory. The kernel functions create a large number of threads allowing a highly parallel computation (where each thread runs the same kernel code). GPU threads, as opposed to CPU threads, are much lighter, hence their creation and controlling requires fewer CPU cycles.

All threads executed on a GPU are organized into several equally-sized thread blocks, which in turn are organized into a grid. A thread block is a set of concurrently executed threads. Such an execution and the thread cooperation can be coordinated by a barrier synchronization. Moreover, the data can be exchanged between threads using a shared memory. The size of a block is limited by the capacity of the resources accessible on a single processor core. On currently available GPUs a single block can contain up to 1024 threads. Each thread block within a grid is uniquely identified by its block ID, and each thread within a block – by its thread ID. A grid is an array of thread blocks executing the same kernel. The number of blocks in a grid is determined by the amount of data to be processed and the number of available processors. The exchange of data between threads within a grid requires the usage of the global memory. Moreover, while all threads within a single block run simultaneously, different blocks can be executed in any order.

The physical Nvidia GPU architecture consists of several multithreaded streaming multiprocessors (SMs in short). Thread blocks are distributed between SMs in such a way that all threads within a single block are concurrently executed on the same SM (different blocks may, but do not necessarily have to, be executed on the same SM). A streaming multiprocessor organizes threads into so-called warps consisting of 32 threads each. The partition is done according to increasing thread IDs. Each SM works according to the SIMT (single-instruction, multiple-threads) architecture, which means that all threads within a single warp execute one common instruction at a time. Any divergence (e.g. caused by the data being processed) leads to a serial execution of each single computation path until a possible convergence to the same execution path. A single SM serves multiple warps. It is equipped with a number of warp schedulers and instruction dispatch units (currently 2 or 4, depending on the device used). A scheduler selects a warp which is ready to be executed and issues it to the physical cores of the GPU. If the currently active warp needs to wait for a memory read/write operation, it is replaced by another ready warp.
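To make the notions of kernel, grid, block and thread ID used above concrete, the following minimal CUDA sketch (ours, not taken from the paper; the kernel name, array size and block size are arbitrary illustrative choices) shows a host part launching a kernel in which every thread derives its global index from its block ID and thread ID and processes one array element.

#include <cuda_runtime.h>

// Every thread handles one array element: blockIdx/threadIdx are the block ID
// and thread ID described above, and the grid is sized to cover the array.
__global__ void scaleKernel(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // guard for the last block
        data[idx] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));  // allocate global memory
    // ... the host part would copy input data here with cudaMemcpy ...

    int threadsPerBlock = 256;                        // block size (at most 1024)
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();                          // wait for the kernel

    cudaFree(d_data);
    return 0;
}

The launch configuration <<<blocks, threadsPerBlock>>> corresponds to the grid and block sizes discussed above; on the device each block is further split by the SM into warps of 32 threads.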
While the replaced warp is waiting for the memory access, other warps perform their computations; therefore the SM is kept busy as much as possible. The waiting period of a single warp is hidden behind the computations of the others (provided there are enough active warps available) and is not visible outside the SM. The simulation of such a behavior is the main part of our model.

The above mentioned heterogeneous environment is equipped with the host memory managed by the CPU and the GPU device memory. The latter is significantly more complex than the former. Due to a necessary compromise between the data transfer/access speed and the possible capacity, the GPU device memory consists of various types of data storage, such as global, constant, local and shared memory.

The global memory is the largest and at the same time the slowest type of GPU memory (with a latency of hundreds of cycles). Together with the constant memory it is the only type of GPU memory which can be accessed by the host. It is available for reading and writing to all running threads, however the data exchange and result sharing are possible only after a kernel-wide global synchronization. The content of the global memory is accessed in blocks of size 32, 64 or 128 bytes (depending on the device used). Every time an element within a block is accessed, the whole block has to be transferred. Therefore, the concurrent (among the threads within a single warp) global memory read and write operations are grouped into transactions, the number of which depends on the cache lines required to serve all threads within a warp. However, if different threads in a warp refer to different memory blocks, all such blocks have to be transferred to the cache sequentially. Such a situation causes the necessity of repeated global memory accesses.

The shared memory is a fast memory physically placed inside a multiprocessor. It consists of blocks, each of which is available to all threads within a single thread block. Moreover, each such block is divided into several so-called memory banks. The access of different threads to different banks is realized simultaneously, while the access of different threads to the same bank is realized sequentially; the latter situation is called a bank conflict. The shared memory can be used for data exchange between threads within the same thread block after a block-wide thread synchronization.

The local memory of a single thread consists of a number of registers, which are the fastest type of memory available (with almost negligible access time). It is used to store local thread variables. Due to the large number of threads the capacity of each thread's local memory is strongly limited.

To complete the picture we also have to mention the constant and texture memory – dedicated parts of the GPU device memory (usually buffered). Both of them are optimized for access speed within the device, but are available to threads in read-only mode. Neither of them is considered in our model.

3 Petri Nets

The set of non-negative integers is denoted by IN. Given a set X, the cardinality (number of elements) of X is denoted by |X|, the powerset (set of all subsets) by 2^X – the cardinality of the powerset is 2^|X|. Multisets over X are members of IN^X, i.e., functions from X into IN. For convenience and readability, if the set X is finite, multisets in IN^X will be represented by vectors of IN^|X| (assuming a fixed ordering of the set X).
The addition and the partial order ≤ on IN^X are understood componentwise, while < means ≤ and ≠. Let us now recall basic definitions and facts concerning inhibitor Petri nets [1, 16].

Definition 1. An inhibitor place/transition net (p/t-net) is a quintuple S = (P, T, W, I, M0), where:
– P and T are finite disjoint sets of places and transitions (actions), respectively;
– W : P × T ∪ T × P → IN is an arc weight function;
– I ⊆ P × T is an inhibition relation;
– M0 ∈ IN^P is a multiset of places, named the initial marking.

(Note that in the case of bounded nets the use of inhibitors is not necessary; one can provide an equivalent net, with a more complex structure, without inhibitors.)

For all a ∈ T we use the following denotations:
•a = {p ∈ P | W(p, a) > 0} – the set of entries to a,
a• = {p ∈ P | W(a, p) > 0} – the set of exits from a,
◦a = {p ∈ P | (p, a) ∈ I} – the set of inhibitor places for a.

Petri nets admit a natural graphical representation. Nodes represent places and transitions, arcs with classical arrowheads represent the weight function, while arcs with small circles as arrowheads represent the inhibition relation. Places are indicated by circles, and transitions by boxes. Markings are depicted by tokens inside the circles; the capacity of places is unlimited. However, the Petri nets used in our model are bounded (which means that there exists a common bound for the number of tokens appearing in any single place during the computation).

The set of all finite strings of transitions is denoted by T*, the empty string is denoted by ε, the length of w ∈ T* is denoted by |w|, and the number of occurrences of a transition a in a string w is denoted by |w|_a. Multisets of places are called markings. In the context of p/t-nets, they are typically represented by non-negative integer vectors of dimension |P|, assuming that P is totally ordered.

A transition a ∈ T is enabled at a marking M whenever •a ≤ M (all its entries are marked) and ∀p ∈ ◦a : M(p) = 0 (all inhibitor places are empty). If a is enabled at M, then it can be executed. A marking M is called a dead marking if no transition is enabled at M (which means that ∀a ∈ T ∃p ∈ P : W(p, a) > M(p) ∨ ((p, a) ∈ I ∧ M(p) > 0)). The execution of an enabled transition a is not forced and changes the current marking M to the new marking M′ = (M − •a) + a• (tokens are removed from the entries, then put into the exits). We write M[a⟩ for the predicate "a is enabled at M" and M[a⟩M′ for the predicate "a is enabled at M and M′ is the resulting marking".

In this paper, however, we use the maximal concurrent semantics, and in every marking we execute one of the maximal sets of enabled transitions (i.e. a step which is maximally concurrent at this marking). Formally, a set of transitions A ⊆ T is called a step and is enabled if Σ_{a∈A} •a ≤ M and ∀p ∈ (⋃_{a∈A} ◦a) : M(p) = 0. The execution of a step A changes the current marking M to the new marking M′ = M − Σ_{a∈A} •a + Σ_{a∈A} a•. We say that a step A is maximally concurrent at a marking M if A is enabled at M and for every a ∉ A the step A ∪ {a} is not enabled at M. The notions of enabledness and execution are extended, in a natural way, to strings of steps (computations): the empty string ε is enabled at any marking, and a string w = Av is enabled at a marking M whenever M[A⟩M′ and v is enabled at M′. The predicates M[A⟩, M[w⟩, M[A⟩M′ and M[w⟩M′ are defined as for single transitions.
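As a small illustration of the firing rule recalled above (a sketch of ours, not part of the original paper; the representation of markings as dense integer vectors over a fixed ordering of P follows the convention introduced in this section), a transition of an inhibitor p/t-net is enabled exactly when its entries carry enough tokens and all of its inhibitor places are empty:

#include <cstddef>
#include <vector>

// Markings are non-negative integer vectors indexed by places
// (a fixed ordering of P is assumed, as in the text above).
using Marking = std::vector<int>;

// A transition of an inhibitor p/t-net, stored as dense vectors over P.
struct Transition {
    std::vector<int> pre;        // W(p, a) for every place p (entries of a)
    std::vector<int> post;       // W(a, p) for every place p (exits of a)
    std::vector<bool> inhibitor; // true iff (p, a) belongs to I
};

// a is enabled at M iff  •a <= M  and every inhibitor place of a is empty.
bool enabled(const Marking &M, const Transition &a) {
    for (std::size_t p = 0; p < M.size(); ++p) {
        if (a.pre[p] > M[p]) return false;            // not enough tokens
        if (a.inhibitor[p] && M[p] > 0) return false; // inhibitor place marked
    }
    return true;
}

// Execution of an enabled a:  M' = (M - •a) + a•.
Marking fire(Marking M, const Transition &a) {
    for (std::size_t p = 0; p < M.size(); ++p)
        M[p] = M[p] - a.pre[p] + a.post[p];
    return M;
}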
4 Memory access and computations model

The Petri net model of GPU calculations is consistent with the workflow of multithreaded streaming multiprocessors (SMs). The model represents the way one SM operates. It models each warp assigned to the SM, the selection of the next instruction for the SM, the accesses to the global and shared memory, and the arithmetic operations. It does not represent the thread hierarchy (blocks, grid), repeated accesses to the global memory, nor bank conflicts. The model allows simultaneous accesses to the global memory, but the number of warps which can use the global memory at the same time is limited by the number of SM warp schedulers (2 or 4). Each element of the model was created based on [3, 4].

The size of the considered Petri net depends on the number of warps. Two elements, the place p0 (SM) and the transition t0 (waiting), are the constant part of the model; the others are generated for every warp. The place p0 represents the streaming multiprocessor and its initial marking should correspond to the number of warp schedulers; for modern graphics cards it should be 2 or 4. This place is connected by a loop with the transition t0. The transition t0 can be executed only when the warp schedulers cannot schedule any instruction, i.e. there is no instruction ready to be executed. The waiting transition is added to the model to keep track of how many steps of the computation the SM remains idle. The minimization of that number is crucial for the optimization of GPU programming. Besides those two elements, the PN model contains 18 places and 17 transitions for every warp required by the analysed algorithm. That part is called the warp part of the PN model (WPNM). A detailed description of the most important places and transitions of the WPNM is presented in Table 1. The WPNM together with the SM place and the waiting transition is depicted in Figure 2.

The place p1 represents the activation of the warp; it is marked when the warp is active. The place p2 is marked when the warp has finished the execution of its previous instruction. The warp can be scheduled for execution only when the places p2 and p4 are marked. A token in p4 means that the next instruction is selected and ready. The places p3 to p10 and the transitions t2, t3, t4 (frame I in Figure 2) are responsible for controlling and selecting the instructions. Three types of instructions are allowed in the model: an arithmetic calculation, an access to the shared memory and an access to the global memory. The initial marking of the place p5 corresponds to the number xa of arithmetic operations required by the analyzed algorithm. Similarly, the initial markings of the places p7 and p9 represent respectively the numbers xs and xg of read/write operations from/to the shared and the global memory. If the place p3 is marked and at least one of the places p5, p7, p9 is not empty, the next instruction can be selected. The selection is random. When the instruction is chosen, according to its type (arithmetic calculation, shared memory access, global memory access) the marking of the corresponding place is decreased and a token is added to p4. Moreover, (according to the instruction type) one of the places p6, p8 or p10 is marked. Now the selected instruction is ready to be executed and the warp can be processed by the SM.
The execution of the instruction is represented by the three parts of the net marked as frame II, frame III and frame IV (see Figure 2). They correspond to the type of the selected instruction: frame II for an arithmetic calculation, frame III for an access to the shared memory and frame IV for an access to the global memory. The arithmetic calculation is simply represented by one place and two transitions. When the calculation is done, tokens are put in the places p2 and p3 (which means that the instruction is finished and the next one can be selected), and in the place p0 (which means that the SM is ready to execute the next instruction). The same situation is obtained when the access to the memory (shared or global) is finished, but those parts of the model contain more places and transitions. Those additional elements are used to model the memory access latencies. The shared memory latency and the global memory latency are parameters of the model and are denoted by l1 for the global memory and by l2 for the shared memory. The transition t9 (t12, respectively) may be executed only after l1 (l2, respectively) executions of t8 (t17, respectively). For testing, their default values were 20 for l1 and 2 for l2, which is consistent with [4]. The transition t16 is connected by inhibitor arcs with the places p5, p7 and p9 (frame I in Figure 2) and can be executed only when those places are empty, i.e. there is no instruction left for execution. The transition is also connected by a regular arc with p2. Moreover, t16 is the only transition able to take the token from the place p1, and its execution is equivalent to the termination of the given warp.

As mentioned above, the WPNM (places p1 to p18 and transitions t1 to t17) is generated for a single warp. In the case of multiple warps a separate WPNM should be generated for each of them. To distinguish between distinct WPNMs one can either increase the numbering of places and transitions accordingly or assign to them two-part labels consisting of the original place/transition number together with the warp id. It should be noticed that each place corresponding to p2 and representing the readiness of the given warp should be connected by an inhibitor arc with the transition t0. Moreover, each transition corresponding to t1 should be connected with the place p0 (SM). The same goes for the transitions corresponding to t5, t14 and t15. Note that the control is returned by t5, not t11, in the part responsible for the access to the global memory. Thanks to that, the SM can process the next ready warp while the current one is waiting for the memory access.

Our PN model of GPU computation and memory access may be adapted to any algorithm. Instantiations of the model for different kernels may differ in the number of warps and in the marking of the places responsible for instruction counting (i.e. p5, p7 and p9). For the chosen number of required warps, a new model containing a sufficient number of WPNMs can be generated. The other possibility is the generation of one big model with the maximal possible number of WPNMs (i.e. 64 for modern graphics cards [4]), in which the necessary number of places representing active warps should then be marked. The number of WPNMs can be calculated from the number of threads and blocks, which are parameters of the kernel.
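A minimal sketch of how such an instantiation could look in code (the structure and all names below are our own illustration, not the authors' generator): the launch parameters of the kernel determine the number of WPNMs, while the per-warp instruction counts xa, xs and xg become the initial markings of p5, p7 and p9.

#include <cstddef>
#include <vector>

// Per-warp instruction counts extracted from the analysed kernel.
struct WarpProfile {
    int xa; // arithmetic operations          -> initial marking of p5
    int xs; // shared-memory read/write ops   -> initial marking of p7
    int xg; // global-memory read/write ops   -> initial marking of p9
};

struct ModelInstance {
    int schedulers;                 // initial marking of p0 (2 or 4)
    std::vector<WarpProfile> warps; // one WPNM per warp
};

// Derive the number of WPNMs from the kernel launch parameters:
// each block is split into warps of 32 threads (padded to a full warp).
ModelInstance buildModel(int blocks, int threadsPerBlock,
                         const WarpProfile &perWarp, int schedulers = 2) {
    int warpsPerBlock = (threadsPerBlock + 31) / 32;
    ModelInstance m;
    m.schedulers = schedulers;
    m.warps.assign(static_cast<std::size_t>(blocks) * warpsPerBlock, perWarp);
    // The model itself does not cap this number: the user has to keep it
    // within the hardware limit (at most 64 WPNMs on modern GPUs).
    return m;
}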
It is up to the user to be aware 114 PNSE’17 – Petri Nets and Software Engineering p2 • p1 • t1 p11 t0 (I) p4 p3 • (II) t7 t15 •• p0 p17 t16 t2 t3 t4 t6 (III) xa p5 p 6 xs p 7 p8 xg p9 p10 2 p15 t17 l2 +1 p12 l1 +1 t l2 t5 9 t11 t12 l2 p18 p14 2 l1 t8 l1 p16 t14 p13 (IV) Fig. 2. The example of a warp PN model (WPNM). In frame I part the next instruc- tion is selected. Sections: frame II, frame III and frame IV represent the execution of different types of instructions: arithmetic calculation, access to the shared memory and access to the global memory (respectively). Place/Transition name and description p0 Streaming Multiprocessor (SM) t0 Waiting p1 Active warp t1 The warp is executed on SM p2 The previous instruction is finished and t2 Selection of a calculation for the warp is ready for the next one the next instruction p3 Check the next instruction t3 Selection of an access to the shared memory for the next instruction p4 Next instruction is ready t4 Selection of an access to the global memory for the next instruction p5 Arithmetic operations t5 Execution of the access to the global memory p6 Instruction – arithmetic calculation t6 Execution of the access to the shared memory p7 Access to the shared memory t7 Execution of arithmetic operation p8 Instruction – shared memory access t8 Waiting for the global memory access p9 Access to the global memory t9 Access to the global memory p10 Instruction – global memory access t11 Access to the global memory is finished t12 Access to the shared memory t14 Access to the shared memory is finished t15 Calculation is finished t16 Termination of the warp t17 Waiting for the shared memory access Table 1. The description of the places and transitions depicted in the Figure 2. that the maximal number of WPNMs is limited to 64 in modern GPUs. The number of arithmetic operations and accesses to the shared and global memory Gogolińska et.al.: GPU Computations and Memory Access Model . . . 115 need to be calculated for the considered algorithm. Those numbers should be used as the initial marking of places p5 , p7 and p9 . If the model is constructed according to the description above, it is ready to be used out of the box. The usage of the model involves the execution of computations according to the maximal concurrent semantics (i.e. concurrent execution of all transitions that are enabled) starting from the initial marking until reaching the dead marking. The latter is obtained only when all places corresponding to p1 (for each WPNM) become empty, which is equivalent to the termination of all warps. The number of steps of the computation is returned by the model and corresponds to the GPU execution time. The maximal concurrent semantics requires execution of all enabled transi- tions. However, some of the enabled transitions may be in a conflict (i.e. execu- tion of one transition makes disables another transition). In our implementation of the model, in every step, a permutation of the enabled transitions is randomly generated (using uniform distribution). Transitions are executed according to an order determined by the generated permutation. If two (or more) transitions are in a conflict, the transition appearing earlier in the considered order is executed. The initial tests of the PN model were performed using the matrix multipli- cation kernel from [4] and the stability example from [20]. The PN models were generated for both kernels. The matrix multiplication program was executed many times with different sizes of matrices. 
The initial tests of the PN model were performed using the matrix multiplication kernel from [4] and the stability example from [20]. The PN models were generated for both kernels. The matrix multiplication program was executed many times with different sizes of matrices. Similarly, the stability example was executed with different values of the Time Step and Final Time parameters. For the same data, the computations of the PN model were executed. The execution times of the kernels and the numbers of steps of the PN computations were compared. The results were consistent in both cases.

5 Experimental results

In the main experiment we used a binary version of the least significant digit radix sort algorithm [17, 18]. The idea of this method is to sort a list of positive n-bit integers using their binary representation. We make n runs rearranging the list in such a manner that in the i-th run all the integers having 0 on the i-th bit are arranged in the first part of the array, while those having 1 – in the second part. An important requirement is to preserve the order of elements which do not differ on the processed bit; in other words, the sorting subroutine needs to be stable. As a side effect the whole sorting procedure is also stable.

In the parallel version we make n runs (one for every bit), each run consisting of three phases. At the beginning of each run, we partition the dataset equally between m nodes. During the first phase, the j-th node counts zeros[i, j] – the number of elements containing 0 on the i-th bit (consequently, we also know ones[i, j] – the number of elements with 1 on the i-th bit for this node). In the next phase, we need to compute the positions for the set of elements assigned to each node. Namely, the j-th node should place all the elements having 0 on the i-th bit between k