Investigation of the RAM access model in a heterogeneous computing system

Aleksander Kolpakov (ORCID: 0000-0001-9328-2331), Aleksey Belov (ORCID: 0000-0002-2072-7042), Yuriy Kropotov (ORCID: 0000-0002-9012-3092)
Department of Electronics and Computer Science, Vladimir State University named after Alexander and Nicholay Stoletovs, Murom, Russia

Abstract—The issue of creating high-performance computing systems based on heterogeneous computer systems is topical, since the volumes of processed information, calculations and studies with large data sets are constantly increasing. The aim of this work is an experimental study of previously developed models for predicting the performance of heterogeneous computer systems in telecommunications. The study showed that the developed models provide an adequate estimate of the expected running time of an algorithm for various GPU parameters, with some limitations.

Keywords—graphic processor units, heterogeneous computing systems, parallel calculations

I. INTRODUCTION

One of the most dynamically developing areas in parallel programming at the moment is the use of computer systems with a heterogeneous architecture, which contain computing devices with different architectures and, accordingly, different approaches to parallel computing. The most common approach in the design of such systems is to use graphics cards, or devices based on them, as the main parallel computing unit. The growing need to solve large problems stimulates research and innovation in parallel computing in general and in the development of methods for graphics processors in particular.

II. RESEARCH AND DEVELOPMENT OF MODELS OF HETEROGENEOUS COMPUTING SYSTEMS BASED ON GRAPHICS PROCESSORS

Modern graphics processors (GPUs) are parallel processors. More precisely, they are known as stream processors, because they are capable of performing various functions on an incoming data stream. They are advanced architectures designed for the parallel processing of (primarily graphic) data, and they are currently extremely powerful programmable processors with MIMD-architecture capabilities, subject to some limitations.

As technology, languages, and hardware evolved, researchers were able to leverage the added flexibility of GPUs to deploy non-graphical applications on the GPU (GPGPU), especially for image processing. A more detailed history of the development of GPGPU is presented in [1].

A further development impulse was the emergence of CUDA, the NVIDIA C-based development environment for GPGPU. CUDA allows developers unfamiliar with graphics programming to write code that can be executed on the GPU. It provides the abstractions necessary for a developer to write multi-threaded programs with little or no knowledge of the graphics APIs. Since then, many parallelized applications have been developed for GPUs, many of which offer significant acceleration compared to sequential implementations on the processor.

A model of the upper bound for the running time of an algorithm on a graphics processor in a CPU-GPU environment is presented in [2] and is based on the abstract PRAM model [3]. According to this model, the upper estimate of the running time of an algorithm on a graphics processor in a CPU-GPU environment is

    T_{GPU}(N) = \frac{N_{HD}(N)}{S_{HD}} + \sum_{i=1}^{B(N)} T_i^G(N) + \frac{N_{DH}(N)}{S_{DH}}    (1)

where the three terms are, respectively, the host-to-device transfer time N_{HD}(N)/S_{HD}, the total GPU running time of the B(N) kernels, and the device-to-host transfer time N_{DH}(N)/S_{DH}.

This model does not account for certain features of the graphics processor, such as the warp size, the GPU memory access times, etc. However, a GPGPU performance prediction model that combines several known parallel computing models was given in [4]. Given the complex architecture of the GPU, none of these models is complete on its own; a combination of them, along with a few extensions, is required. The following models were used during its development:

1) The PRAM model [3].
2) The BSP model [5].
3) The QRQW model [6].

The final equation of this model is

    T(K) = \frac{N_B(K) \, N_w(K) \, N_t(K) \, C_T(K)}{N_C \, D \, R}    (2)

All parameters of this model are listed in Table I.

TABLE I. THE LIST OF DEVELOPED MODEL PARAMETERS

    Parameter   Description
    D           The kernel pipeline depth
    N_C         The number of cores per SM
    R           The GPU clock rate
    C_T(K)      The maximum number of ticks consumed by any thread in kernel K
    N_t         The number of threads per warp (= 32)
    N_w         The number of warps per block
    N_B(K)      The number of blocks in kernel K
    K_i         The i-th kernel on the GPU
    T(K)        The time spent by kernel K
    T(P)        The time spent by program P

The performance of a CUDA kernel can vary greatly with slight changes in memory access strategy. Using shared memory can provide up to 20 times better performance than using global memory, and coalesced global memory access can lead to 5 times better performance compared to non-coalesced access.
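To make the use of model (2) concrete, the following is a minimal host-side sketch, written for this text rather than taken from the paper, that evaluates T(K) from the parameters of Table I. The structure names and the sample numbers are illustrative assumptions.

    /* Host-side sketch: evaluating performance model (2),
     * T(K) = (N_B(K) * N_w(K) * N_t(K) * C_T(K)) / (N_C * D * R).
     * Types and sample values are assumptions for illustration. */
    #include <cstdio>

    struct KernelParams {
        double nBlocks;    /* N_B(K): blocks in the kernel         */
        double nWarps;     /* N_w(K): warps per block              */
        double nThreads;   /* N_t:    threads per warp (32)        */
        double maxTicks;   /* C_T(K): max ticks used by any thread */
    };

    struct DeviceParams {
        double coresPerSM; /* N_C                      */
        double depth;      /* D: kernel pipeline depth */
        double clockHz;    /* R: GPU clock rate        */
    };

    double predictKernelTime(const KernelParams &k, const DeviceParams &d)
    {
        return (k.nBlocks * k.nWarps * k.nThreads * k.maxTicks)
             / (d.coresPerSM * d.depth * d.clockHz);
    }

    int main()
    {
        DeviceParams dev = {8.0, 4.0, 1.296e9};  /* GTX 280-like values (assumed) */
        KernelParams krn = {256.0, 16.0, 32.0, 4000.0};
        printf("Predicted T(K) = %.6f s\n", predictKernelTime(krn, dev));
        return 0;
    }

In practice, C_T(K) has to be measured or estimated for the kernel in question; the formula itself only captures the throughput-limited case.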


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

Arithmetic operations also require different numbers of cycles: for example, an integer addition takes 4 cycles, while an integer modulo takes 48 cycles [7,8]. Any model that does not capture these differences is unlikely to be accurate.

III. USING GPU MEMORY ACCESS PATTERNS TO IMPROVE HETEROGENEOUS COMPUTING SYSTEM PERFORMANCE

Since the access delay when reading data from the global memory of the GPU is up to 200 times higher than when reading from the registers, providing the most efficient access to GPU RAM is crucial for the performance of the GPU in particular and of the heterogeneous system as a whole. Optimizing global memory access is therefore the single most important programming factor for the GPGPU architecture.

The GPU reads and writes global memory in half-warps (16 threads), and the device can serve these accesses in just one global memory transaction if certain access requirements are met. For the GTX 280 video card, the following protocol determines the number of transactions used by a half-warp:

1) Find the memory segment that contains the address requested by the active thread with the lowest number. The segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64- and 128-bit data.
   a) if the transaction size is 128 bytes and only the upper or lower half of the segment is used, the transaction size is reduced to 64 bytes;
   b) if the transaction size is 64 bytes and only the lower or upper half is used, the transaction size is reduced to 32 bytes.
2) The transaction is carried out, and the serviced threads are marked as inactive.
3) Steps 1-2 are repeated until all threads in the half-warp are serviced.

When more than one thread requests data from addresses that fall into the same segment, one transaction can satisfy all such threads. This servicing of multiple requests in a single transaction is called coalescing. Certain GPU memory access patterns therefore have a positive effect, while others progressively increase the delay as the memory addresses requested by the half-warp threads diverge.

The shared memory of the GPU is divided into memory modules of the same size, called banks, which can be accessed simultaneously by several threads. Any read or write request consisting of n addresses that fall into n separate memory banks can thus be carried out simultaneously, giving an effective throughput n times higher than the throughput of one module. However, if two requested addresses fall into the same memory bank, a bank conflict occurs and the accesses must be serialized. The GPU splits memory access requests with bank conflicts into as many conflict-free individual requests as necessary, reducing the effective bandwidth by a factor equal to the number of individual requests. For the GTX 280, the warp size is 32 and the number of banks is 16. A shared memory access by a warp is split into one request for the first half-warp and one request for the second half-warp. As a result, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.

Shared memory also has a broadcast mechanism that allows a 32-bit word to be read and broadcast to several threads simultaneously in a single memory read transaction. This reduces the number of bank conflicts when several threads of a half-warp read addresses within the same 32-bit word. More precisely, a memory read request made to several addresses is served in several steps over time, one step every two cycles, serving one conflict-free subset of these addresses per step, until all addresses are served. At each step, a subset is constructed from the remaining addresses that have yet to be served, using the following procedure:

1) Select one of the words indicated by the remaining addresses as the broadcast word.
2) Include in the subset:
   a) all addresses that are within the broadcast word;
   b) one address for each bank indicated by the remaining addresses.

The constant memory space is cached, so a read from constant memory incurs the latency of a global memory read only on a cache miss; otherwise, the request is served from the constant cache. For all threads of a half-warp, reading from the constant cache is as fast as reading from registers if all threads read the same address. The access time scales linearly with the number of different addresses read by the threads.

Texture memory allows data residing in global memory to be cached. If a cached item is requested, it is served in a single request. A cache miss results in a global memory read operation, which takes much longer.

To compare the access times of the different types of memory, 1,000,000 read operations from each type of memory were performed, both sequentially and randomly, on an NVIDIA GTX 280 GPU. The results are presented in Table II. As can be seen, shared memory provides the best performance, followed by the constant cache and then the texture cache. Global memory shows the highest latency.

TABLE II. COMPARISON OF ACCESS TIMES FOR DIFFERENT MEMORY TYPES

    Memory type   Serial access, ms   Random access, ms
    Global        969                 21777
    Shared        51                  86.6
    Constant      35                  192
    Texture       140                 247

IV. EXPERIMENTAL MODELING OF GPU MEMORY ACCESS PATTERNS

For most parallel computing platforms, modeling memory access patterns and their associated costs is the most complex and most important part. The following algorithm is used to experimentally verify the statements about the process of accessing memory on the GPU (CUDA sketches of this benchmark and of the one in Section V are given at the end of Section V):

Algorithm 1 Global Memory Access Benchmark
Input data: number of elements N, step stride, offset offset, array A in global memory.
1: Calculate the number of elements per thread, N_thread;
2: Calculate the data range of this thread, using stride and offset;
3: while index is in the range do
4:   Read A[index] into variable R;
5:   Increment R and save it back to A[index];
6:   index = index + stride;
7: end while

Using the above algorithm, an experiment was simulated that shows how much the gain from coalescing depends on the access pattern of the threads in a warp. This is controlled by the stride variable, which denotes the gap between elements accessed sequentially by a single thread. Consequently, the threads of a half-warp can take advantage of coalesced access if the stride value is large. For example, when stride = 32, the warp threads access consecutive elements at each step, which ensures full coalescing. When stride is 1, each thread reads an element that is offset by 32 from that of the neighboring thread, so the accesses are not coalesced and 16 memory transactions are required to service a half-warp. To ensure a fair comparison, the number of accesses per thread in the above code does not depend on stride.

In the code shown in Algorithm 1, the number of calculations per iteration is very small compared to the memory access delay for stride = 1. However, as the stride value increases, memory access and calculation take approximately the same number of clock cycles. Using the MAX model, we can estimate the running time of this kernel and compare it with the actual time in Figure 1, where the program execution time is shown for various stride values. It should be noted that the memory-access-only code, that is, the code with a small number of calculations, deviates from the model due to the limited information available about the hardware [9,10].

Fig. 1. The results of experimental studies of modeling access to global memory using the MAX model.

As can be seen from Figure 1, the value of stride significantly affects the execution time of the algorithm. Because the amount of actual computation in the algorithm is very small, the graph shows that, to increase the performance of the parallel algorithm, the array elements being read should be located at a distance from each other of no less than the warp size, so as to take advantage of coalesced memory access. It can also be seen from Figure 1 that although the graph of the predicted execution time does not coincide with the graph of the actual time, it provides enough information about the decrease in performance as the distance between the read elements decreases. This suggests that the model presented above adequately describes the performance of GPGPU.

V. EXPERIMENTAL VERIFICATION OF THE IMPACT OF ACCESS CONFLICTS WHEN USING SHARED MEMORY

In this experiment, while maintaining the same general structure of global memory accesses as in the previous experiment, each thread additionally writes an element to shared memory. The shared memory access pattern is controlled by the bank variable, which can be set to a value from 0 to 16; larger bank values increase the number of access conflicts.

Algorithm 2 Shared Memory Access Benchmark
Input data: number of elements N, step stride, offset offset, control variable bank, array A in global memory, array B in shared memory.
1: Calculate the number of elements per thread, N_thread;
2: Calculate the data range of this thread, using stride and offset;
3: while index is in the range do
4:   for i = 0 to 10000 do
5:     Read A[index] and save it back;
6:     Write to B[ID_thread × bank (mod size_block)];
7:   end for
8: end while

The kernel in this algorithm performs about 16 cycles of calculation per iteration, and there are 64,000 iterations. The number of clock cycles required to access the memory is about bank × 4 per iteration [11,12,13]. The actual execution time and the execution time predicted by the developed model are shown in Figure 2.

Fig. 2. The results of experimental studies of the impact of access conflicts when using shared memory.

As can be seen from Figure 2, the execution time of the program depends linearly on the number of conflicts. Thus, to increase the performance of parallel algorithms, it is necessary to avoid overlap between the data read by different threads [14]. Figure 2 also shows that the presented model adequately estimates the execution time of the algorithm in the presence of memory conflicts, although it does not provide exact values due to the closed hardware platform.
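For concreteness, a minimal CUDA sketch of a kernel with the shape of Algorithm 1 is given below. It is an illustration written for this text, not the code used in the experiments; the kernel name, the start-index rule, and the launch configuration are assumptions, and N_thread is passed in directly as nPerThread.

    /* Illustrative CUDA sketch of Algorithm 1 (assumed form). */
    __global__ void globalAccessBench(int *A, int nPerThread, int stride,
                                      int offset)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        /* Start indices are chosen so that consecutive threads begin
           nPerThread elements apart when stride == 1 (uncoalesced) and
           one element apart when stride == 32 (fully coalesced). */
        int index = offset + (stride == 1 ? tid * nPerThread : tid);
        for (int i = 0; i < nPerThread; ++i) {  /* steps 3-7 */
            int r = A[index];                   /* step 4: read into R   */
            A[index] = r + 1;                   /* step 5: increment,
                                                   write back            */
            index += stride;                    /* step 6                */
        }
    }

A launch such as globalAccessBench<<<N / (256 * nPerThread), 256>>>(dA, nPerThread, stride, 0) would then sweep stride over the values shown in Figure 1.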

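Similarly, a possible CUDA form of Algorithm 2 is sketched below; the block size, the shared array length, and the final dummy read (which keeps the compiler from discarding the otherwise unused shared memory writes) are assumptions of this sketch.

    #define BLOCK_SIZE 256  /* assumed size_block: threads per block and
                               length of the shared array B */

    /* Illustrative CUDA sketch of Algorithm 2 (assumed form). */
    __global__ void sharedAccessBench(int *A, int nPerThread, int stride,
                                      int offset, int bank)
    {
        __shared__ int B[BLOCK_SIZE];
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int index = offset + (stride == 1 ? tid * nPerThread : tid);
        for (int n = 0; n < nPerThread; ++n) {    /* steps 3-8 */
            for (int i = 0; i < 10000; ++i) {     /* step 4    */
                int r = A[index];
                A[index] = r;                     /* step 5: read, save back */
                /* step 6: threads of a half-warp write shared addresses
                   bank words apart; as described above, larger bank
                   values produce more bank conflicts. */
                B[(threadIdx.x * bank) % BLOCK_SIZE] = r;
            }
            index += stride;
        }
        /* Dummy dependence so the shared writes are not optimized away. */
        if (B[threadIdx.x] == 1234567) A[0] = 0;
    }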




VI. CONCLUSIONS

This paper has presented an experimental study of previously developed models for predicting the performance of a heterogeneous computer system in telecommunications. The study showed that the developed models provide an adequate estimate of the expected running time of an algorithm for various parameters of the GPU. It is worth noting, however, that the estimate obtained with the developed models is not exact, because an average access time is used for all levels of the memory hierarchy. In the future, it is planned to refine the models by taking into account memory access times that obey a rank distribution, for example a Pareto or Zipf distribution.

ACKNOWLEDGMENT

The work was carried out under a grant from the President of the Russian Federation for state support of young Russian scientists, № МК-2378.2020.9.

REFERENCES

[1] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A.E. Lefohn and T.J. Purcell, “A survey of general-purpose computation on graphics hardware,” Computer Graphics Forum, vol. 26, no. 1, pp. 80-113, 2007.
[2] A.A. Kolpakov and Y.A. Kropotov, “Development of a model for predicting the performance of a heterogeneous computer system in telecommunications,” Proceedings of ITNT, Samara National Research University, pp. 2265-2274, 2018.
[3] S. Fortune and J. Wyllie, “Parallelism in Random Access Machines,” Proceedings of the 10th Annual ACM Symposium on Theory of Computing (STOC), ACM, New York, NY, USA, pp. 114-118, 1978.
[4] Y.A. Kropotov and A.A. Kolpakov, “Experimental study of the model for predicting the performance of a heterogeneous computer system in telecommunications,” 12th International Scientific and Technical Conference Dynamics of Systems, Mechanisms and Machines (Dynamics), 2019. DOI: 10.1109/Dynamics.2018.8601478.
[5] L.G. Valiant, “A Bridging Model for Parallel Computation,” Communications of the ACM, vol. 33, no. 8, pp. 103-111, 1990.
[6] P.B. Gibbons, Y. Matias and V. Ramachandran, “The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms,” SIAM Journal on Computing, vol. 28, no. 2, pp. 733-769, 1999.
[7] CUDA C Programming Guide [Online]. URL: http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
[8] D.R. Helman and J. JaJa, “Designing Practical Efficient Algorithms for Symmetric Multiprocessors,” Lecture Notes in Computer Science, International Workshop, vol. 1619, pp. 37-56, 1999.
[9] A.A. Kolpakov and Y.A. Kropotov, “Advanced mixing audio streams for heterogeneous computer systems in telecommunications,” CEUR Workshop Proceedings, vol. 1902, pp. 32-36, 2017.
[10] A.A. Kolpakov, “Theoretical estimate of the growth performance of a computer system using multiple computing devices,” The World of Scientific Discoveries, no. 1, pp. 206-209, 2012.
[11] Y.A. Kropotov, A.A. Belov and A.Yu. Proscuryakov, “Issues of processing experimental time series in an electronic system of automated control,” Electronics Issues, vol. 1, no. 1, pp. 95-101, 2010.
[12] Y.A. Kropotov, “The algorithm for determining the parameters of the exponential approximation of the law of the probability distribution of the amplitudes of a speech signal,” Radio Engineering, no. 6, pp. 44-47, 2007.
[13] Y.A. Kropotov and A.A. Bykov, “The model of the law of the probability distribution of signal amplitudes in the basis of the exponential functions of the system,” Design and Technology of Electronic Tools, no. 2, pp. 30-34, 2007.
[14] Y.A. Kropotov, A.Yu. Proscuryakov and A.A. Belov, “A method for predicting changes in time series parameters in digital information management systems,” Computer Optics, vol. 42, no. 6, pp. 1093-1100, 2018. DOI: 10.18287/2412-6179-2018-42-6-1093-1100.



