Investigation of the RAM access model in a heterogeneous computing system

Aleksander Kolpakov, Aleksey Belov, Yuriy Kropotov
Department of Electronics and Computer Science
Vladimir State University named after Alexander and Nicholay Stoletovs
Murom, Russia
ORCID: 0000-0001-9328-2331 (Kolpakov), 0000-0002-2072-7042 (Belov), 0000-0002-9012-3092 (Kropotov)

Abstract—The issue of creating high-performance computing systems based on heterogeneous computer systems is topical, since the volumes of processed information, calculations and studies with large data sets are constantly increasing. The aim of this work is an experimental study of previously developed models for predicting the performance of heterogeneous computer systems in telecommunications. The study showed that the developed models make it possible to obtain an adequate estimate of the expected running time of an algorithm for various parameters of the GPU, with some limitations.

Keywords—graphic processor units, heterogeneous computing systems, parallel calculations

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. INTRODUCTION

One of the most dynamically developing areas in parallel programming at the moment is the use of computer systems with a heterogeneous architecture, which contain computing devices with different architectures and, accordingly, different methods of parallel computing. The most common approach in the design of such systems has been the use of graphics cards, or devices based on them, as the main parallel computing device. The growing need to solve big problems stimulates research and innovation in the field of parallel computing in general and in the development of methods for graphics processors in particular.

II. RESEARCH AND DEVELOPMENT OF MODELS OF HETEROGENEOUS COMPUTING SYSTEMS BASED ON GRAPHICS PROCESSORS

Modern graphics processors (GPUs) are parallel processors. More precisely, they are known as stream processors, because they are capable of performing various functions on an incoming data stream. They are advanced architectures designed for parallel processing of data (primarily graphics). They are currently extremely powerful programmable processors with MIMD architecture capabilities, subject to some limitations.

As technology, languages and hardware evolved, researchers were able to leverage the added flexibility of GPUs to deploy non-graphical applications on the GPU (GPGPU), especially for image processing. A more detailed history of the development of GPGPU is presented in [1].

A further development momentum was the emergence of CUDA, the NVIDIA C-based development environment for GPGPU. CUDA allows developers unfamiliar with graphics programming to write code that can be executed on the GPU, providing the necessary abstractions for writing multi-threaded programs with little or no knowledge of the graphics APIs. Since then, many parallelized applications have been developed for GPUs, many of which offer significant acceleration compared to sequential implementations on the processor.
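To make the programming model concrete, the following minimal CUDA sketch (ours, not code from the study; the kernel name, sizes and launch configuration are illustrative assumptions) adds two vectors without any reference to graphics APIs:

#include <cuda_runtime.h>

// Hypothetical example: each thread processes one element of the vectors.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host side: copy inputs to the device, launch the kernel, copy the result
// back (error checking omitted for brevity).
void runVectorAdd(const float *hA, const float *hB, float *hC, int n)
{
    float *dA, *dB, *dC;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

The explicit host-to-device and device-to-host copies in such a program are exactly the transfer terms that appear in the running-time model discussed next.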
A model of the upper bound of the running time of an algorithm on a graphics processor in a CPU-GPU environment is presented in [2] and is based on the abstract PRAM model [3]. According to this model, the upper estimate of the running time on the graphics processor is calculated by the formula

T_{GPU}(N) = \sum_{i=1}^{B(N)} T_i^G(N) + \frac{N_{HD}(N)}{S_{HD}} + \frac{N_{DH}(N)}{S_{DH}},

where B(N) is the number of kernels executed, T_i^G(N) is the running time of the i-th kernel on the GPU, N_HD(N) and N_DH(N) are the volumes of data transferred from host to device and from device to host, and S_HD and S_DH are the corresponding transfer rates.

This model does not account for certain features of the graphics processor, such as the warp size, the GPU memory access time, etc. A GPGPU performance prediction model that combines several known parallel computing models was therefore given in [4]. Given the complex architecture of the GPU, none of these models is complete on its own, and a combination of them, together with a few extensions, is required. The following models were used during development:

1) The PRAM model [3].
2) The BSP model [5].
3) The QRQW model [6].

The final equation of this model is:

T(K) = \frac{N_B(K) \cdot N_w(K) \cdot N_t(K) \cdot C_T(K)}{N_C \cdot D \cdot R}

All parameters of this model are listed in Table 1.

TABLE I. THE LIST OF DEVELOPED MODEL PARAMETERS

Parameter   Description
D           The kernel pipeline depth
N_C         The number of cores per SM
R           The GPU clock frequency
C_T(K)      The maximum number of ticks consumed by any thread in kernel K
N_t         The number of threads in a warp (32)
N_w         The number of warps per block
N_B(K)      The number of blocks in kernel K
K_i         The i-th kernel on the GPU
T(K)        The time spent by kernel K
T(P)        The time spent by program P
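As a numerical illustration, the model can be evaluated for a hypothetical configuration (the values below are our own assumptions, loosely patterned on a GTX 280-class device, and are not measurements from this study): N_B(K) = 120 blocks, N_w(K) = 8 warps per block, N_t = 32 threads per warp, C_T(K) = 4000 ticks, N_C = 8 cores per SM, D = 4, R = 1.296 GHz. Then

T(K) = \frac{120 \cdot 8 \cdot 32 \cdot 4000}{8 \cdot 4 \cdot 1.296 \times 10^9} \approx 3.0 \times 10^{-3} \text{ s},

that is, the model predicts a kernel time of about 3 ms for this configuration.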
The performance of a CUDA kernel can vary greatly with slight changes in the memory access strategy. Using shared memory can provide up to 20 times better performance than using global memory, and coalesced global memory access can be up to 5 times faster than non-coalesced access. Arithmetic operations also require different numbers of cycles: for example, an integer addition takes 4 cycles, while an integer modulo operation takes 48 cycles [7, 8]. Any model that does not capture these variations is unlikely to be accurate.

III. USING GPU MEMORY ACCESS PATTERNS TO IMPROVE HETEROGENEOUS COMPUTING SYSTEM PERFORMANCE

Since the delay when reading data from the global memory of the GPU is up to 200 times higher than when reading data from the registers, providing the most efficient way of accessing the GPU RAM is crucial for improving the performance of the GPU in particular and of the heterogeneous system as a whole. Optimizing global memory access is therefore the single most important programming factor for the GPGPU architecture.

The GPU reads and writes global memory in half-warps (16 threads), and the device can service a half-warp with just one global memory transaction if certain access requirements are met. For the GTX 280 video card, the following protocol determines the number of transactions used by a half-warp:

1) Find the memory segment that contains the address requested by the active thread with the lowest number. The segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64- and 128-bit data:
a) if the transaction size is 128 bytes and only the upper or lower half of the segment is used, the transaction size is reduced to 64 bytes;
b) if the transaction size is 64 bytes and only the lower or upper half is used, the transaction size is reduced to 32 bytes.
2) Perform the transaction and mark the serviced threads as inactive.
3) Repeat until all threads of the half-warp are serviced.

When more than one thread requests data from addresses that fall into the same segment, one transaction can satisfy all such threads. This servicing of multiple requests in a single transaction is called coalescing. It follows that certain GPU memory access patterns have a positive effect, while others progressively increase the delay as the memory addresses requested by the half-warp threads diverge.

The shared memory of the GPU is divided into equally sized memory modules, called banks, which can be accessed simultaneously by several threads. Any memory read or write request consisting of n addresses that fall into n separate memory banks can be carried out simultaneously, which gives an effective throughput n times higher than the throughput of a single module. However, if two addresses of a memory request fall into the same memory bank, a bank conflict occurs and the access must be serialized. The GPU splits memory requests with bank conflicts into as many separate conflict-free requests as necessary, reducing the effective bandwidth by a factor equal to the number of separate requests. For the GTX 280, the warp size is 32 and the number of banks is 16. A shared memory access for a warp is split into one request for the first half-warp and one request for the second half-warp. As a result, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.
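A common way to exploit this bank structure (a sketch of ours, not code from the study) is to pad shared memory arrays so that the threads of a half-warp fall into distinct banks. In a 16x16 tile of 32-bit words, an entire column maps to a single bank; one extra column of padding spreads it across all 16 banks:

// Hypothetical example: transposing a 16x16 tile through shared memory,
// assuming a 16x16 thread block and a width divisible by 16.
__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[16][17];   // the +1 column avoids bank conflicts

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    int tx = blockIdx.y * 16 + threadIdx.x;
    int ty = blockIdx.x * 16 + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // conflict-free column read
}

With an unpadded tile[16][16], the column read tile[threadIdx.x][threadIdx.y] would be serialized into 16 separate requests; the padded layout keeps it to one.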
Shared memory also has a broadcast mechanism that allows a 32-bit word to be read and broadcast to several threads simultaneously in a single read transaction. This reduces the number of bank conflicts when several threads of a half-warp read addresses within the same 32-bit word. More precisely, a read request made to several addresses is served in several steps over time, one step every two cycles, serving one conflict-free subset of the addresses per step, until all addresses are served. At each step, the subset is constructed from the addresses that have not yet been served using the following procedure:

1) Select one of the words indicated by the remaining addresses as the broadcast word.
2) Include in the subset:
a) all addresses that are within the broadcast word;
b) one address for each bank referenced by the remaining addresses.

The constant memory space is cached, so a read from constant memory costs a global memory read only on a cache miss; otherwise it is served from the constant cache. For all threads of a half-warp, reading from the constant cache is as fast as reading from a register, provided that all threads read the same address. The access time scales linearly with the number of different addresses read by the threads.

Texture memory allows data residing in global memory to be cached. If a requested item is in the cache, it is served in a single request; a cache miss results in a global memory read, which takes much longer.

To compare the access times of the different types of memory, 1,000,000 read operations were performed on each type of memory, both sequentially and randomly, on an NVIDIA GTX 280 GPU. The results are presented in Table 2. As can be seen, shared memory provides the best performance, followed by the constant cache and then the texture cache; global memory shows the highest latency.

TABLE II. COMPARISON OF ACCESS TIMES FOR DIFFERENT MEMORY TYPES

Memory type   Serial access, ms   Random access, ms
Global        969                 21777
Shared        51                  86.6
Constant      35                  192
Texture       140                 247

IV. EXPERIMENTAL MODELING OF GPU MEMORY ACCESS PATTERNS

For most parallel computing platforms, modeling memory access patterns and their associated costs is the most complex and most important part. The following algorithm is used to experimentally verify statements about the process of accessing memory on the GPU (a CUDA sketch of this benchmark is given after the listing):

Algorithm 1. Global Memory Access Benchmark
Input data: number of elements N, step stride, offset offset, array A in global memory.
1: Calculate the number of elements per thread, Nthread;
2: Calculate the data range of this thread using stride and offset;
3: while index is in the range do
4:   Read A[index] into variable R;
5:   Increment R and save it back to A[index];
6:   index = index + stride;
7: end while
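One possible CUDA realization of Algorithm 1 (a sketch under our assumptions about how the threads partition the array; the indexing scheme and all names are ours) is:

// Hypothetical sketch of Algorithm 1: every thread reads and increments
// nPerThread elements of A, spaced stride apart. With stride = 32,
// consecutive threads touch consecutive addresses at every step (coalesced);
// with stride = 1, each thread walks its own contiguous chunk, so the
// simultaneous accesses of a half-warp are far apart (non-coalesced).
// The number of accesses per thread does not depend on stride.
__global__ void globalMemBenchmark(int *A, int N, int stride, int offset)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    int nPerThread = N / nThreads;   // assumes N divisible by the thread count
    int index = (tid / stride) * (nPerThread * stride) + (tid % stride) + offset;
    int end = index + nPerThread * stride;
    while (index < end && index < N) {
        int R = A[index];    // global memory read
        A[index] = R + 1;    // increment and save back
        index += stride;     // next element of this thread's range
    }
}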
Using the above algorithm, an experiment was carried out that shows how much the gain from coalescing depends on the number of warp threads that share a transaction. This is controlled by the stride variable, which denotes the gap between elements accessed sequentially by a single thread. Threads of a half-warp can take advantage of coalesced access if the stride value is large. For example, when stride = 32, each warp thread receives consecutive elements, which ensures complete coalescing. When stride = 1, each thread reads elements that are offset by 32 from those of the neighboring thread, so the accesses are not coalesced at all and require 16 memory transactions to service a half-warp. To ensure a fair comparison, the number of accesses per thread in the above code does not depend on stride.

In the code shown in Algorithm 1, the amount of calculation per iteration is very small compared to the memory access delay for stride = 1. However, as the stride value increases, memory access and calculation take approximately the same number of clock cycles. Using the MAX model, we can estimate the running time of this kernel and compare it with the actual one in Figure 1, which shows the program execution time for various stride values.

Fig. 1. The results of experimental studies of modeling access to global memory using the MAX model.

As can be seen from Figure 1, the value of stride significantly affects the execution time of the algorithm. Because the amount of actual calculation in the algorithm is very small, the graph shows that, to increase the performance of the parallel algorithm, it is necessary to read array elements located no closer than the warp size from each other, so as to take advantage of coalesced memory access. It can also be seen from Figure 1 that, although the graph of the predicted execution time does not coincide with the graph of the actual time, it still conveys enough information about the loss of performance as the distance between the read elements decreases. This suggests that the model presented above adequately describes the performance of GPGPU. It should be noted that code that only accesses memory, that is, performs a small number of calculations, deviates from the model because of the limited information available about the hardware [9, 10].

V. EXPERIMENTAL VERIFICATION OF THE IMPACT OF ACCESS CONFLICTS WHEN USING SHARED MEMORY

In this experiment, while maintaining the general structure of global memory accesses as in the previous experiment, each thread also writes an element to shared memory. The shared memory access pattern is controlled by the bank variable, which can be set to a value from 0 to 16; with a larger bank value we can thus increase the number of access conflicts. A CUDA sketch of this benchmark is given after the listing.

Algorithm 2. Shared Memory Access Benchmark
Input data: number of elements N, step stride, offset offset, control variable bank, array A in global memory, array B in shared memory.
1: Calculate the number of elements per thread, Nthread;
2: Calculate the data range of this thread using stride and offset;
3: while index is in the range do
4:   for i = 0 to 10000 do
5:     Read A[index] and save it back;
6:     Write to B[(IDthread × bank) mod sizeblock];
7:   end for
8:   index = index + stride;
9: end while
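A possible CUDA realization of Algorithm 2 (again a sketch; the shared array size, the written value and the indexing scheme reuse our assumptions from the previous sketch) is:

// Hypothetical sketch of Algorithm 2: the bank parameter controls how many
// threads of a half-warp collide in the same shared memory bank. With 16
// banks of 32-bit words, bank = 1 gives conflict-free accesses, bank = 2
// gives 2-way conflicts, and so on up to 16-way serialization.
#define BLOCK_SIZE 256

__global__ void sharedMemBenchmark(int *A, int N, int stride, int offset, int bank)
{
    __shared__ int B[BLOCK_SIZE];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    int nPerThread = N / nThreads;
    int index = (tid / stride) * (nPerThread * stride) + (tid % stride) + offset;
    int end = index + nPerThread * stride;
    while (index < end && index < N) {
        for (int i = 0; i < 10000; ++i) {
            int R = A[index];                          // global memory read
            A[index] = R;                              // save it back
            B[(threadIdx.x * bank) % BLOCK_SIZE] = R;  // shared memory write
        }
        index += stride;
    }
}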
The kernel in this algorithm performs about 16 cycles of calculation per iteration, and there are 64,000 iterations. The number of clock cycles required to access the memory is about bank × 4 per iteration [11, 12, 13]. The actual execution time and the execution time predicted by the developed model are shown in Figure 2.

Fig. 2. The results of experimental studies of the impact of access conflicts when using shared memory.

As can be seen from Figure 2, the execution time of the program depends linearly on the number of conflicts. Thus, to increase the performance of parallel algorithms, it is necessary to prevent the data read by different threads from intersecting [14]. Figure 2 also shows that the presented model allows us to adequately estimate the execution time of the algorithm in the presence of memory conflicts, although it does not provide exact figures because of the closed hardware platform.

VI. CONCLUSIONS

This paper presents an experimental study of previously developed models for predicting the performance of a heterogeneous computer system in telecommunications. The study showed that the developed models make it possible to obtain an adequate estimate of the expected running time of an algorithm for various parameters of the GPU. However, it is worth noting that the estimate obtained with the developed models is not exact, because an average access time is used for all levels of the memory hierarchy. In the future, it is planned to refine the models by taking into account memory access times that obey a rank distribution, for example Pareto or Zipf.

ACKNOWLEDGMENT

The work was carried out under a grant from the President of the Russian Federation for state support of young Russian scientists № МК-2378.2020.9.

REFERENCES

[1] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A.E. Lefohn and T.J. Purcell, "A survey of general-purpose computation on graphics hardware," Computer Graphics Forum, vol. 26, no. 1, pp. 80-113, 2007.
[2] A.A. Kolpakov and Y.A. Kropotov, "Development of a model for predicting the performance of a heterogeneous computer system in telecommunications," Proceedings of ITNT, Samara National Research University, pp. 2265-2274, 2018.
[3] S. Fortune and J. Wyllie, "Parallelism in Random Access Machines," Proceedings of the 10th Annual ACM Symposium on Theory of Computing (STOC), ACM, New York, NY, USA, pp. 114-118, 1978.
[4] Y.A. Kropotov and A.A. Kolpakov, "Experimental study of the model for predicting the performance of a heterogeneous computer system in telecommunications," 12th International Scientific and Technical Conference Dynamics of Systems, Mechanisms and Machines (Dynamics), 2019. DOI: 10.1109/Dynamics.2018.8601478.
[5] L.G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM, vol. 33, no. 8, pp. 103-111, 1990.
[6] P.B. Gibbons, Y. Matias and V. Ramachandran, "The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms," SIAM Journal on Computing, vol. 28, no. 2, pp. 733-769, 1999.
[7] CUDA C Programming Guide [Online]. URL: http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
[8] D.R. Helman and J. JaJa, "Designing Practical Efficient Algorithms for Symmetric Multiprocessors," Lecture Notes in Computer Science, International Workshop, vol. 1619, pp. 37-56, 1999.
[9] A.A. Kolpakov and Y.A. Kropotov, "Advanced mixing audio streams for heterogeneous computer systems in telecommunications," CEUR Workshop Proceedings, vol. 1902, pp. 32-36, 2017.
[10] A.A. Kolpakov, "Theoretical estimate of the growth performance of a computer system using multiple computing devices," The World of Scientific Discoveries, no. 1, pp. 206-209, 2012.
[11] Y.A. Kropotov, A.A. Belov and A.Yu. Proscuryakov, "Issues of processing experimental time series in an electronic system of automated control," Electronics Issues, vol. 1, no. 1, pp. 95-101, 2010.
[12] Y.A. Kropotov, "The algorithm for determining the parameters of the exponential approximation of the law of the probability distribution of the amplitudes of a speech signal," Radio Engineering, no. 6, pp. 44-47, 2007.
[13] Y.A. Kropotov and A.A. Bykov, "The model of the law of the probability distribution of signal amplitudes in the basis of the exponential functions of the system," Design and Technology of Electronic Tools, no. 2, pp. 30-34, 2007.
[14] Y.A. Kropotov, A.Yu. Proscuryakov and A.A. Belov, "A method for predicting changes in time series parameters in digital information management systems," Computer Optics, vol. 42, no. 6, pp. 1093-1100, 2018. DOI: 10.18287/2412-6179-2018-42-6-1093-1100.