Software for heterogeneous computer systems and structures of data
                 processing systems with increased performance
                                                  A.A. Kolpakov1, Ju.A. Kropotov1
                               1
                                Murom institute (branch) VlSU, Orlovskaya st., 23, 602264, Murom, Vladimir region, Russia


Abstract

The issue of creating high-performance computing systems based on heterogeneous computer systems is topical, as the volumes of processed
information, calculations and studies with large data sets are constantly increasing. The aim is to develop software design techniques
heterogeneous computer data processing system. As a result, a technique has been developed for combining shaders to improve the
performance of heterogeneous computations. The presented solution is supposed to be implemented as a software module for a computer
system using CUDA technology.

Keywords: parallel computing; heterogeneous computing systems; graphics processors; CUDA; OpenCL


1. Introduction

   The GPGPU program can be conditionally represented using the following sets:
       − set of shaders that perform calculations;
       − set of variables that control the computations;
       − the set of data over which the calculations are performed and in which their results are recorded;
       − a set of instructions that run a particular shader, provide him with input for certain data and output the result to a
          certain texture.

1.1 GPU programming based on vertex and pixel programs

    The elementary primitive for visualization, with which the graphics processor works, is a triangle. With each vertex of a
triangle, it is possible to associate a limited set of arbitrary data, for example, its color, normal, and other user data, Up to 8
textures can be associated with the primitive itself – one-, two-, and three-dimensional images. The structural scheme of data
processing based on vertex and pixel programs is shown in fig. 1.
                       User
                     software
                                     3D scene control
                                       commands
                    OpenGL or                                                      Interpolated data
                                                                                                                Pixel              Pixel color
                     DirectX
                                                                                                              program
              3D scene                Set of vertex
               control                                                   Assembled
                                         indices                                                                 Pixel
             commands                                                     primitive
                                                                                                              Coordinates

                GPU command                             Assembling                            Data
                                                                                                                            Rasterization
                      buffer                             primitives                       Interpolation

                                                                                                               Mixing
                   Set                                                                                        operation
                                        Vertex                Transformed set of                                               Screen
                of vertex
                                                                   vertices                                                     plane
                                       program

                                   Fig. 1. The structural scheme of data processing based on vertex and pixel programs.

   As shown in fig. 1, the user application sends requests for visualization of a 3D-scene to the OpenGL or DirectX low-level
programming libraries, which are parts of the operating system. Then this data is transformed with the graphics card driver into
direct commands of the graphics processor [1,2,3].

1.2 GPU programming based on CUDA library

   Despite the fact that programming vertex and pixel programs has proven effective for modeling various physical processes, the
use of this method is not convenient for the programmer. The programmer needs to have a sufficiently high qualification to
perform computational tasks on the graphics processor, i.e. to use the principles of the GPU in detail, because a small inaccuracy


3rd International conference “Information Technology and Nanotechnology 2017”                                                                    25
                                             High-Performance Computing / A.A. Kolpakov, Ju.A. Kropotov
in the control code of the graphics processor can lead to a significant distortion of the result of the calculations. The structural
scheme of the software on the basis of CUDA library is shown in fig. 2.
                                                                        Software


                                   CUDA
                                  library


                                   CUDA Runtime


                                                CUDA driver


                                                                  Graphic processor

                                         Fig.2. The structural scheme of the software on the basis of CUDA library.

      The executable unit of the CUDA program is warp. The size of warp is 32 threads. This is due to the fact that latency is 4
cycles when executing one instruction on the multiprocessor. Only with respect to warp we can talk about the parallel execution of
flows, no other assumptions can be made. However, this does not mean that warps are executed sequentially on the
multiprocessor. The execution of warps can be parallel, for example, in the event that one warp expects data from global memory,
other warps can be executed at this time.
      Interaction between threads can be carried out only within the block. Data is exchanged via a shared memory which is
common to all streams in the block. The execution of threads can synchronize by calling special synchronization functions.

1.3 GPU programming based on OpenCL library

   The development of the GPU and its use in tasks unrelated to computer graphics has resulted in the development of a single
standard for describing computations on highly parallel systems – OpenCL (Open Computational Library). The generated
OpenCL library appeared on the basis of the previously developed Nvidia CUDA technology, which describes the interface of
application interaction with the computing resources of the graphics processor. Unlike CUDA technology, OpenCL technology
describes a computation model without connection to a specific type of device on which these calculations will be executed. Due
to the fact that OpenCL is designed exactly as a standard for computations on highly parallel systems, many specific features of
CUDA technology have been excluded from the standard. In general, CUDA technology has more possibilities, in comparison
with OpenCL for describing parallel computing, if the Nvidia graphics processor is the computing device.
   OpenCL allows to describe the calculations, abstracting from a particular device, on which these calculations will be
implemented. In general, OpenCL algorithms can be executed on several CPU cores, on a graphics processor or on IBM Cell /
B.E processors. OpenCL implementation uses extensions of the C language to describe the algorithm [3,4].
   The OpenCL library is a promising library for use in various scientific research. The advantage of the OpenCL library is
support from high-performance clusters. Any application that uses OpenCL can be run without modifications on a cluster that
contains, among other things, graphics processors. This application will be available to all existing computing resources in the
system. The structural scheme of the software on the basis of the OpenCL library is shown in fig. 3.

                                  Central processor unit
                                                                                               Graphic processor
                                     (host-program)


                                                       The runtime (context)


                           Source code of                  Compiled OpenCL-                   Memory              Execution
                          OpenCL-programs                  programs (kernels)                  buffer              Queue


                                     Fig. 3. The structural scheme of the software on the basis of the OpenCL library.


3rd International conference “Information Technology and Nanotechnology 2017”                                                  26
                                             High-Performance Computing / A.A. Kolpakov, Ju.A. Kropotov
   The central element of the OpenCL platform model is the concept of a host, the primary device that manages OpenCL
calculations and performs all interactions with the user. The host is always represented in a single instance, while the OpenCL-
devices on which the OpenCL-instructions are executed can be represented in the plural. OpenCL-device can be CPU, GPU,
DSP or any other processor in the system, supported by the OpenCL-drivers installed in the system.

2. Software of a heterogeneous computer data processing system

   The block diagram of the developed software of a heterogeneous computer data processing system is shown in fig. 4.
                                                                        Source code


                                            Memory of              Module for constructing
                                                                                                           Variable
                                            constants               and processing a text
                                                                                                           memory
                                                                            tree


                                            The scheme of           Module for constructing
                                             the program              and processing a                 DB of text trees
                                                 tree                   program tree


                                                                      The decision module
                                                                                                      Code processing
                                                                       on the transfer of
                                                                                                         module
                                                                          calculations


                                            Optimized                                                    Finite state
                                                                       Execution module                  automaton
                                         executable code


                         Fig. 4. The block diagram of the developed software of a heterogeneous computer data processing system.

   According to fig. 4 the input source for the software is the source code of the program being processed. It enters the module
for building and processing a text tree, in which the initial processing of the source code occurs and the construction of a text
tree on its basis. Memory constants and variable memory are used to work with a text tree.
   The received text tree is transferred for processing to the module for constructing and processing the program tree. The
module for building a software tree uses a database of previously processed text trees, which allows to significantly accelerate
the construction of a program tree [5,6].
   For the processing of the program it is necessary to build a tree that represents a text description in a convenient format. There
can be several types of nodes in the program tree:
         root node – contains all the data that must be computed during the execution of the program as descendants;
         node with data – represents an array of data that is required to transmit the input ancestor or into which to write the
             result descendant;
         node with variable – represents a variable that must be passed to the input of the ancestor shader or that describes the
             condition of the conditional parent operator;
         node with no operations – transfers the data of the descendant to the ancestor without changing them;
         node with shader – performs a shader with input data received from the children and passes the result to the
             ancestor;
         node with arithmetic operation – performs arithmetic transformation of input data received from descendants and
             passes the result to the ancestor;
         node with branching start – describes a conditional statement; the first child is a variable that describes which branch
             to choose, followed by the various branches of execution;
         node with branching completed – describes the completion of the conditional statement, all its branches are reduced
             to this node.
   Trees are alternately built for each instruction. The parser parses the string representation, defines the output array with the
data, optionally expands the abstractions and defines the set of functions that must be performed to obtain the result.
   In each subsequent tree, the node with the data that was recorded in one of their previous instructions is replaced by the tree
of the last instruction. The program tree is obtained after processing the last instruction. It should be noted that the software tree
may not be a tree by definition - some nodes may have more than one ancestor, but most often it represents a tree or a structure
close to it.
   The software tree is transferred to the module of decision about the transfer of calculations, where the original algorithm is
divided into stages and a decision is made to transfer the calculations to the GPU for each stage with further transfer to the code
processing module.


3rd International conference “Information Technology and Nanotechnology 2017”                                                      27
                                               High-Performance Computing / A.A. Kolpakov, Ju.A. Kropotov
3. Development of the code processing module

    The code processing module makes the necessary changes to the source code program to produce a finite automaton of the
program transmitted in the execution unit, generating executable code treated.
    The level of the target executor for the family of accelerators takes on the input a graph of functional operations and on the
output receives a program for the GPU [7,8]. This program has the form of a performance script, in which operations are
available for starting the shader on the GPU with a certain set of parameters, allocating and freeing memory, transferring data
from RAM to video RAM and back. The basic stages of broadcasting the program at the level of the target performer:
    1. Splitting the original graph into subgraphs corresponding to different passes. Subsequently each subgraph is broadcast
separately.
    2. Select the display for the data and for the code. In this case, for example, one array of logical data can be mapped to several
physical GPU buffers, and the restore operation in the source code is performed in a loop, with each iteration processing a block
of 4 elements.
    3. Code generation for the GPU. After the mappings are selected, they are generated by the code on the GPU. In particular,
the indexes for the GPU buffers are calculated and the repetitive reads are eliminated.
    4. Conversion of the generated code. At this stage, the arithmetic expressions are merged into vector commands and the
constants are convoluted.
    Since the individual expressions calculated on the GPU are relatively small, and the passage requires a large computing
power of the shader, the first stage is relatively simple. Most often, the result of his work is the only subgraph that coincides with
the original graph. The main criterion for partitioning into passages is the reuse of data. Splitting is only possible if you can not
avoid multiple calculations of the same data without it.
    The most difficult from the point of view of implementation are the stages 2 and 3. Step 4 is relatively simple and is
performed using arithmetic transformations. The possibility of merging arithmetic expressions into vector operations is provided
by an efficient choice of mappings in stages 2 and 3.
    The functional graph C$ defines the following types of vertices [9]:
    Sheet tops:
    1. Constants.
    2. Related variables. In fact, are similar to the cycle. They run through a certain range and can be used to calculate the index
expressions.
    3. Functions. They can be member functions (including standard operations) or arrays. If the function is found expression, it
is applied to its arguments, so in the end it will not be presenting.
    Internal vertices:
    5. Applying a function to arguments (@). This can be an application of a normal function or a reference to an array element.
    6. Reduction operation (R). It has 2 arguments, the first of which specifies a reducing function, and the second one is
reducible. Reduction is applied to all dimensions of an array that does not contain associated variables, as well as related
variables, which allows partial reduction.
    Associated variables "spread" from the bottom up the graph. They can end their distribution only at the vertices of reduction
or at the root apex. We also assume that an acyclic graph is arranged in such a way that all paths of propagation of a bound
variable terminate on the same vertex.
    Below is an example of combining two shaders, in which you can see the non-optimal parts of the code. The block diagram of
the first shader is shown in fig. 5.
                                                                    vec4 inp1 = texture2DRect
                                       Begin                     (input_tex, gl_TexCoord[0].st);


                                                                 vec4 inp2 = texture2DRect (input_tex,
                                                                  gl_TexCoord[0].st + vec2(0.0, 1.0));


                                        End                             gl_FragColor = inp1 * inp2;


                                                         Fig. 5. Block diagram of the first shader.

    A block diagram of the second shader is shown in fig. 6.
    The block diagram of the combined shader, obtained after merging the sequence of these shaders, is shown in fig. 7.
    To perform the shader more efficiently, it is necessary to perform its correction. A block diagram of the final adjusted shader
is shown in fig. 8.
    After carrying out all the above-described changes, the shader code will be compiled and will be ready for use in calculations.


3rd International conference “Information Technology and Nanotechnology 2017”                                                    28
                                               High-Performance Computing / A.A. Kolpakov, Ju.A. Kropotov
                                                                             vec4 first =
                                         Begin                  texture2DRect(shader1_tex, gl_TexCoorc
                                                                               [0].st);


                                                                           vec4 inp = first +
                                                                  texture2DRect(input_tex, gl_TexCoorc
                                                                                [0].st);


                                         End                            gl_FragColor = inp * coeff;


                                                        Fig. 6. Block diagram of the second shader.


                                                                   vec4 inp1 = texture2DRect (input_tex,
                                         Begin                              gl_TexCoord[0].st);


                                                                  vec4 inp2 = texture2DRect (input_tex,
                                                                   gl_TexCoord[0].st + vec2(0.0, 1.0));


                                                                           vec4 result = inp1 * inp2;


                                                                                vec4 first = result;


                                                              vec4 inp = first + texture2DRect(input_tex,
                                                                          gl_TexCoord[0].st);


                                        End                               gl_FragColor = inp * coeff;


                                                       Fig. 7. Block diagram of the combined shader.


                                                              vec4 inp1 = texture2DRect (input_tex,
                                      Begin                            gl_TexCoord[0].st);


                                                              vec4 inp2 = texture2DRect (input_tex,
                                                               gl_TexCoord[0].st + vec2(0.0, 1.0));


                                                                      vec4 result = inp1 * inp2;


                                                                       vec4 inp = result + inp1;


                                       End                            gl_FragColor = inp * coeff;


                                                       Fig. 8. Block diagram of the resulting shader.

4. Conclusion

   The program provides the schemas of the processed program tree and the generated automaton and the received shaders as
output. The scheme of the processed program tree is shown in Fig. 9.
   As can be seen from fig. 9, the software tree consists of nodes of various types, each of which describes the behavior of the
program: the root node; node with data; node with variable; node with no operations; node with a shader; node with arithmetic
operation; node with the beginning of branching; node with the completion of branching.
   The code for the graphics processor is generated in accordance with the selected display. Each key vertex is associated with a
set of output variables, for the calculation of which it responds. The root node, in addition to their calculation, is responsible for
writing them to the output buffers. For each sub-graph of operations, a code template is generated. Based on this template, a
code is generated for each set of values of the block measurement instances [10].

3rd International conference “Information Technology and Nanotechnology 2017”                                                    29
                                                     High-Performance Computing / A.A. Kolpakov, Ju.A. Kropotov

                                                                                            DATA: bg

                                  DATA: bg                 SHADER: ad
                0
                                                  0                                   1   VAR: advanceValue

              ROOT                                                                                                              IF
                                                                                  0

                                                                                             DATA: in            0          1           2
                                                                                      1
                                 DATA: out                 SHADER: tr
               1
                                                 0                                                             bleBW   SHADER:               NOP
                                                                                          trackTreshold                  bw
                                                                                  0
                     DATA: in

                                             SHADER: er                                        VAR: enableER      0
                                                                  0
                                                                                                                                     ENDIF


                                                                            ADD                     DIV2
                                                                                                                1
                                                                                                                       IF
                         ENDIF

                                                                        1                                              2
                                              SHADER:                                                          NOP
                                                er2


                                                                        Fig. 9. Processed software tree.

   The most difficult part of the code generation phase is the generation of access commands in RAM. By mapping, the number
of buffer numbers that can participate in this data reading operation is determined. Further, a set of two-dimensional indexes for
reading from this buffer is determined for different values of block indexes for each of the buffer numbers. There is the vertex
closest to the root, in which the index can be calculated, for each index, based on a set of variables on which it really depends.
Improves the calculation of indices, which are linear by iteration of the cycle. In each key vertex, a set of indices is marked,
which it needs to compute, in order to avoid the recalculation of the indices. The data reading operation is performed in the same
key node to which the index belongs. This allows you to avoid re-reading the same data from the RAM, even if they are made by
different reading vertices.
   To work with the indices used automatic conversion tools of integral and simplifying expressions. They support the
conversion of expressions containing arithmetic operations, minimum, maximum, and comparison operations.
   The program allows flexible manipulation of options.
   In the course of investigating the effectiveness of the developed information processing algorithms, the problem of finding
zero bit vectors was solved, which is solved using genetic algorithms. In solving this problem, the main working time is
occupied by parallel computations of the values of the fitness functions of different individuals, the operations of crossing and
mutation. The algorithm has properties that are characteristic of many genetic algorithms:
   1. Representation of an individual in the form of a bit string.
   2. A small number of logical operations in calculating the function of fitness, performing mutation and crossing.
   3. Consecutive memory access.
   These properties allow to effectively use the calculations on the graphics processor.
   The mutation operation is standard for such individuals - changing the value of one random bit. A single-point crossover is
used as a cross operation.
   The function of fitness of an individual is the number of units in it. Accordingly, it is necessary to deduce an ideal individual
with a zero fitness function. There is an algorithm that allows you to calculate the number of units in a 32-bit number using only
arithmetic operations.
   To test the efficiency of the algorithm for improving data processing efficiency, we used a test computer system of the
following configuration: CPU Intel Core 2 Quad Q9400 (2.66GHz), 8GB RAM, graphics card Nvidia GeForce GTX560 2Gb
with 336 streams, Windows 7 x64 operating system, MS compiler Visual Studio 2008 in release mode.
   The average time spent on receiving a new generation for different sizes of the problem and the number of individuals in the
generation was investigated. For this, several iterations of obtaining the next generation (about 100-1000 starts) were started and
the total time spent on the whole algorithm was divided by the number of generations obtained.
   Performance study In the first challenge test varied number of 32-bit integers in the array (M) and the number of parallel
streams (N).
   We investigated the average time t spent on obtaining a new generation for a different number of 32-bit integers in an array
and the number of parallel threads. The research was conducted using OpenCL and NVIDIA CUDA technologies [4]. The
results of the study of the average time spent on obtaining a new generation for the parameter M = 10 are shown in Fig. 10.
   In fig. 10 graphics OpenCL and CUDA show the execution time of the base algorithm, OpenCL-O and CUDA-O - using the
developed algorithm for improving data processing performance. As can be seen from the graphs, with the value of the
parameter M = 10, the application of the developed optimization algorithm gives an increase in performance: in the case of
OpenCL, the processing time for 128 threads is reduced from 1.08 ms to 0.75 ms and from 219 ms to 84.2 ms for 102400 Flows.
In the case of NVIDIA CUDA, the processing time is reduced from 0.85 ms to 0.74 ms for 128 threads and from 189 ms to 48.4
ms for 102400 streams [4].
3rd International conference “Information Technology and Nanotechnology 2017”                                                                      30
                                              High-Performance Computing / A.A. Kolpakov, Ju.A. Kropotov
                        240
                                                                                       200
                       t, мс
                        220                                                           t, мс
                                                                                       180
                        200
                                                                      OpenCL
                                                                                       160
                        180                                                                                                    CUDA

                        160                                                            140


                        140                                                            120

                        120
                                                                                       100

                        100
                                                                                        80
                         80
                                                                     OpenCL - 0         60
                                                                                                                        CUDA - 0
                         60
                                                                                        40
                         40

                         20                                                             20


                          0                                                              0
                                128          1024        10240       102400       N           128   1024       10240         102400   N

                                                     а                                              b
Fig. 10. The average time spent by a parallel system to receive a new generation, whith M =10, а: OpenCL – basic algorithm, OpenCL-О – developed algorithm,
                                                  b: CUDA – basic algorithm, CUDA-О – developed algorithm.

References

[1] Bahvalov NS, Voevodin VV. Modern Problems of Computational Mathematics and Mathematical Modelling. Computational Mathematics Book. Vol. 1.
     Moscow, Science, 2005; 342 p. (in Russian)
[2] Boreskov AV. Shaders development and debugging. Sankt-Peterburg: BHV-Peterburg, 2006; 488 p. (in Russian)
[3] Kolpakov AA. Theoretical evaluation of growth performance computing systems from the use of multiple computing devices. V mire nauchnykh otkrytii
     2012; 1: 206–209. (in Russian)
[4] Kolpakov AA. Optimizing the use of genetic algorithms for computing graphics processors for the problem of zero bit vector. Informatsionnye sistemy i
     tekhnologii 2013; 2(76): 22–28. (in Russian)
[5] Barkalov KA, Gergel VP. Parallel global optimization on GPU. Journal of Global Optimization 2016; 66(1): 3–20.
[6] Strongin RG, Gergel VP. Parallel computing for globally optimal decision making on cluster systems. Future Generation Computer Systems 2005; 21: 673–
     678.
[7] Galimov MR, Biryalcev EV. Some technological aspects of GPGPU applications in applied program systems. Vychislitelnye metody i programmirovanie
     2010; 11: 77–93. (in Russian)
[8] Kropotov JuA, Belov AA, Proskuryakov AJu, Kolpakov AA. Methods of Designing Telecommunication Information and Control Audio Exchange Systems
     in Difficult Noise Conditions. Sistemy upravleniia, sviazi i bezopasnosti 2015; 2: 165–183. (in Russian)
[9] Boreskov AV. OpenGL extension. Sankt-Peterburg: BHV-Peterburg, 2005; 672 p. (in Russian)
[10] Ermolaev VA, Kropotov JuA. Algorithms for processing acoustic signals in telecommunication systems by local parametric methods of analysis.
     Proceedings International Siberian Conference on Control and Communications (SIBCON), 2015. URL: http://ieeexplore.ieee.org/document/7147109/.


3rd International conference “Information Technology and Nanotechnology 2017”                                                                         31