Automated Generation of OpenCL Programs Based on Algebra-Algorithmic Approach

Anatoliy Doroshenko1, Oleksii Beketov1, Mykola Bondarenko2, Olena Yatsenko1

1 Institute of Software Systems of National Academy of Sciences of Ukraine, Glushkov prosp. 40, 03187 Kyiv, Ukraine
doroshenkoanatoliy2@gmail.com, beketov.oleksii@gmail.com, oayat@ukr.net
2 Taras Shevchenko National University of Kyiv, Glushkov prosp. 4d, 03680 Kyiv, Ukraine
bondarenko_mykola@yahoo.com.ua

Abstract. The paper proposes a further development of algebra-algorithmic design and synthesis tools towards the development of OpenCL programs. A method for semi-automatic parallelization of loop operators is proposed. The distinguishing feature of the approach is the use of high-level algebra-algorithmic program specifications (schemes) and the rewriting rules technique. The developed tools support the construction of parallel algorithm schemes by superposition of predefined language constructs of Glushkov's system of algorithmic algebra, which are treated as reusable components. An algorithm scheme serves as the basis for generating source code in a target programming language. The approach is illustrated with an example of developing an OpenCL interpolation program used in numerical weather forecasting. The results of an experiment consisting in executing the generated OpenCL program on a graphics processing unit are given.

Keywords. Algorithmic algebra, automated algorithm design, CUDA, heterogeneous platform, OpenCL, parallel computation, software synthesis.

1 Introduction

Further progress in improving the quality of parallel software development is associated with the use of heterogeneous architectures of parallel computing systems. One of the facilities for programming heterogeneous parallel systems is OpenCL (Open Computing Language) [1], a framework for developing parallel software that executes across platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. Unlike Nvidia CUDA [2], which implements GPGPU (general-purpose computing on GPU) only for Nvidia accelerators and requires a specific compiler, OpenCL is a specification supported by various hardware vendors and only requires setting the path to the OpenCL library at compilation. Programming heterogeneous platforms is nevertheless a complex task, so there is a need for tools that automate software design and the parallelization of existing sequential programs for such platforms.

In previous works, we have been developing a theory, methodology and tools for automated program design based on the algebra of algorithmics [3, 4]. Algorithmics formalizes knowledge about subject domains by algebraic means and deals with problems of formalization, substantiation of correctness and transformation of algorithms. Formal facilities for the development of parallel programs for multicore CPUs [4] and Nvidia GPUs (using CUDA) [3] were developed, based on Glushkov's system of algorithmic algebra (SAA) [3, 4] and the term rewriting technique [4]. On the basis of this theory and methodology, the integrated toolkit for designing and synthesis of programs (IDS) was developed [3, 4].
In this paper, we propose a further development of our algebra-algorithmic methodology and tools in the direction of formalized and automated design of OpenCL programs. A method and a software tool for semi-automatic parallelization of loop operators are described. The approach is illustrated with an example of designing a parallel interpolation algorithm, which is part of a numerical weather forecasting program. The results of an experiment consisting in executing the generated OpenCL program on a GPU are given.

2 Algebra-Algorithmic Software Design Tools

The proposed approach to designing parallel programs is based on Glushkov's system of algorithmic algebra [3, 4], intended for the formal representation of algorithmic knowledge in the form of high-level specifications. SAA is the two-sorted algebra GA = ⟨{Pr, Op}; Ω_GA⟩, where Pr and Op are the sets of logic conditions (predicates) and operators defined on an information set, and Ω_GA is the signature consisting of logic operations (disjunction, conjunction, negation) and operator constructs, in particular:

– serial execution of operators: "operator 1"; "operator 2";
– branching: IF 'condition' THEN "operator 1" ELSE "operator 2" END IF;
– for loop: FOR (counter FROM start TO fin) "operator" END OF LOOP;
– asynchronous execution of n operators: PARALLEL(i = 0, ..., n − 1)("operator i");
– synchronizer, which delays the computation until the value of the specified condition is true: WAIT 'condition'.

Algorithms represented in SAA are called SAA schemes. Automated design of algorithm schemes and generation of the corresponding programs is provided by the IDS toolkit [3, 4]. The design process is represented by an algorithm tree: the user chooses SAA constructs from a list and adds them to the tree, and at each step the system offers only those operations whose insertion into the scheme does not break its syntactic correctness. The algorithm tree is then used for automatic generation of the SAA scheme text and of program code in one of the target languages (C, C++, Java). The mapping of SAA operators to programming-language text is defined by code templates stored in the IDS database. In [3], additional facilities for designing GPU programs on the CUDA platform were developed.

In this work, we introduce new SAA operations intended for the high-level design of OpenCL programs and propose a method for parallelization of nested loops in sequential programs.

OpenCL [1] combines an application programming interface (API) and a variant of the C language for programming and simultaneously using various parallel computing devices (for example, CPU, GPU, and Xeon Phi) in a heterogeneous environment. The main steps of developing a typical OpenCL program, and the corresponding new basic operators (enclosed in quotation marks) added to SAA, are the following; a minimal host-code sketch of this sequence is given after the list.

1. Obtaining the list of available platforms and saving it to a variable plms: "Get all available platforms (plms)".
2. Obtaining the list of devices dvs of the platform plm: "Get all devices (dvs) available on a platform (plm)".
3. Creating the computing environment for the devices: "Create a context (cnt) for devices (dvs)".
4. Creating the queue of commands for the device: "Create a command queue (cmdqueue) for a context (cnt) and a device (dv)".
5. Designing a kernel function, i.e. a task to execute on the devices. An example of an SAA scheme for a kernel is given in Sect. 3.
6. Compilation of a file containing an OpenCL kernel: "Create a kernel (krnl) from a source (file)", where krnl is the name of a kernel function and file is the path to the file with its source code.
7. Creating a buffer for data specified in a variable var: "Create a memory buffer (buff) for data (var) on devices in the context (cnt)".
8. Copying a buffer for a variable var to device memory: "Add commands to queue (cmdqueue) writing a buffer of data (var) from host to device".
9. Setting the value arg_value of the parameter with index arg_index for a kernel: "Set the argument value (arg_value) for a parameter (arg_index) of a kernel (krnl)".
10. Adding a kernel to a command queue and executing it asynchronously: "Add a command to queue (cmdqueue) executing a kernel (krnl) (globalworksize) (localworksize) on a device", where globalworksize is the global work size (the total number of kernel instances, i.e. work-items [1]) and localworksize is the number of work-items in a local work-group.
11. Reading the data from a result buffer: "Add commands to queue (cmdqueue) reading from a buffer of data (var) from device to host".

The use of the above constructs is illustrated in Sect. 3.
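To make the mapping concrete, the following minimal C sketch shows how steps 1–11 are typically expressed with the OpenCL API. It is an illustration rather than IDS output: the inline kernel string, the names plm, dv, cnt, cmdqueue, krnl, buff and var, and the buffer size n are assumptions, and error handling is elided.

    /* Minimal sketch of host steps 1-11 for a single device and buffer;
     * names, the kernel string and the buffer size are illustrative. */
    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        cl_int err;
        /* 1. Get an available platform. */
        cl_platform_id plm;
        clGetPlatformIDs(1, &plm, NULL);
        /* 2. Get a device of the platform. */
        cl_device_id dv;
        clGetDeviceIDs(plm, CL_DEVICE_TYPE_DEFAULT, 1, &dv, NULL);
        /* 3. Create a context for the device(s). */
        cl_context cnt = clCreateContext(NULL, 1, &dv, NULL, NULL, &err);
        /* 4. Create a command queue for the context and device. */
        cl_command_queue cmdqueue = clCreateCommandQueue(cnt, dv, 0, &err);
        /* 5-6. Build a kernel from source (an inline string stands in for the file). */
        const char *src =
            "__kernel void krnl(__global float *a) {"
            "  size_t id = get_global_id(0); a[id] = 2.0f * a[id]; }";
        cl_program prog = clCreateProgramWithSource(cnt, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dv, NULL, NULL, NULL);
        cl_kernel krnl = clCreateKernel(prog, "krnl", &err);
        /* 7. Create a memory buffer for the data. */
        size_t n = 1024;
        float *var = calloc(n, sizeof(float));
        cl_mem buff = clCreateBuffer(cnt, CL_MEM_READ_WRITE,
                                     n * sizeof(float), NULL, &err);
        /* 8. Write the host data to device memory. */
        clEnqueueWriteBuffer(cmdqueue, buff, CL_TRUE, 0, n * sizeof(float),
                             var, 0, NULL, NULL);
        /* 9. Set the kernel argument values. */
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &buff);
        /* 10. Enqueue the kernel for asynchronous execution. */
        size_t globalworksize = n;
        clEnqueueNDRangeKernel(cmdqueue, krnl, 1, NULL, &globalworksize,
                               NULL, 0, NULL, NULL);
        /* WAIT synchronizer: block until all enqueued commands complete. */
        clFinish(cmdqueue);
        /* 11. Read the result back from the device. */
        clEnqueueReadBuffer(cmdqueue, buff, CL_TRUE, 0, n * sizeof(float),
                            var, 0, NULL, NULL);
        printf("var[0] = %f\n", var[0]);
        clReleaseMemObject(buff); clReleaseKernel(krnl); clReleaseProgram(prog);
        clReleaseCommandQueue(cmdqueue); clReleaseContext(cnt); free(var);
        return 0;
    }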
In most computing problems, a large part of hardware resources is consumed by computations inside loops, so automatic parallelization of loop operators yields the greatest effect. Below we describe the developed method and a software framework named LoopRipper intended for semi-automatic parallelization of programs containing loop operators for heterogeneous platforms. The framework is based on the combined use of the rewriting rules system TermWare [4] and the IDS toolkit.

LoopRipper parallelizes an input compound loop of the following form:

"SEQUENTIAL LOOP" ====
FOR (i_0 FROM 0 TO #I_0)
  FOR (i_1 FROM 0 TO #I_1)
    ...
      FOR (i_N FROM 0 TO #I_N)                                    (1)
        "F(i, p_in(i), p_out(i))"
END OF LOOP,

where I_0, I_1, ..., I_N are the sets of values of the indices i_0, i_1, ..., i_N; N + 1 is the loop nesting depth; i = {i_j | j = 0, ..., N}; p_σ(i) ∈ P_σ = {p_j^σ(i) | j = 0, ..., #P_σ}, σ ∈ {in, out}, are the sets of input and output data variables; F is the function processing the data; and the iterations of the loop are independent.

Let T be the number of threads used in the execution of a kernel function. The loop obtained as a result of the parallelization has the form

"PARALLEL LOOP" ====
FOR (e FROM 0 TO L)
  "fill(inBuf)";
  "push(inBuf)";
  "kernel(e, inBuf, outBuf)";
  "pull(outBuf)";
  "unpack(outBuf)"
END OF LOOP,

where L is the number of kernel calls, chosen as the least value such that L · T ≥ G, with G being the total number of iterations of the initial loop; fill is the function filling the buffer of initial data inBuf; push moves the initial data buffer to the device memory; kernel is the call of a kernel function, whose body contains the call of the initial loop body "F(i, inBuf_id(i), outBuf_id(i))", where id is the thread number; pull moves the processed data from the device memory to the buffer of processed data outBuf; and unpack copies data from the buffer of processed data to the corresponding variables. A runnable stand-in for this transformation is sketched after Fig. 1.

The user of the LoopRipper framework (see Fig. 1) provides the source code or an SAA scheme of a sequential program, specifies the loop to be parallelized, and lists the input and output variables used in the loop. With the help of the TermWare system, the framework generates the kernel, fill and unpack functions and the additional data structures (buffers). The IDS toolkit replaces the sequential loop with the corresponding kernel call and synthesizes the whole parallel program from the above-mentioned functions and data structures.

Fig. 1. Parallelization of a program with the LoopRipper framework.
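To make the transformation concrete, the following self-contained C sketch mimics it on the host for a two-level nest, with a serial loop standing in for the OpenCL kernel. The sizes I0, I1 and T, the function F, and the buffer layout are illustrative assumptions, not LoopRipper's actual generated code.

    /* Runnable stand-in for the loop transformation: the nested index space
     * is flattened, split into L chunks of T "work-items", and each chunk
     * goes through the fill/push/kernel/pull/unpack sequence. */
    #include <stdio.h>
    #include <string.h>

    #define I0 5
    #define I1 7
    #define T  8                      /* work-items per kernel call */
    #define G  (I0 * I1)              /* total iterations of the initial nest */

    static double F(int i0, int i1, double in) { return in + i0 * 10 + i1; }

    static double data_in[G], data_out[G];

    int main(void) {
        for (int g = 0; g < G; g++) data_in[g] = g;

        double inBuf[T], outBuf[T];
        int L = (G + T - 1) / T;      /* least L such that L * T >= G */
        for (int e = 0; e < L; e++) {
            int base = e * T;
            int cnt = (base + T <= G) ? T : G - base;   /* last chunk may be short */
            /* fill: gather the e-th slice of inputs into the buffer */
            memcpy(inBuf, data_in + base, cnt * sizeof(double));
            /* push: inBuf -> device memory (elided in this host-only stand-in) */
            /* kernel: each "work-item" id runs one instance of the loop body F */
            for (int id = 0; id < cnt; id++) {
                int g = base + id;              /* recover the original indices */
                int i0 = g / I1, i1 = g % I1;
                outBuf[id] = F(i0, i1, inBuf[id]);
            }
            /* pull: device memory -> outBuf (elided) */
            /* unpack: scatter the results back into the program variables */
            memcpy(data_out + base, outBuf, cnt * sizeof(double));
        }
        printf("data_out[G-1] = %g\n", data_out[G - 1]);
        return 0;
    }

Because each chunk reuses buffers of a fixed size T, the transformed program never needs to hold the whole data set on the accelerator at once; this is what allows LoopRipper to process amounts of data exceeding the GPU memory size (cf. Sect. 4).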
3 Case Study

In this section, we illustrate the use of the considered algebra-algorithmic tools with an example of developing an OpenCL interpolation program used in numerical weather forecasting [5]. The program interpolates meteorological values defined on a macroscale grid to a mesoscale grid.

The initial sequential interpolation scheme consists of four nested loops of the form (1) with iterators h ∈ [0 ... Pk], k ∈ [0 ... Lmz], j ∈ [0 ... Mmz], i ∈ [0 ... Nmz], where Pk, Lmz, Mmz, Nmz are input integer parameters defining the number of points of the finite-difference grid along altitude, longitude and latitude. The iterations of the loops are independent and can be executed in parallel. The SAA scheme of the OpenCL kernel, obtained by transforming the sequential interpolation algorithm with the LoopRipper framework, is given below. The computation is parallelized over the indices h, k, j; the main loop of the interpolation iterates over the index i. In the scheme, US, VS, TS, HS, QS are input arrays with meteorological values (wind velocity, temperature, humidity, etc.); Qc is the output array. The index for processing array elements is computed and stored in the variable ind.

"interpolation_kernel(US, VS, HS, QS, TS, F_x, Zmz, Qc, Pk, Lmz, Mmz, Nmz)" ====
(h := "Get the global work-item identifier for dimension (0)");
(k := "Get the global work-item identifier for dimension (1)");
(j := "Get the global work-item identifier for dimension (2)");
IF NOT((h >= Pk) OR (k >= Lmz) OR (j >= Mmz))
THEN
  FOR (i FROM 0 TO Nmz)
  LOOP
    "(ind) := (h + Pk * k + Pk * Lmz * j + Pk * Lmz * Mmz * i)";
    "(a) := ((WZZ + US[ind] / 0.321) * Rs * VS[ind])";
    "(Tp) := (TS[ind] * pow(1000.0 / HS[ind], 2.0 / 7.0))";
    "(Tv) := (Tp * (1.0 + 0.6078 * QS[ind]))";
    "(Qc[ind]) := (a - (0.5 * Tv + (1.0 - Zmz[k]) * g * F_x[j + i * Mmz] / 0.321))"
  END OF LOOP
END IF

The SAA scheme of the host code of the interpolation program uses the operators described in Sect. 2 and is the following:

"interpolation_host" ====
"Read input parameters (Pk, Lmz, Mmz, Nmz) from the command line";
"Get all available platforms (plms)";
"Get all devices (dvs) available on a platform (plms[0])";
"Create a context (cnt) for devices (dvs)";
"Create a command queue (cmdqueue) for a context (cnt) and a device (dvs[0])";
"Create a kernel ("interpolation") from a source ("kernels/interpolation.cl")";
"Declare and fill continuous arrays for data (US, VS, HS, QS, TS, Qc, F_X, Zmz)";
"Create memory buffers for data (US, VS, HS, QS, TS, Qc, F_X, Zmz) on devices in the context (cnt)";
"Add commands to queue (cmdqueue) writing buffers of data (US, VS, HS, QS, TS, F_X, Zmz) from host to device";
"Set the argument values for parameters of a kernel (krnl)";
"Add a command to queue (cmdqueue) executing a kernel (krnl) (3) ({Pk, Lmz, Mmz}) on a device";
WAIT 'All commands in (cmdqueue) were issued to device and completed';
"Add commands to queue (cmdqueue) reading from a buffer of data (Qc) from device to host"

The IDS toolkit automatically translated the above SAA schemes into OpenCL code; a plausible rendering of the generated kernel is sketched below.
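For illustration, the following is a plausible OpenCL C rendering of the kernel scheme above. It is a sketch of what the generated kernel may look like, not the toolkit's verified output; in particular, the values of the physical constants WZZ, Rs and g do not appear in the scheme and are assumed here.

    /* Physical constants: not given in the scheme; values assumed
     * for illustration only. */
    #define WZZ 0.0f
    #define Rs  287.05f   /* gas constant, assumed */
    #define g   9.81f     /* gravitational acceleration */

    __kernel void interpolation(__global const float *US, __global const float *VS,
                                __global const float *HS, __global const float *QS,
                                __global const float *TS, __global const float *F_x,
                                __global const float *Zmz, __global float *Qc,
                                const int Pk, const int Lmz,
                                const int Mmz, const int Nmz) {
        /* One work-item per (h, k, j) triple; the loop over i stays inside. */
        int h = get_global_id(0);
        int k = get_global_id(1);
        int j = get_global_id(2);
        if (h >= Pk || k >= Lmz || j >= Mmz) return;
        for (int i = 0; i < Nmz; i++) {
            /* Linearized index into the four-dimensional data arrays. */
            int ind = h + Pk * k + Pk * Lmz * j + Pk * Lmz * Mmz * i;
            float a  = (WZZ + US[ind] / 0.321f) * Rs * VS[ind];
            float Tp = TS[ind] * pow(1000.0f / HS[ind], 2.0f / 7.0f);
            float Tv = Tp * (1.0f + 0.6078f * QS[ind]);
            Qc[ind] = a - (0.5f * Tv
                      + (1.0f - Zmz[k]) * g * F_x[j + i * Mmz] / 0.321f);
        }
    }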
The obtained program was executed in two computing environments. The first environment consisted of an Intel Core i5-4210U CPU (2 cores) and an Nvidia GeForce 840M GPU (384 cores); the second contained an Intel Core i3-3110M CPU (2 cores) with an integrated Intel HD Graphics 4000 GPU (128 cores). Fig. 2(a) shows the multiprocessor speedup S_p = T_s / T_p obtained in the first environment, where T_s is the execution time of the sequential program on the CPU and T_p is the execution time of the parallel program (OpenCL and its previous version implemented in CUDA [6]) on the GPU. The previous version used multi-dimensional arrays for storing meteorological data instead of one-dimensional arrays. As can be seen, the OpenCL program significantly outperforms the CUDA version, which is explained by the use of array linearization. Fig. 2(b) shows the dependency of CPU and GPU execution time on the percentage of data passed to the CPU in the second environment. As the diagram shows, executing all computations on the CPU takes approximately 7 times longer than executing them on the GPU only; however, when the computations are performed simultaneously on the CPU and GPU, an improvement of about 7–10% over GPU-only execution is achieved.

Fig. 2. The results of executing the parallel interpolation program: (a) the dependency of the multiprocessor speedup on the size of the input data; (b) the dependency of execution time on the percentage of data passed to the CPU.

4 Related Work

The proposed approach is related to works on the synthesis of programs from specifications [7] and the automated generation of OpenCL programs [8–12]. In particular, paper [8] presents an approach and a tool named Gaspard2, based on model-driven engineering, to specify, design and generate OpenCL applications. In [9], a programming tool called STEPOCL is proposed, along with a new domain-specific language designed to simplify the development of applications for multiple accelerators. Paper [10] presents an automatic generator based on a C++ domain-specific embedded language for producing OpenCL kernels from high-level specifications provided by the user. In [11], an approach to the generation of OpenCL code based on high-level functional expressions and rewriting rules is proposed. Par4All [12] is an automatic parallelizing and optimizing compiler for C and Fortran which generates OpenMP, CUDA and OpenCL source code.

The main difference of our approach from the mentioned works is that it uses algebraic specifications based on Glushkov's algebra of algorithms. The specifications are represented in a natural linguistic form, which simplifies the understanding of algorithms and facilitates achieving the demanded software quality. Another advantage of our tools is the method of automated design of syntactically correct algorithm specifications [3, 4], which eliminates syntax errors during the construction of algorithm schemes. Unlike the above-mentioned Par4All, our parallelization framework LoopRipper allows processing amounts of data that exceed the GPU memory size and using heterogeneous computing clusters.

5 Conclusion

This paper proposes an approach and tools for the automated design and generation of OpenCL programs based on algorithmic algebra. A method for semi-automatic parallelization of loop operators is developed. The distinguishing feature of the approach is the use of high-level algebra-algorithmic program specifications, which are represented in a natural linguistic form.
The developed tools support the construction of algorithm schemes by superposition of predefined language constructs of Glushkov's algebra, which are treated as reusable components, and automatically translate the specifications into source code in a target programming language. The approach is illustrated by developing an OpenCL interpolation program used in numerical weather forecasting.

References

1. OpenCL Overview. The open standard for parallel programming of heterogeneous systems, https://www.khronos.org/opencl, last accessed 2019/02/15.
2. Nvidia CUDA technology, http://www.nvidia.com/cuda, last accessed 2019/02/15.
3. Andon, P.I., Doroshenko, A.Yu., Zhereb, K.A., Yatsenko, O.A.: Algebra-algorithmic models and methods of parallel programming. Akademperiodyka, Kyiv (2018)
4. Doroshenko, A., Zhereb, K., Yatsenko, O.: Developing and optimizing parallel programs with algebra-algorithmic and term rewriting tools. In: Ermolayev, V., Mayr, H.C., Nikitchenko, M., Spivakovsky, A., Zholtkevych, G. (eds.) ICTERI 2013. CCIS, vol. 412, pp. 70–92. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-03998-5_5
5. Prusov, V., Doroshenko, A.: Computational techniques for modeling atmospheric processes. IGI Global, Hershey (2018). https://doi.org/10.4018/978-1-5225-2636-0
6. Doroshenko, A.Yu., Yatsenko, O.A., Beketov, O.G.: Algorithm for automatic loop parallelization for graphics processing units. Problems in Programming, (4), 28–36 (2017) (in Ukrainian)
7. Gulwani, S.: Dimensions in program synthesis. In: 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, pp. 13–24. ACM, New York (2010)
8. Rodrigues, A., Guyomarc'h, F., Dekeyser, J.L.: An MDE approach for automatic code generation from UML/MARTE to OpenCL. Computing in Science and Engineering, 15(1), 46–55 (2012)
9. Li, P., Brunet, E., Trahay, F., Parrot, C., Thomas, G., Namyst, R.: Automatic OpenCL code generation for multi-device heterogeneous architectures. In: 44th International Conference on Parallel Processing (ICPP 2015), pp. 959–968. IEEE, Piscataway, NJ (2015)
10. Tillet, P., Rupp, K., Selberherr, S.: An automatic OpenCL compute kernel generator for basic linear algebra operations. In: 2012 Symposium on High Performance Computing (HPC'12), pp. 4:1–4:2. Society for Computer Simulation International, San Diego, CA (2012)
11. Steuwer, M., Fensch, C., Lindley, S., Dubach, C.: Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code. In: 20th ACM SIGPLAN International Conference on Functional Programming (ICFP'15), ACM SIGPLAN Notices, vol. 50, pp. 205–217. ACM, New York (2015)
12. PIPS: Automatic Parallelizer and Code Transformation Framework, http://pips4u.org, last accessed 2019/02/15.