=Paper=
{{Paper
|id=Vol-2543/rpaper18
|storemode=property
|title=Dynamic Data-dependence Analysis in SAPFOR
|pdfUrl=https://ceur-ws.org/Vol-2543/rpaper18.pdf
|volume=Vol-2543
|authors=Nikita Kataev,Alexander Smirnov,Andrey Zhukov
|dblpUrl=https://dblp.org/rec/conf/ssi/KataevSZ19
}}
==Dynamic Data-dependence Analysis in SAPFOR==
Nikita Kataev1[0000-0002-7603-4026], Alexander Smirnov1[0000-0002-2971-4248], Andrey Zhukov2[0000-0001-9018-2941]

1 Keldysh Institute of Applied Mathematics RAS, Miusskaya sq., 4, 125047, Moscow, Russia
2 Lomonosov Moscow State University, GSP-1, Leninskie Gory, 119991, Moscow, Russia

kaniandr@gmail.com

Abstract. The possibilities of static program analysis are often insufficient for the investigation of real-world applications. A complicated control-flow graph and complex memory access patterns force conservative assumptions about loop-carried data dependencies. To support decisions about the possibility of parallelizing loops, a dynamic data-dependence analysis has been implemented in SAPFOR. The analysis is based on instrumentation of the LLVM representation of the analyzed programs and can be performed for applications written in the C and Fortran languages. Static analysis is used to reduce the number of instrumented memory accesses and to ignore scalar variables that can be explored statically. Selective analysis of program functions and loops is also supported. These features significantly reduce the execution-time overhead while maintaining the completeness of the analysis. The developed tool was evaluated on benchmarks from the NAS Parallel Benchmarks (NPB) suite for the C and Fortran languages. In addition to the traditional kinds of data dependencies (flow, anti, output), the dynamic analysis determines privatizable variables and the possibility of loop pipelining. Together with the capabilities of DVM and OpenMP, this greatly facilitates program parallelization and simplifies the insertion of the appropriate compiler directives.

Keywords: Program Analysis, Dynamic Analysis, Semi-automatic Parallelization, SAPFOR, DVM, LLVM.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===

The main goal of SAPFOR [1, 2] (System FOR Automated Parallelization) is to reduce the complexity of parallel programming. Programming for diverse architectures leads to a complicated combined use of various parallel programming models (MPI, SHMEM, OpenMP, CUDA, OpenACC, OpenCL). SAPFOR uses the DVMH languages (Fortran-DVMH and C-DVMH), included in the DVM system, as its target programming languages [3, 4]. These languages provide an interface that hides features specific to particular parallel programming models and facilitates both manual and semi-automatic development of parallel programs.

The SAPFOR project combines three main lines of research: program analysis, automatic parallelization of "well-formed" sequential programs, and semi-automatic transformation of a program to obtain its well-formed version. Each of these lines, both individually and as a whole, assists in the development of parallel programs, and all three rely on the investigation of program properties. Thus, program analysis is an essential part of any parallelization process, not necessarily an automatic one.

SAPFOR provides both static and dynamic analysis of programs. Neither kind of analysis on its own is sufficient to investigate parallelization opportunities. On the one hand, the static approach is in many cases too conservative: to preserve program correctness, it reports dependencies that are actually absent.
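As a simple illustration (ours, not taken from the paper; the function and array names are hypothetical), a loop with an indirect access pattern forces a static analyzer to assume a loop-carried dependence unless it can prove that the index array never repeats a value, whereas a dynamic analyzer simply observes the addresses actually accessed at runtime:

<pre>
// Whether two iterations conflict depends on the runtime contents of ind[],
// so a static analyzer must conservatively assume a loop-carried dependence.
void scale(double *a, const int *ind, int n) {
  for (int i = 0; i < n; ++i)
    a[ind[i]] *= 2.0;  // a dependence exists only if ind[] repeats a value
}
</pre>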
The authors of [5] explore the possibilities of automatic parallelization with the Intel and PGI compilers on the OmpSCR [6] set of scientific kernels. They identify at least two main problems that prevent static analysis of these benchmarks: the use of pointers and a complex control-flow graph. In both cases, compilers are forced to make conservative decisions about dependencies in the program. On the other hand, dynamic analysis, firstly, analyzes the program only for a certain set of input data and, secondly, is resource intensive, that is, it may incur major time and memory overheads.

In addition to establishing whether a data dependence exists, it is important to determine the possible ways to eliminate it. For example, anti and output dependencies can be eliminated by data privatization, i.e. by creating a local copy of the data for each process or thread. The corresponding static analysis of scalar variables, based on data-flow analysis, is implemented in SAPFOR [7]. However, array privatization requires exploring the sets of elements accessed at each loop iteration. Pointer-based and indirect accesses, the dependence of subscript expressions on function parameters, and other values identifiable only at runtime make it impossible to determine privatizable arrays statically. Moreover, the elimination of such dependencies is one of the key transformations required to parallelize benchmarks from the NAS Parallel Benchmarks (NPB) [8, 9]. Another source of parallelism is flow loop-carried dependencies whose distance is bounded by a constant: loops with such dependencies can be executed in a pipelined fashion. The DVM system provides the across directive to mark such loops, and its specification requires the dependence distances to be indicated. Both patterns are illustrated in the sketch below. To overcome these analysis issues, a dynamic data-dependence analysis was implemented in the SAPFOR system.
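Two illustrative loops (our sketch of the patterns just described, not code from the paper): in the first, the scalar t causes anti and output dependencies that privatization removes; in the second, the flow dependence has a constant distance of 1, so the loop admits pipelined execution:

<pre>
// Anti/output dependencies on t disappear if every thread gets a private copy.
void privatizable(double *a, const double *b, int n) {
  double t;            // candidate for privatization
  for (int i = 0; i < n; ++i) {
    t = b[i] * b[i];   // written on every iteration: output dependence
    a[i] = t + 1.0;    // the read of t never crosses an iteration boundary
  }
}

// Flow dependence with constant distance 1: iteration i reads what iteration
// i-1 wrote. The loop cannot run fully in parallel but can be pipelined.
void pipelined(double *a, int n) {
  for (int i = 1; i < n; ++i)
    a[i] += a[i - 1];
}
</pre>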
===2 Existing Approaches to Dynamic Analysis===

Dynamic analysis has two main shortcomings: it depends on the completeness of the input data, and it has high memory and runtime overheads. The representativeness of the input significantly affects the reliability of dynamic analysis; poorly selected data sets can lead to the omission of existing dependencies. In this sense, dynamic analysis is possible only under the close control of the user who parallelizes a program, and it is well suited to semi-automatic parallelization systems such as SAPFOR.

Two approaches are widely used to obtain program properties at runtime: sampling profilers evaluate program execution at regular time intervals, while instrumentation-based analysis studies the program behavior at predefined points. The instrumentation technique inserts calls to an external library into the program code to collect the necessary information. Sampling profilers are less demanding on consumed resources than instrumentation-based analysis, but they may suffer from loss of information, so they are not widely used for data-dependence analysis.

Instrumentation is the most suitable approach for determining data dependencies. In turn, calls may be inserted into the source code, into some intermediate representation, or into compiled executables (binary instrumentation). Instrumentation at the source level requires a separate implementation of tools for each supported programming language (in the case of SAPFOR, Fortran and C), since all possible syntactic constructions of the supported languages must be processed separately. In addition, preliminary program transformation may reduce overhead costs in certain cases; such transformations are convenient to perform on an intermediate representation of the program which is common to different languages (for example, LLVM IR). Binary instrumentation is the most effective in terms of coverage of the executable program: it is applicable even in the absence of source code and thus allows analysis of already compiled modules for which other instrumentation techniques cannot be used. Its disadvantage is that complete source-level debug information is required to relate the collected information to the source code of the analyzed program, and this metadata can be lost during program compilation.

The SAPFOR system uses the LLVM intermediate representation (IR), which is common to various programming languages, to perform both static and dynamic program analysis [10]. LLVM supports additional analyses and transformations that enable selective instrumentation of a program, many of which are already implemented in LLVM. The metadata available in LLVM IR can be further extended to fully describe the source program. To achieve this goal, SAPFOR establishes a correspondence between certain constructs of the source program, represented in the form of an abstract syntax tree (AST), and LLVM IR constructs (loops, variables, functions and their calls). LLVM IR exists simultaneously in three forms: a structure of C++11 language classes, a binary representation, and a textual representation. The human-readable representation provides natural opportunities to debug and visualize transformations. The disadvantage is that pre-compiled sections of the program cannot be instrumented and must be analyzed conservatively.

One of the most widely used tools for dynamic data-dependence analysis is Intel Advisor, which is a part of Intel Parallel Studio [11]. This tool determines three types of dependencies: flow (RAW), anti (WAR), and output (WAW). A more detailed classification cannot be performed: the tool reveals neither variables that can be privatized nor dependence distances. Binary instrumentation is used for the analysis; therefore, to correctly correlate the obtained results with the source code, it is recommended to disable optimizations. To start the analysis of data dependencies, a survey report must first be obtained; the loops presented in this report can then be marked for further analysis. A sampling profiler is used to obtain the survey report, so the number of detected loops depends on the size of the sampling interval. Even with the minimum interval not all loops are recognized: for the LU program (class S) with the minimum interval, only a third of the loops are detected (about 60 loops out of 187). This is primarily because the class S input is a very small data set intended for testing, on which the program runs for a split second; nevertheless, this data set exercises the whole possible behavior of the program. Table 1 shows the Intel Advisor slowdown for the LU program (class S), as well as the number of loops available for the analysis with the minimum sampling interval.
Slowdown is given as the ratio of execution times for programs compiled with different optimization options (-O0 and -O3) and with data-dependence analysis enabled or disabled. Note that the recommended analysis mode requires the -O0 option, to ensure that the unoptimized program is analyzed and that the reported information corresponds exactly to source code objects. In this case the slowdown is about 4160 times, even though only a third of all program loops have been analyzed. The test was executed on an Intel Xeon CPU E5-1660 v2, 3.70 GHz, with Turbo Boost disabled. Intel Parallel Studio 2019, based on the GCC 7.4 system libraries, was used to compile and analyze the programs.

Table 1. Intel Advisor 2019 slowdown in LU (NPB) benchmark analysis (class S): slowdown / number of analyzed loops.

 Baseline                   -O0, analysis enabled    -O3, analysis enabled
 -O0, analysis disabled     850 times / 62 loops     433 times / 27 loops
 -O3, analysis disabled     4160 times / 62 loops    2384 times / 27 loops

The basic idea of the dynamic analysis algorithm implemented in SAPFOR is similar to the pairwise method described in [12]. The authors of [12] propose improvements to this algorithm to reduce memory and time overhead, but we were not able to access the tool they developed. They rely on binary instrumentation, although they mention the possibility of using LLVM. In addition, their dynamic analysis was performed only on the most resource-intensive loops; therefore, it is rather difficult to estimate the real costs of analyzing an entire program (the total number of loops in the analyzed programs is not given). Some small loops with low execution time could be omitted when parallelizing for shared-memory systems. However, SAPFOR needs to make global decisions about the distribution of data and computations for a heterogeneous computational cluster. Therefore, it is impossible to exclude any loops from the analysis, since they may access distributed data (including through indirect accesses which cannot be tracked statically).

===3 Dynamic Analysis Algorithm Description===

The dynamic analysis in SAPFOR should analyze all loops and detect, for each loop, the type of data dependence and the possibility of its elimination. To determine whether pipelined execution of a loop is possible, the dependence distance (maximum and minimum) should be determined. It is also necessary to identify whether a dependence is spurious and whether data privatization can eliminate it. Since it is not necessary to discover the specific statements that produce data dependencies, a list of all memory accesses is not required. Instead, it is sufficient to store a flag which indicates that a memory location with a specified address has been accessed, together with some additional information needed to identify the properties of interest (a possible record layout is sketched below). This means that simple processing of the accumulated data is performed on each memory access. This approach reduces the memory overhead if multiple accesses to the same address occur at each loop iteration. The analysis time changes in a less obvious way: on the one hand, data processing at the moment of each memory access slows down the execution; on the other hand, when the control flow exits a loop, the amount of data to be processed decreases. Hence, the analysis time greatly depends on the structure of the analyzed program.
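A minimal sketch of the per-address record implied by this scheme (our illustration; the field names are hypothetical and the actual SAPFOR structures may differ). Instead of a full access trace, each address maps to a handful of iteration numbers that are updated on every access:

<pre>
#include <cstdint>
#include <unordered_map>

// Per-address summary kept for the innermost active loop. Storing the first
// and last read/write iterations is enough both to detect a dependence and
// to compute its minimum and maximum distance.
struct AccessInfo {
  std::int64_t FirstRead = -1, LastRead = -1;    // -1 means "never happened"
  std::int64_t FirstWrite = -1, LastWrite = -1;
};

// One such table per execution context (loop or function invocation).
using AccessMap = std::unordered_map<const void *, AccessInfo>;
</pre>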
For each nested loop there is a separate data structure which describes the memory accesses in this loop. On each memory access, the description is updated for the innermost active loop. After the loop exits, the accumulated information is used to update the corresponding data structure of the outer loop, and the dependencies found for the inner loop are stored in the global storage.

The dynamic analyzer uses two main data structures: a global storage of analysis results and a stack of execution contexts. The global storage contains a list of all memory accesses in each loop and a description of all dependencies found for the loop. The description comprises the dependence kind (flow, anti, output), the dependence distance (maximum and minimum), and a flag which indicates the possibility of data privatization. An execution context is a set of memory addresses together with the additional information necessary to identify the properties of interest. Upon entering a loop or function, a new empty context is created and pushed onto the stack. The context stores the number of the current iteration of the corresponding loop; in the case of a function, the iteration number does not matter and has an undefined value. At the beginning of each loop iteration a special function from the analyzer runtime library is called, which updates the iteration number in the corresponding context. On each memory access, the dynamic analyzer looks for a description of the accessed address; if there is none, a new one is created in the top context. For example, on a read, the description of the accessed address in the top context records that the last read occurred at the current iteration. If the description indicates a write access at another iteration, the fact of a flow dependence is established and the distance between the iterations is calculated. To calculate the distance, it is necessary to store not only the iteration number of the last read/write but also the iteration number of the first read/write.

When a loop or function exits, the top context is popped from the stack. For each address in the removed context that was read or written, the corresponding access is recorded in the new top of the stack. For a loop related to a removed context, information about its dependencies is added to the global storage. At program finalization, the information from the global storage is printed in the specified format.
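The sketch below (our reading of the algorithm; all names are hypothetical) shows how these pieces could fit together: onRead performs the flow-dependence check described above, and onExit merges the popped context into its parent:

<pre>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct AccessInfo {                                // as sketched in Section 3
  std::int64_t FirstRead = -1, LastRead = -1;
  std::int64_t FirstWrite = -1, LastWrite = -1;
};
using AccessMap = std::unordered_map<const void *, AccessInfo>;

struct Context {
  std::int64_t Iteration = -1;                     // stays -1 for functions
  AccessMap Accesses;
};

static std::vector<Context> Stack;                 // stack of execution contexts

void onEnter() { Stack.push_back(Context{}); }     // loop or function entry
void onIteration(std::int64_t I) { Stack.back().Iteration = I; }

// Stub: the real analyzer records the dependence in the global storage.
void reportFlowDep(std::int64_t MinDist, std::int64_t MaxDist) {}

void onRead(const void *Addr) {
  Context &Top = Stack.back();
  AccessInfo &AI = Top.Accesses[Addr];
  // A write on an earlier iteration followed by this read is a flow
  // dependence; first/last write iterations bound the max/min distance.
  if (AI.LastWrite >= 0 && AI.LastWrite != Top.Iteration)
    reportFlowDep(Top.Iteration - AI.LastWrite, Top.Iteration - AI.FirstWrite);
  if (AI.FirstRead < 0) AI.FirstRead = Top.Iteration;
  AI.LastRead = Top.Iteration;
}
// onWrite(Addr) is symmetric: it checks for anti (read-then-write) and
// output (write-then-write) dependencies before updating First/LastWrite.

void onExit() {                                    // loop or function exit
  Context Done = std::move(Stack.back());
  Stack.pop_back();
  if (Stack.empty()) return;
  // Replay the finished context's accesses in the new top context, so that
  // enclosing loops observe them as accesses on their current iteration.
  for (const auto &P : Done.Accesses) {
    AccessInfo &Outer = Stack.back().Accesses[P.first];
    std::int64_t I = Stack.back().Iteration;
    if (P.second.LastRead >= 0) {
      if (Outer.FirstRead < 0) Outer.FirstRead = I;
      Outer.LastRead = I;
    }
    if (P.second.LastWrite >= 0) {
      if (Outer.FirstWrite < 0) Outer.FirstWrite = I;
      Outer.LastWrite = I;
    }
  }
  // Dependencies found for a finished loop are moved to the global storage.
}
</pre>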
===4 Implementation Details===

The runtime library of the dynamic analyzer is implemented in C++11, and the instrumentation is performed by a dedicated LLVM pass. This pass inserts calls to the runtime library functions into the LLVM IR. For each instrumented object, a description based on debug information is passed to the dynamic analyzer; this information is used to identify the source code object corresponding to the instrumented lower-level object. Instrumentation is supported for both Fortran and C/C++ programs. LLVM IR is common for diverse source languages, but the debug information may vary slightly. For example, structures of a special kind are used to describe Fortran COMMON blocks; therefore, for other languages, a high-level description of the instrumented data may be lost.

Dynamic analysis comprises the following sequence of steps:
1. Construction of the LLVM representation for each file that should be analyzed.
2. Linking of the LLVM bitcode files into a single LLVM bitcode file; at this step the LLVM tool llvm-link is used.
3. Instrumentation of the single file obtained in the previous step.
4. Further compilation of the instrumented file, which is linked with the rest of the analyzed project, as well as with the runtime library of the dynamic analyzer.

The output of the instrumented program is either a textual list of data dependencies in the analyzed loops or a description of the dependencies in JSON format. The resulting JSON file can be passed to the static analyzer of SAPFOR to refine the analysis results.

The instrumentation pass makes the following IR-level transformations:
- inserts declarations of the dynamic analyzer functions,
- inserts calls to these functions into the function bodies of the instrumented module,
- declares a global pool to store meta-information and to identify each instrumented object,
- creates internal functions to initialize the meta-information and to register types and global objects,
- inserts calls to the initialization functions at the beginning of the program entry point.

Calls to the runtime library functions allow the dynamic analyzer to handle memory accesses, function calls, the beginning of a whole loop, the beginning of each loop iteration, and exits from loops. Correspondence between actual and formal parameters is also established.
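As an illustration of the instrumentation step, the following minimal legacy-pass-manager pass is in the spirit of the description above (our sketch, not the SAPFOR pass; da_read and da_write are hypothetical runtime entry points, and metadata registration is omitted). It inserts a call reporting the accessed address before every load and store:

<pre>
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct DynInstr : public FunctionPass {
  static char ID;
  DynInstr() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    Module &M = *F.getParent();
    LLVMContext &Ctx = M.getContext();
    Type *VoidTy = Type::getVoidTy(Ctx);
    Type *PtrTy = Type::getInt8PtrTy(Ctx);
    // Declare the runtime entry points (hypothetical names).
    auto ReadFn = M.getOrInsertFunction("da_read", VoidTy, PtrTy);
    auto WriteFn = M.getOrInsertFunction("da_write", VoidTy, PtrTy);
    bool Changed = false;
    for (BasicBlock &BB : F)
      for (Instruction &I : BB) {
        // Report the accessed address just before each load and store.
        if (auto *LI = dyn_cast<LoadInst>(&I)) {
          IRBuilder<> B(LI);
          B.CreateCall(ReadFn,
                       B.CreatePointerCast(LI->getPointerOperand(), PtrTy));
          Changed = true;
        } else if (auto *SI = dyn_cast<StoreInst>(&I)) {
          IRBuilder<> B(SI);
          B.CreateCall(WriteFn,
                       B.CreatePointerCast(SI->getPointerOperand(), PtrTy));
          Changed = true;
        }
      }
    return Changed;
  }
};
} // namespace

char DynInstr::ID = 0;
static RegisterPass<DynInstr> X("dyn-da-instr", "Insert dynamic analysis calls");
</pre>

The actual pass additionally instruments loop entries, iteration starts, loop exits, and function calls, and registers debug metadata for every instrumented object, as described above.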
===5 Evaluation===

The implemented tool was tested on the Fortran [8] and C [9] versions of the NAS Parallel Benchmarks (NPB). The time overhead is shown in Fig. 1 and Fig. 2; the growth of consumed memory (in number of times) is shown in Fig. 3 and Fig. 4. The benchmarks were executed on an Intel Xeon CPU E5-1660 v2, 3.70 GHz, with Turbo Boost disabled. The Clang and Flang compilers, version 7.1.0, based on the GCC 7.4 system libraries, were used to obtain LLVM IR and to compile the programs. Compilation was performed with the -O3 option; in contrast to binary instrumentation, the use of this option is allowed, since the optimizations are applied after the instrumentation of the original code. Memory consumption and time overhead were also measured for Intel Advisor. In that case the dependence analysis was performed with the -O0 option to obtain reliable results; however, the slowdown is evaluated relative to programs compiled with the -O3 option.

Fig. 1. Slowdown (number of times in relation to the original sources) of the C version of the NAS Parallel Benchmarks with full (All) and partial (No scalar) instrumentation and its comparison with the time overhead of Intel Advisor.

To reduce memory consumption and time overhead, the static analysis tools implemented in SAPFOR were used [7]. Scalar variables which only have loads and stores as uses can in most cases be analyzed statically; in the presence of data dependencies, SAPFOR determines whether privatization is applicable to these variables. Our static analysis tool also reveals reduction variables in the source code. The dynamic analyzer can safely ignore accesses to such variables. Moreover, accesses to them can be optimized with LLVM tools: the corresponding low-level memory references can be promoted to register references. This optimization reduces the slowdown of the analyzed programs by up to 40% (in the case of the EP benchmark). Taking this optimization into account, the execution time increases on average up to 2000 times, yet together with static analysis a complete analysis of the entire program is provided. At the same time, the analysis of only a third of all loops performed by Intel Advisor slows down the program by an average of 5000 times.

Fig. 2. Slowdown (number of times in relation to the original sources) of the Fortran version of the NAS Parallel Benchmarks with full (All) and partial (No scalar) instrumentation and its comparison with the time overhead of Intel Advisor.

Fig. 3. Memory consumption (number of times in relation to the original sources) of the C version of the NAS Parallel Benchmarks with full (All) and partial (No scalar) instrumentation and its comparison with the memory overhead of Intel Advisor.

We also implemented special command line options which allow the user to select the functions to be analyzed. In this case, instrumentation is performed for the specified functions, as well as for the functions called from them. Such selective analysis is used to explore the properties needed to parallelize a specific region of code, and it significantly speeds up program analysis. For example, for the LU benchmark (class S), exploring one of the main loops leads to a slowdown of only 707 times. Dynamic analysis of this loop reveals a regular data dependence, which can be eliminated by pipelined execution of the loop.

Fig. 4. Memory consumption (number of times in relation to the original sources) of the Fortran version of the NAS Parallel Benchmarks with full (All) and partial (No scalar) instrumentation and its comparison with the memory overhead of Intel Advisor.

===6 Conclusions===

The paper describes a dynamic analysis tool, implemented in the SAPFOR system, designed to determine data dependencies. It can be used both to obtain analysis results for semi-automatic parallelization of programs in SAPFOR and to assist manual program parallelization. It determines the main types of data dependencies (flow, anti, output) and collects all variables accessed in program loops. It also recognizes variables which can be privatized. It is especially important to note that privatizable arrays are determined, since static approaches have limited capabilities for analyzing such arrays. Our approach also reveals the possibility of pipelined execution of loops; moreover, the computed dependence distances are sufficient to insert the across directive, which is part of the DVMH languages and enables pipelined execution of the marked loops. Preliminary static analysis of scalar variables reduces the overhead of dynamic analysis without losing the completeness of the results. Instrumentation of the LLVM representation, instead of the compiled executable (binary instrumentation), makes it possible to optimize the program after transformation without losing the accuracy of the analysis results. Future work involves further reduction of the runtime overhead, since parallelization for HPC systems with distributed memory requires analysis of the entire program. The source code of the SAPFOR system is available on GitHub [13].

This work was partially supported by Presidium RAS, program I.26 "Fundamentals of creating algorithms and software for advanced ultra-high performance computing".

===References===

1. Klinov, M.S., Kriukov, V.A.: Avtomaticheskoe rasparallelivanie Fortran-programm. Otobrazhenie na klaster. Vestnik Nizhegorodskogo universiteta im. N.I. Lobachevskogo, No. 2, pp. 128–134 (2009).
2. Bakhtin, V.A., Zhukova, O.F., Kataev, N.A., Kolganov, A.S., Kriukov, V.A., Podderiugina, N.V., Pritula, M.N., Savitskaia, O.A., Smirnov, A.A.: Avtomatizatsiia rasparallelivaniia programmnykh kompleksov. Nauchnyi servis v seti Internet: trudy XVIII Vserossiiskoi nauchnoi konferentsii (19–24 sentiabria 2016 g., g. Novorossiisk). M.: IPM im. M.V. Keldysha, pp. 76–85 (2016). https://doi.org/10.20948/abrau-2016-31
3. Konovalov, N.A., Krukov, V.A., Mikhajlov, S.N., Pogrebtsov, A.A.: Fortran DVM: a Language for Portable Parallel Program Development. Programming and Computer Software, vol. 21, No. 1, pp. 35–38 (1995).
4. Bakhtin, V.A., Klinov, M.S., Kriukov, V.A., Podderiugina, N.V., Pritula, M.N., Sazanov, Iu.L.: Rasshirenie DVM-modeli parallelnogo programmirovaniia dlia klasterov s geterogennymi uzlami. Vestnik Iuzhno-Uralskogo gosudarstvennogo universiteta, seriia "Matematicheskoe modelirovanie i programmirovanie", No. 18 (277), vypusk 12. Cheliabinsk: Izdatelskii tsentr IuUrGU, pp. 87–92 (2012).
5. Kim, M., Kim, H., Luk, C.K.: Prospector: A Dynamic Data-dependence Profiler to Help Parallel Programming. HotPar'10: Proceedings of the USENIX Workshop on Hot Topics in Parallelism (2010).
6. Dorta, A.J., Rodríguez, C., de Sande, F., González-Escribano, A.: The OpenMP Source Code Repository. Parallel, Distributed, and Network-Based Processing, Euromicro Conference (2005).
7. Kataev, N.A.: Application of the LLVM Compiler Infrastructure to the Program Analysis in SAPFOR. In: Voevodin, V., Sobolev, S. (eds.) Supercomputing. RuSCDays 2018. Communications in Computer and Information Science, vol. 965, pp. 487–499. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-05807-4_41
8. NAS Parallel Benchmarks. https://www.nas.nasa.gov/publications/npb.html, last accessed 05.12.2019.
9. Seo, S., Jo, G., Lee, J.: Performance Characterization of the NAS Parallel Benchmarks in OpenCL. 2011 IEEE International Symposium on Workload Characterization (IISWC), pp. 137–148 (2011).
10. Lattner, C., Adve, V.: LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. Proc. of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California (2004).
11. Intel Parallel Studio. https://software.intel.com/en-us/parallel-studio-xe, last accessed 05.12.2019.
12. Kim, M., Kim, H., Luk, C.K.: SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, IEEE (2010). https://doi.org/10.1109/MICRO.2010.49
13. SAPFOR. https://github.com/dvm-system, last accessed 05.12.2019.