Comparative Debugging of Parallel DVMH-programs Vladimir Bakhtin1,2[0000-0003-0343-3859], Dmitry Zakharov1[0000-0002-6319-5090] , Alexander Ermichev1,2[0000-0002-3678-2580] and Viktor Krukov1,2[0000-0001-6630-964X] 1 Keldysh Institute of Applied Mathematics, Miusskaya sq., 4, 125047, Moscow, Russia 2 Lomonosov Moscow State University, GSP-1, Leninskie Gory, 119991, Moscow, Russia dvm@keldysh.ru Abstract. DVM-system is designed for the development of parallel programs of scientific and technical calculations in C-DVMH and Fortran-DVMH lan- guages. These languages use a single parallel programming model (DVMH model) and are extensions of the standard C and Fortran languages with paral- lelism specifications, written in the form of directives to the compiler. The DVMH model makes it possible to create efficient parallel programs for hetero- geneous computing clusters, in the nodes of which accelerators (graphic proces- sors or Intel Xeon Phi coprocessors) can be used as computing devices along with universal multi-core processors. The article describes the method of de- bugging parallel programs in DVM-system, as well as new features of DVM- debugger. Keywords: Automation of Development of Parallel Programs, Automation of Debugging of Parallel Programs, DVM-system, Accelerator, GPU, Fortran, С. 1 Introduction Classic debugging approaches may not always be helpful when it comes to parallel programming. A step-by-step study of the executing process at certain breakpoints or debug prints are hardly applicable to real scientific and engineering parallel software systems designed for continuous execution for hours or even days. Parallel algorithms are usually much more complicated than sequential solutions of the same problems. Moreover, the parallel code may contain atypical for sequential debugging errors associated with the incorrect use of synchronization primitives and functions that provide parallelism. The following factors affect the complexity of debugging parallel programs:  the necessity to track the status of several (or even many) parallel pro- cesses/threads;  the difficulty of reproducing errors caused by non-deterministic execu- tion;  the debugging tools influence on the execution process (different execu- tion time of the operators, additional internal synchronizations). Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 72 To debug parallel programs, automated methods have been developed. Such meth- ods allow to find most errors in the program in automatic mode with minimal in- volvement of the programmer. One of these methods is the dynamic correctness con- trol. This method is used in many debugging tools for multithreaded programs, such as: Helgrind [1], DRD [2], Intel Parallel Inspector [3]. In the DVM-system, this meth- od is used to debug parallel programs made for heterogeneous clusters. Another automated debugging method is the comparative debugging of parallel programs. The main principle of this method is to compare the execution process of two programs, by controlling the values of variables at certain controlled points. Comparison can be carried out either between simultaneously running programs, or using trace files, which store all the necessary data about operations and variable val- ues at controlled points. This debugging method was implemented in the Guard [4] and Wizard [5] debuggers. The term "comparative debugging" was introduced in the articles describing these tools. Around the same time, this method was also imple- mented in the debugger of the DVM-system. In the Guard and Wizard debuggers, the controlled points at which the variable values are compared are setting by the user. In contrast, in the DVM-debugger, com- parison points are setting automatically. This article briefly describes the approaches for DVMH-programs debugging, pic- tures the problems that had arose during the use of DVM-system comparative debug- ging implementation, and suggests new methods to overcome these problems. 2 Approaches for Debugging Parallel DVMH programs To identify errors that lead to incorrect calculations, the DVM-system [6-8] contains various special tools for automating the debugging process. Such errors can be detect- ed by executing a DVMH-program in the dynamic correctness control mode and/or by starting computations on one or several processors in a comparing mode with the reference results obtained during its sequential execution. The use of automated debugging methods requires the instrumentation of a parallel program: insertion of special calls to the debugger that allow to control and process execution of a program. There are two approaches for such instrumentation: binary code instrumentation and source code instrumentation. Helgrind, DRD, Intel Parallel Inspector are based on binary instrumentation. But the use of high-level programming models (such as OpenMP [9]), can lead to certain difficulties with that approach. For example, instrumentation utility must restore the parallelism specifications from binary code of the program in order to be able to generate errors in the context of the source code of that program. To make this possible, the instrumentation utility must know the internal logic of the compiler. If both compiler and debugger are made by the same developer (e.g. Intel), then such recovery can be implemented. But when the compiler is developed by one company (e.g. Microsoft), and the debugger is by another (e.g. Intel), then recovering of parallelism specifications can be difficult. This problem can be avoided by using the source code instrumentation. 73 Debugging instrumentation of programs in DVM-system is performed by C- DVMH [6] and Fortran-DVMH [7] compilers, which add calls to DVM-debugger functions at the following points of the program:  beginning of a sequential or parallel cycle;  completion of a sequential or parallel cycle;  beginning of a new iteration of a cycle;  accessing a variable for reading;  before accessing a variable for writing;  after accessing a variable for writing. In addition, the debugger functions are also performed while executing calls to the Lib-DVMH library [8]. Dynamic correctness control of DVMH-directives is based on an analysis of the sequence of calls to Lib-DVMH functions and accesses to variables. Dynamic control is allowing to detect the following types of errors:  undeclared data dependency in a parallel loop;  incorrect use of private and reduction variables;  undeclared access to non-local elements of a distributed array;  incorrect work with shadow edges of reduction arrays modification of non-local elements of a distributed array in the sequential part of the pro- gram;  going out of bounds in distributed array;  writing data to the remote access buffer. It should be mentioned that not all errors can be determined by the dynamic cor- rectness control. For example, external procedures and functions without any source code cannot be analyzed. The comparative debugging method can be used in order to check such programs. This method will be discussed in detail in the next section. 3 Comparative Debugging of DVMH-programs The intermediate results comparison implemented in DVM-system allows to detect errors in parallel programs that arise due to incorrect DVMH instructions, as well as program errors that do not appear in sequential execution and were not detected by the dynamic control method. The general scheme of comparative debugging method is as follows: 1. Getting a reference trace (./dvm trc command in console interface). Trace contains: readings and modifications of variables, the beginning of each loop iteration, the beginning and end of parallel and sequential loops, the beginning of each parallel task, the beginning and end of the task group. The source for reference trace can be either the sequential execution of the same program or parallel execution (e.g. the trace recorded on the differ- ent computing cluster where the error does not occur); 2. Automatic comparison of program execution results with previously ac- cumulated reference trace (./dvm dif command). All integer numbers is compared strictly and real numbers are compared with a given absolute 74 and relative accuracy. Information on the all discrepancies is reported to user. MODE = () [] {, } = , (:, , ), … EL: () [] {, } = , (:, , ), … EL: END_HEADER Fig. 1. Trace header structure Reference trace file is created for each process and has a text format. This file starts with the special header that contains used trace parameters and hierarchy of loops and task groups (see Fig. 1). - reading variable: RD: [] = ; {, } - before modification of variable: BW: [] ; {, } - after modification of variable: AW: [] = ; {, } - reading reduction variable: RV_RD: [] = ; {, } - before modification of reduction variable: RV_BW: [] ; {, } - after modification of reduction variable: RV_AW: [] = ; {, } - final result of reduction operation: RV: [] ; {, } Fig. 2. Trace records format (variables) 75 - beginning of parallel loop: PL: () []; {, } - beginning of sequential loop: SL: () []; {, } - beginning of task group: TR: () []; {, } - next loop iteration of parallel task: IT: , (,< value of 2nd index variable>,…). - end of loop (parallel or sequential): EL: ; {, } - end of the local calculations block in the sequential part of the program: SKP: {, } Fig. 3. Trace records format (loops and parallel tasks) Global trace details level can be set by modifying the value of the MODE parame- ter in the trace header. Details level can also be specified for each cycle and task group of the DVMH-program by setting the corresponding mode in the line that de- scribes the desired structure in the header. The main body of the trace contains a sequence of specific records. These records determine the dynamic structure of the program: the sequence of statements during the concrete execution of the program. Fig. 2 and 3 provide a complete list of events that are recorded in the trace. #define L 3 int main(int an, char **as) { #pragma dvm array distribute[block] double A[L]; #pragma dvm parallel([i] on A[i]) for (int i = 0; i < L; i++) { if (i == 0 || i == L - 1) A[i] = 0; else A[i] = 2 + i; } return 0; } Fig. 4. C-DVMH example program (file “test.c”) 76 Figure 5 shows the recorded reference trace of the example program (code of which is shown in Fig. 4). This program consists of one parallel loop, which starts on the line 7 of the "test.c" file and initializes the elements of the distributed array A. As a result of this loop, the elements A[0] and A[2] are set to "0", and A[1] becomes "3". # Begin trace header. Don't modify these records TRACE_TIME = "Mon Apr 15 00:00:22 2019" ARCHITECTURE = "Machine x86_64" USER_HOST = "DVM-COREI7@DVM-COREI7" WORK_DIR = "/home/DVM/dvm_current/dvm_sys/demo" TASK_NAME = "test" MODE = FULL PL: 1() [1] {"test.c", 7} = #, (0:0,2,1) EL: 1 END_HEADER # End trace header PL: 1() [1]; {"test.c", 7}, 1.B IT: 0, (0) BW: [4] "A[i]"; {"test.c", 10} AW: [4] "A[i]" = 0; {"test.c", 10} IT: 1, (1) BW: [4] "A[i]"; {"test.c", 12} AW: [4] "A[i]" = 3; {"test.c", 12} IT: 2, (2) BW: [4] "A[i]"; {"test.c", 10} AW: [4] "A[i]" = 0; {"test.c", 10} EL: 1; {"test.c", 7}, 1.E END_TRACE Fig. 5. Reference trace of example program Comparative debugging proved to be effective on simple model problems, but came across two obstacles: resources and accuracy. Next section describes the es- sence of these problems. 4 Issues with the Current Comparative Debugger After instrumentation, each operator of debugged program is surrounded by several calls to debugging library, so it is not surprising that execution time can increase sig- nificantly. For example, in experiments with benchmark programs from the NAS NPB package [11], the following results were obtained:  the slowdown only from debugging instrumentation (i.e. the program was compiled for debugging, but executed without collecting/comparing the trace) was around 50–100 times;  trace file size (several dozens of bytes for each traced operator) turned out to be completely unacceptable for real programs. The average size of full trace for benchmark program was measured in terabytes, and the approx- 77 imate time for collecting the trace is ~28000 sec. (with an average runtime of the initial programs of ~5 sec., i.e., a ~4000 times deceleration). Therefore, full comparative debugging turned out to be applicable only on small "model" data, which are not always available. To overcome this problem, the follow- ing control and optimization tools were developed and implemented in the DVM- system debugger [12]:  selective tracing: only modifications, only distributed arrays, etc. (compi- lation options -d1...-d4);  local instrumentation, i.e. manual selection of program sections, that will be instrumented for debugging (directives DEBUG / END DEBUG);  preliminary estimation of the trace size by obtaining the trace header, and its manual correction to disable tracing of certain loops or iterations (command ./dvm size);  automatic selection of parallel loops iterations for tracing: “borders” and “corners” (compilation options -dbif1 and -dbif2);  generation of two cycle bodies: one without debugger calls, the other is instrumented; the instrumented loop body is used only on selected itera- tions, and the original program is executed on the rest;  recording the checksums of arrays at the end of the loop instead of tracing modifications of their elements in the body of the loop (parameter TraceOptions.CalcChecksums). Described improvements extend the possibility of applying comparative debugging to real applications. For example, when processing large amounts of data through loops, the most probable place for errors occurrence (going out of the array bounds, uninitialized variables) will be boundary iterations. Moreover, the usual structure of computational algorithms is causes that an error occurred at the internal iteration of the loop will propagate to the boundaries. For such tasks, selective tracing and com- paring of boundary iterations can be used (-dbif options), which allows (see Table 1):  significantly reduce the size of the trace (100-1000 times);  significantly reduce the execution time (10-1000 times);  preserve most coverage of program operators (over 99%). Table 1. Average trace size and execution time for NAS NPB benchmarks (class A) "Corners" "Corners" "Borders" "Borders" Full trace width = 1 width = 2 width = 1 width = 2 Average 100% 99,4% 99,8% 99,8% 99,8% coverage Average 6,57E+12 4,6E+07 9,4E+08 1,7E+10 6,0E+10 trace size, (~6 Tb) (~44 Mb) (~894 Mb) (~16 Gb) (~56 Gb) bytes 78 Average 27915 execution 15 19 105 287 (7,75 h.) time, sec But, despite all optimizations, use of comparative debugging for scientific and technical algorithms, especially on real data (rather than on artificially created small "model" tests) is still very restricted. Another problem detected during the practical usage of the debugging system was the “allowable” mismatch of the real variables values. Most notable case was with reduction variables. The calculation of the sum of the distributed array elements changes depending on the the parallel program configuration: firstly local sums are calculated, and then they are combined in non-deterministic order. Results of such computations are slightly different (usually in the 1-2 minor digits), but, from the user point of view, resulting values are be equally valid. Reduction operations create four problems for comparative debugging (“false alarms”):  the intermediate values of the reduction variable do not coincide at all, be- cause only partial sums (maximum values, etc.) are calculated;  the final value of variable, as mentioned above, may slightly differ (non- determinism in the weak sense);  the difference in one variable can immediately spread to many others (for example, if the norm of a vector is calculated and then the vector is nor- malized);  if the variable, that was influenced by the difference in the reduction re- sults, is used as the condition for ending iterations or choosing a branch of calculations, then comparing programs can diverge at that condition (non- determinism in the strong sense). These problems were solved by introducing a special reduction processing mode:  reduction variables are recognized (because they are described in DVMH- directives), and trace records of their reading and modification in the body of the loop are ignored by comparative debugger;  at the end of the parallel cycle, the final value of the reduction variable (RV trace record) is added into the trace for further comparison;  the values of the reduction variables are compared with some accuracy, which may be less than the accuracy of comparing the values of ordinary variables. This accuracy can be set by the programmer through a special configuration parameter;  during the comparison the actual calculated value of the reduction varia- ble (after successful comparison with the given accuracy) is replaced by its value from the “reference” trace to eliminate the potentially dangerous discrepancy. The described approach made it possible to suppress the “false alarms” associated with reduction variables. However, it was further discovered that this problem can appear not only on reduction computations. Executing a program on different ma- chines or using different compilation tools can leads to different results of any arith- 79 metic operation, such as multiplication or division (in 1–2 minor decimal digits of the mantissa) [13]. 5 New Approaches to Comparative Debugging Currently, a new version of the comparative debugging system is in development [14], aimed to overcome described problems. The new debugger will be based not on trace files, but on the exchange of debug information between two simultaneously running instances of the program: 1. program execution will be divided into computational blocks, the order of which is deterministic for any parallel system configuration; 2. during the execution of the next block, each instance of the debugged pro- gram will be accumulating needed trace records; 3. upon completion of the block, the accumulated traces will be sent to one process for comparison. The DVMH parallel model is making it possible to split any program into a deter- ministic sequence of blocks split by the boundaries of parallel loops. Proposed extension will support all previously implemented optimizations of the trace size (array checksums and boundary iterations), since all changes made by these optimizations affect the trace locally, within a specific parallel cycle, and therefore, within one specific block of the new trace. Described approach makes comparative debugging more flexible, allowing the simultaneous launch of the reference and debugged programs on one multiprocessor machine and remote debugging – comparing the execution of the reference program on one machine with an experimental version running on another. Splitting a parallel DVMH program into deterministic blocks has one more ad- vantage – the values of all variables at the block boundary should be identical regard- less of the number of processes and their configuration. Therefore, the block bounda- ries can be used as controlled points for comparative debugging, instead of tracing each operation of reading and modifying variables. In this case, all calculations within a certain block will be carried out without additional costs for collecting the trace, and values of all variables that was read and/or modified during its execution will be col- lected and compared at the end of block. The current implementation of the debugger already includes a special mode based on a similar principle: all calculations intended for execution on graphics accelerator can be also simultaneously performed on the central processor and then compared. In case of discrepancies, the needed information is reported to the user. After that, only results computed on central processor is used for further computations. Enabling and using this comparative debugging mode does not require the user to make any changes to the program, or even re-compile it. All that is needed is to set the value of the environment variable DVMH_COMPARE_DEBUG to 1, or use the ./dvm cmph command to start the execution of the program. Also, to address the problem of discrepancy between the results of operations with real numbers, the new version of the debugger includes a mode that extends already 80 described approach of correcting reduction variables to all results of real operations (configuration parameter TraceOptions.SubstAllResults). After successfully compar- ing a real variable with a reference value from trace, this reference will be substituted into the running program and used for further calculations. This correction approach excludes the possibility of applying the trace size optimi- zations described above, since both the array checksums and the boundary iterations do not cover a significant part of the modifications of variables inside of parallel cy- cles. Therefore, in order to ensure the possibility of debugging real scientific and technical algorithms with this mode, it is recommended to use it in conjunction with the comparison of simultaneously running programs. 6 Conclusion DVM-system was designed to automate the process of developing parallel programs. Parallel programs with DVMH-directives can be efficiently (and without any addi- tional changes) executed on clusters of various architectures including multi-core universal processors, graphics accelerators and Intel Xeon Phi coprocessors. This is achieved through various optimizations that are performed both statically (during compilation), and dynamically. An important advantage of the DVM-system is the powerful tools for debugging developed DVMH-programs. These tools include the dynamic correctness control and the comparative debugging. Various instrumentation options were implemented for optimizing the process of debugging. This article presented the problems that arose during the practical use of compara- tive debugging implemented in DVM-system, and suggested ways to overcome these problems. Currently, a new version of the comparative debugging system is being developed, in which both the reference and the debugged program are executed simultaneously and the comparison of the calculation results is carried out directly in the process of execution. The new version of the debugging system will help to solve described problems with accuracy of computations, make the debugging process more flexible and less demanding to the memory of the computing system. References 1. Helgrind: a thread error detector, http://www.valgrind.org/docs/manual/hg-manual.html, last accessed 2019/11/21. 2. DRD: a thread error detector, http://www.valgrind.org/docs/manual/drd-manual.html, last accessed 2019/11/21. 3. Intel Inspector. Memory and Thread Debugger, https://software.intel.com/en-us/intel- inspector, last accessed 2019/11/21. 4. Guard Parallel Relative Debugger, http://sourceforge.net/projects/guardsoft/, last accessed 2019/11/21. 81 5. Abramson, D.A., Sosic, R.: Relative Debugging using Multiple Program Versions. In: In- tensional Programming I. Sydney: World Scientific (1995). 6. C-DVMH language, C-DVMH compiler, compilation, execution and debugging of DVMH programs, http://dvm-system.org/static_data/docs/CDVMH-reference-en.pdf, last accessed 2019/11/21. 7. Fortran DVMH langauge, Fortran DVMH compiler, compilation, execution and debugging of DVMH programs, http://dvm-system.org/static_data/docs/FDVMH-user-guide-en.pdf, last accessed 2019/11/21. 8. Lib-DVM library, http://www.keldysh.ru/dvm/dvmhtm1107/eng/sys/libdvm/rtsDDe0.html 9. OpenMP Application Programming Interface. Version 5.0. November, 2018, https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf, last ac- cessed 2019/11/21. 10. The OpenACC Application Programming Interface. Version 2.6. November, 2017, https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final.pdf, last ac- cessed 2019/11/21. 11. NAS Parallel Benchmarks, http://www.nas.nasa.gov/publications/npb.html 12. Krukov, V., Kudryavtsev, М.: Automated debugging of parallel programs. Vychisl. Meto- dy Programm., 7 (4), 102–110 (2006). 13. Monniaux, D.: The pitfalls of verifying floating-point computations. ACM Transactions on Programming Languages and Systems (TOPLAS), ACM, 30 (3), pp. 12 (2008). 14. Ermichev, A., Krjukov, V.: Razvitie metoda sravnitel'noj otladki DVMH-programm. In: Nauchnyj servis v seti Internet: trudy XIX Vserossijskoj nauchnoj konferencii, pp. 150– 156, Moscow, IPM im. M.V. Keldysha (2017), https://doi.org/10.20948/abrau-2017-15.