Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

VERIFIABLE APPLICATION-LEVEL CHECKPOINT AND RESTART FRAMEWORK FOR PARALLEL COMPUTING

I. Gankevich a, I. Petriakov, A. Gavrikov, D. Tereshchenko, G. Mozhaiskii

Saint Petersburg State University, 13B Universitetskaya Emb., St Petersburg 199034, Russia

E-mail: a i.gankevich@spbu.ru

Fault tolerance of parallel and distributed applications is one of the concerns that become topical for large computer clusters and large distributed systems. For a long time the common solution to this problem was checkpoint and restart mechanisms implemented at the operating system level; however, they are inefficient for large systems, and application-level checkpoint and restart is now considered a more efficient alternative. In this paper we implement application-level checkpoint and restart manually for well-known parallel computing benchmarks in order to evaluate this alternative approach. We measure the overheads introduced by creating a checkpoint and restarting from it, and the amount of effort that is needed to implement and verify the correctness of the resulting programme. Based on the results we propose a generic framework for application-level checkpointing that simplifies the process and makes it possible to verify that the application gives correct output when restarted from any checkpoint.

Keywords: fault tolerance, message passing interface, MPI, miniFE, NAS parallel benchmarks, DMTCP

Ivan Gankevich, Ivan Petriakov, Anton Gavrikov, Dmitrii Tereshchenko, Gleb Mozhaiskii

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Current parallel computing technologies do not have automatic fault tolerance built in, and researchers rely on external tools and application developers to make applications tolerant to cluster node failures. The popular Message Passing Interface (MPI) provides means of communication but no means to manage application state. As a result, the state of many applications that use MPI is stored in local and global variables that are not managed by MPI and cannot be automatically saved to or restored from a file (or any other medium). This deficiency led to the creation of external tools that periodically stop an MPI application, dump the memory contents of all parallel processes to a file and resume the execution [1, 2]. This technique is called system-level checkpoint and restart; its obvious disadvantages are that it is inefficient for a large number of parallel processes and that it saves too much data if the checkpoint is triggered during a phase of peak memory usage. An alternative approach is to modify the application so that it saves all the variables to a file on every n-th iteration of the main programme loop and restores them from the file before the main loop starts. This approach is called application-level checkpoint and restart, and it is more efficient than system-level checkpoints because it saves the minimum amount of data required to restore the application from the file. In this paper we evaluate application-level and system-level checkpoint and restart on a set of parallel applications.
We implement application-level checkpoints for the NAS Parallel Benchmarks [3] and the miniFE [4] application, measure the overhead and programming effort, and compare them with system-level checkpoints created with DMTCP [2]. Based on this experience we have written the MPI-Checkpoint library, which contains a set of routines that can be added to MPI to help manage global application state and implement application-level checkpoints.

2. MPI-Checkpoint library

The closest library that provides checkpoint and restart functionality for MPI applications is CRAFT [5]; however, this library is written in C++ and is not compatible with C and Fortran applications. Our approach is to reuse functionality provided by MPI to simplify our library: we reuse data type handling and global process communication. From this perspective, our library can be considered a set of routines that can be added to MPI to provide application state management via checkpoints, rather than a standalone full-featured checkpoint library. The library provides the following routines:

• MPI_Checkpoint_create – open a checkpoint file for writing;
• MPI_Checkpoint_write – write application state to the file;
• MPI_Checkpoint_restore – open a checkpoint file for reading;
• MPI_Checkpoint_read – read application state from the file;
• MPI_Checkpoint_close – close the file.

They are used as follows. On every n-th iteration of the main loop each MPI process creates its own checkpoint file and writes the application state (the values of all relevant local and global variables) to this file. All files are stored in the same directory and their names equal the ranks of the corresponding MPI processes. Before the main loop each MPI process tries to restore from the checkpoint file: on success the values of all relevant variables are read from the file and the loop starts from the corresponding iteration.

From a technical standpoint, the public interface of the library permits the use of any medium to store checkpoints (file systems, main memory of spare nodes etc.), but the reference implementation supports only file systems. Input/output is implemented using memory-mapped files and is efficient for large files, because old pages that have already been read from or written to the file are discarded from memory. From the user's perspective, in order to restore from a checkpoint the environment variable MPI_CHECKPOINT should be set to the file system path of the checkpoint directory. Since every MPI process works with its own checkpoint file, the files can be stored either in a parallel or in a local file system. If a local file system is used, the processes have to be restarted on exactly the same cluster nodes in order to read these files. The advantage of this approach, however, is that it may be more scalable than a parallel file system, because the cluster network is not used for input/output.
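The per-rank file layout and the MPI_CHECKPOINT environment variable can be illustrated with a short sketch. The fragment below is not the library's actual implementation: the helper name open_rank_checkpoint, the fixed path length and the use of standard buffered I/O instead of memory-mapped I/O are assumptions made for illustration only.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: locate the checkpoint file of the calling process.
   The directory comes from the MPI_CHECKPOINT environment variable and the
   file name equals the rank of the process, as described in the text. */
FILE* open_rank_checkpoint(MPI_Comm comm, const char* mode) {
    const char* dir = getenv("MPI_CHECKPOINT");
    if (dir == NULL) { return NULL; } /* no restore was requested */
    int rank = 0;
    MPI_Comm_rank(comm, &rank);
    char path[4096];
    snprintf(path, sizeof(path), "%s/%d", dir, rank); /* file name = rank */
    return fopen(path, mode); /* NULL if the file does not exist */
}

A helper of this kind would be called on every rank by the restore routine; returning NULL when the environment variable is unset lets the application fall back to a normal start.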
3. Benchmarks

Using the MPI-Checkpoint library we implemented application-level checkpoints for the NAS parallel benchmarks and miniFE. Our approach is based on the fact that most parallel batch processing applications follow the bulk-synchronous parallel model [6]: they are organised as a series of sequential steps (the main loop) that are internally parallel. After each step there is a synchronisation point, and this is where we create the checkpoint file. We restore from the checkpoint file before the main loop. The disadvantage of this approach is that the initialisation of the programme (i.e. the code before the main loop) is performed once again before the restoration. The approach is presented in listing 1.

int step_min = 0;
MPI_Checkpoint checkpoint = MPI_CHECKPOINT_NULL;
int ret = MPI_Checkpoint_restore(MPI_COMM_WORLD, &checkpoint);
if (ret == MPI_SUCCESS) {
    MPI_Checkpoint_read(checkpoint, &step_min, 1, MPI_INT);
    ... // read more variables
    MPI_Checkpoint_close(&checkpoint);
}
for (int step=step_min; step<=step_max; ++step) {
    ... // some application logic code
    int ret = MPI_Checkpoint_create(MPI_COMM_WORLD, &checkpoint);
    if (ret == MPI_SUCCESS) {
        MPI_Checkpoint_write(checkpoint, &step, 1, MPI_INT);
        ... // write more variables
        MPI_Checkpoint_close(&checkpoint);
    }
}

Listing 1. Main loop augmented with application-level checkpoint and restart functionality. The public library calls are the MPI_Checkpoint_* routines.

Using this approach we implemented checkpoints for the CG, EP, FT, IS, LU, MG and BT benchmarks and for the reference implementation of miniFE, and it took a moderate amount of effort. We stored the initial code without checkpoints in Git [7] and then in each commit implemented a checkpoint for a particular benchmark. According to the Git log (https://github.com/igankevich/npb-checkpoints, https://github.com/igankevich/miniFE-checkpoints) we spent only four working days to write and verify the code needed for the checkpoints in all the benchmarks; the rest of the time was spent on improving the public interface of the library, implementing the Fortran public interface, compression and memory-mapped input/output.

We verified the correctness of an application restarted from a checkpoint by using the automated verification code that is built into the NAS parallel benchmarks and by comparing the residual of the miniFE benchmark. If we produce a checkpoint on every iteration, we get a set of directories containing checkpoint files (one directory per iteration). We then restart the application from each such directory and verify the application output; a sketch of such a verification driver is given at the end of this section. If all verifications succeed, then the application-level checkpoint code is correct (i.e. we saved all the required variables). For many real-world applications verification can be implemented as a bytewise comparison of the output data; for applications that use pseudo-random numbers, integral properties of the output can be used for verification.

In addition to application-level checkpoints we implemented DMTCP checkpoints in our library. When the DMTCP mode is enabled, on each call to MPI_Checkpoint_create the library instructs the coordinator process to create a full application memory dump. MPI_Checkpoint_restore is a no-op in this mode, since the restoration happens using the shell script generated by DMTCP.
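To make the verification procedure concrete, the sketch below shows what such a driver could look like. It is not part of the library or of the benchmarks: the directory naming scheme (ckpt.0, ckpt.1, ...), the mpirun command line, the ./app executable name and the assumption that a failed verification is reported through a non-zero exit code are all hypothetical.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical verification driver: restart the application from every
   checkpoint directory and check that each run verifies successfully. */
int main(int argc, char* argv[]) {
    int num_checkpoints = (argc > 1) ? atoi(argv[1]) : 0;
    for (int i = 0; i < num_checkpoints; ++i) {
        char command[512];
        /* MPI_CHECKPOINT points at the directory to restore from. */
        snprintf(command, sizeof(command),
                 "env MPI_CHECKPOINT=ckpt.%d mpirun -np 16 ./app", i);
        if (system(command) != 0) {
            fprintf(stderr, "verification failed for checkpoint %d\n", i);
            return 1;
        }
    }
    printf("all %d checkpoints passed\n", num_checkpoints);
    return 0;
}

For the NAS parallel benchmarks the check can instead parse the benchmark output for the built-in verification message, and for miniFE it can compare the reported residual with the one from an uninterrupted run.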
4. Results

We ran the performance benchmarks multiple times for all applications, in both DMTCP and MPI-Checkpoint modes and with a varying number of MPI processes. We measured the checkpoint size on disk, the checkpoint creation time (overhead) and the total execution time of the application. To store the checkpoints we used the GlusterFS parallel file system deployed on the same cluster nodes where the applications run. The full testbed configuration is listed in table 1.

Table 1. Hardware and software configuration

DMTCP      version 2.6.0, arguments: --no-gzip
MPICH      version 3.3.2, environment variables: HYDRA_RMK=user
NPB        version 3.4, class C
miniFE     version 2.0, parameters: nx=300, ny=300, nz=300
Compiler   GCC 7.5.0, compilation flags: -O3 -march=native
Cluster    6 nodes, 2 processors per node, 4 cores per processor, 2 threads per core (96 threads in total), 1 Gbit network switch
GlusterFS  version 8.0, two replicas for each file (the same nodes and network switch as the cluster)

The performance benchmarks showed that the total size of both MPI and DMTCP checkpoints grows linearly with the number of MPI processes: for the miniFE application the growth rates are 0.2% and 4% per node (16 parallel processes) respectively (see fig. 1). For miniFE both MPI and DMTCP checkpoint creation time decreases with the number of processes (see fig. 2); this can be explained by the fact that the throughput of a single port of the network switch is fully utilised, while the overall switch throughput is not (its utilisation increases with the number of ports used). For all NAS parallel benchmarks (except MG) this time decreases when we go from the single-node to the two-node configuration and then increases (see fig. 2); the decrease can be explained in the same way, and the subsequent increase by the fact that the parameters of the NAS parallel benchmarks are determined by the number of MPI processes.

Figure 1. The total size of checkpoint files for DMTCP and MPI

Figure 2. MPI and DMTCP checkpoint creation time for NAS parallel benchmarks and miniFE

5. Conclusion

We evaluated application-level and system-level checkpoint and restart on a set of benchmarks that replicate the behaviour of real-world applications. Contrary to our expectations, we found that it takes little programming effort to implement application-level checkpoints, even for someone who sees the source code of the application for the first time. Our performance benchmarks confirmed that application-level checkpoints are much smaller and take less time to create than system-level checkpoints. Our cluster was too small to confirm that the time needed to create checkpoint files increases with the number of nodes (our benchmarks showed that it actually decreases or does not change much). Based on our experience we proposed a minimal set of routines that can be added to MPI to create application-level checkpoints. We believe that the effort application developers need to put into implementing application-level checkpoints is much smaller than the effort application users put into configuring system-level checkpoints: during our benchmarks we encountered several cases in which a programme restarted from a DMTCP checkpoint hung; DMTCP does not work with the OpenMPI library (we did not find a working solution to this problem); and DMTCP does not work if one wants to restart the application on a different set of nodes. For efficiency and reliability reasons, developers of new MPI applications should consider implementing application-level checkpoints. Hopefully, this paper and our public-domain library (https://github.com/igankevich/mpi-checkpoint) will help in this regard.

6. Acknowledgement

This work is supported by the Council for Grants of the President of the Russian Federation (grant no. MK-383.2020.9).

References
[1] Hargrove P. H., Duell J. C. Berkeley lab checkpoint/restart (BLCR) for Linux clusters // Journal of Physics: Conference Series. – IOP Publishing, 2006. – Vol. 46. – Issue 1. – P. 067.
[2] Ansel J., Arya K., Cooperman G. DMTCP: Transparent checkpointing for cluster computations and the desktop // 2009 IEEE International Symposium on Parallel & Distributed Processing. – IEEE, 2009. – P. 1-12.
[3] Van der Wijngaart R. F., Wong P. NAS parallel benchmarks version 2.4. – 2002.
[4] Heroux M. A. et al. Improving performance via mini-applications // Sandia National Laboratories, Tech. Rep. SAND2009-5574. – 2009. – Vol. 3.
[5] Shahzad F. et al. CRAFT: A library for easier application-level checkpoint/restart and automatic fault tolerance // IEEE Transactions on Parallel and Distributed Systems. – 2018. – Vol. 30. – Issue 3. – P. 501-514.
[6] Valiant L. G. A bridging model for parallel computation // Communications of the ACM. – 1990. – Vol. 33. – Issue 8. – P. 103-111.
[7] Torvalds L., Hamano J. Git: fast version control system. – 2010. – http://git-scm.com.