=Paper=
{{Paper
|id=Vol-3041/80-85-paper-14
|storemode=property
|title=COMPASS Production System: Frontera Experience
|pdfUrl=https://ceur-ws.org/Vol-3041/80-85-paper-14.pdf
|volume=Vol-3041
|authors=Artem Petrosyan
}}
==COMPASS Production System: Frontera Experience==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

COMPASS PRODUCTION SYSTEM: FRONTERA EXPERIENCE

A.Sh. Petrosyan1,2

1 Joint Institute for Nuclear Research, 6 Joliot-Curie st., 141980, Dubna, Russia
2 Plekhanov Russian University of Economics, 36 Stremyanny per., 117997, Moscow, Russia

E-mail: artem.petrosyan@jinr.ru

Since 2019, the COMPASS experiment has been running on the Frontera high-performance computer, a large machine ranked number 5 among the most powerful supercomputers in 2019. Details of the software setup and the approaches to organizing data processing on this machine are presented in the article.

Keywords: COMPASS, PanDA, distributed computing, workflow management system, grid, high performance computing

Artem Petrosyan

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The COMPASS experiment [1] regularly gains access to high-performance systems for data processing: in 2016-2018 the experiment worked on Blue Waters, and since 2019 Frontera at the Texas Advanced Computing Center has been used for data processing [2]. Frontera is a big machine, number 5 in the ranking of the most powerful supercomputers in the world in 2019 [3].

Frontera has two computing subsystems: a primary computing system focused on double-precision performance and a secondary subsystem focused on single-precision streaming-memory computing. Frontera also has several storage systems, interfaces to cloud and archive systems, and a set of nodes for hosting virtual servers. The primary computing system is built by Dell EMC on Intel Xeon Platinum 8280 ("Cascade Lake") processors, with 56 cores and 192 GB of RAM per node, linked by a Mellanox InfiniBand HDR and HDR-100 interconnect. The system consists of 8,008 compute nodes. Slurm [4] is used as the resource management system. Frontera's peak performance amounts to 23.5 PFLOPS.

Data processing on high-performance computing (HPC) systems has several major differences compared to working on the ordinary clusters traditionally used in high-energy physics. These are conceptually different systems, but the relative similarity of the compute nodes and system software makes it possible to run the applied software of experiments on supercomputers without changing the source code. The differences are architecture-driven and must be taken into account when organizing processing.
They can be presented as the following list:

• each supercomputer is a machine built in a single copy and has unique characteristics inherent only to it;
• two-factor authentication is the common access method on an HPC, which reduces the availability of the resource for automatic data processing systems: it is not always possible to reach the computing resource from the outside, as is usually done when working with the nodes of a grid environment, so the services responsible for communication with the workload management system must be placed on login or service nodes of the HPC;
• a shared file system requires careful I/O management so as not to overload the file system and interfere with the work of other users;
• cluster management system: each HPC runs its own installation of one of the specially adapted schedulers and resource managers, for example PBS, PBS Pro or Slurm;
• specialized system software;
• inability to use grid middleware such as CVMFS clients, VOMS or data management tools;
• lack of Internet access on compute nodes;
• user and project policies usually impose strict rules on using the computing resource for a limited period of time;
• the time for storing data on local storage is usually limited, and the storage is not available for the whole duration of the project.

From the above it follows that, to organize data processing on such computing systems, a deep modernization of all components of the computing infrastructure of the experiment is required.

2. Software setup

The PanDA workload management system manages the full cycle of data reconstruction jobs on Frontera, including both processing and merging of intermediate results into the final ones [fig. 1]. Unlike the infrastructure implemented for Blue Waters, on Frontera a new task management service named Harvester is used [5, 6, 7].

Figure 1. COMPASS reconstruction workflow

Harvester is a component of the PanDA ecosystem responsible for communication between the WMS and the pilots working on compute nodes. It can be located either on a dedicated node, in the case of working with grid computing resources, or on a service node of a high-performance system. Harvester is a multithreaded service that uses an internal database to store all the necessary information. It can manage not just one submission on a high-performance system, but a set of submissions of various sizes and execution times. Using Harvester, it is possible to organize the processing of a multitude of large jobs on supercomputers. With the advent of this service, the issue of organizing processing on HPC systems has been completely resolved: for example, the ATLAS collaboration has fully integrated all high-performance systems available to it in the USA and Europe [8].

Several modules were added to the Harvester code to control the processing of COMPASS data. Everything needed by the COMPASS job code, such as running a local MySQL database, payload management, error handling and stage-out, was transferred from the Multi-Job Pilot and added as Harvester plug-ins.

Since Frontera is one of the most powerful computing systems in the world, much attention is paid to optimizing I/O operations: when a large number of jobs is running, there is a risk of overloading the metadata server of the shared file system. For this reason, all the software needed to complete a job is moved to a temporary folder on the compute node. The database server used by the tasks, which stores the calibration information, is also copied there. All data transfer operations to and from the node are monitored to avoid overloading the file system.
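To make this staging scheme concrete, the Python sketch below shows how a job's software and the calibration database snapshot could be copied to node-local scratch space before the payload runs, and how output could be moved back afterwards. The paths and helper functions are hypothetical; the sketch does not reproduce the actual COMPASS Harvester plug-ins.

```python
import os
import shutil
import tempfile

# Hypothetical shared-filesystem locations; the real COMPASS paths differ.
SHARED_SOFTWARE_DIR = "/scratch/projects/compass/software"
SHARED_CALIB_DB_DIR = "/scratch/projects/compass/calib_db"

def prepare_node_local_workdir(job_id: str) -> str:
    """Copy the payload software and the calibration DB snapshot to
    node-local storage so the payload does not hit the shared file
    system's metadata server during execution."""
    workdir = tempfile.mkdtemp(prefix=f"compass_{job_id}_", dir="/tmp")
    # Bulk copies keep the number of metadata operations low.
    shutil.copytree(SHARED_SOFTWARE_DIR, os.path.join(workdir, "software"))
    shutil.copytree(SHARED_CALIB_DB_DIR, os.path.join(workdir, "calib_db"))
    return workdir

def stage_out(workdir: str, output_dir: str) -> None:
    """Move the produced output back to the shared file system and
    clean up the node-local scratch area."""
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(os.path.join(workdir, "output")):
        shutil.move(os.path.join(workdir, "output", name), output_dir)
    shutil.rmtree(workdir, ignore_errors=True)
```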
3. Submission tuning

The software infrastructure on Frontera includes 4 Harvester services, each of which is configured to run 5 supercomputer tasks of 50 compute nodes each. In grid terms, this corresponds to 56,000 individual jobs (4 services × 5 tasks × 50 nodes × 56 cores per node) [fig. 2]. To handle such a load, it was necessary to deploy an additional PanDA server reserved exclusively for Frontera. It was also necessary to increase the capacity of the database server storing information about jobs. The size of the submissions reflects the fact that on Frontera larger jobs are given priority over smaller ones. It is chosen so that, on the one hand, access to resources is obtained quickly enough and, on the other hand, the IT infrastructure of the experiment that controls data processing is not overloaded.

Figure 2. COMPASS jobs on Frontera

Since the entire data reconstruction chain is performed on Frontera, including reconstruction jobs as well as merging jobs for each result type (mDST, histograms and files of selected events), Harvester was tuned for optimal performance. Reconstruction jobs are usually completed within 7-8 hours, while merging jobs are completed within an hour. To organize such work, two queues were registered in PanDA: one for reconstruction jobs and one for merging jobs. In Harvester, each queue has its own settings: the queue for reconstruction jobs requests 50 nodes for 10 hours, and the queue for merging jobs requests 1 node for 2 hours. The requested time exceeds the expected job duration in order to prepare the input data, to have time to stage out the output, and to ensure reliable execution of jobs in the event of possible slowdowns of the shared file system.
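These two queue profiles translate into the node and walltime requests passed to Slurm. The Python sketch below is purely illustrative, not the actual Harvester submitter: it builds sbatch command lines for the two profiles described above, with hypothetical job names and batch script paths.

```python
import subprocess

# Walltime requests deliberately exceed the typical job duration
# (7-8 h for reconstruction, ~1 h for merging) to leave room for input
# preparation, stage-out and possible file-system slowdowns.
QUEUE_PROFILES = {
    "reconstruction": {"nodes": 50, "walltime": "10:00:00"},
    "merging": {"nodes": 1, "walltime": "02:00:00"},
}

def submit(queue: str, batch_script: str) -> str:
    """Submit a batch script to Slurm with the node count and walltime
    of the given queue profile; returns sbatch's stdout (the job id)."""
    profile = QUEUE_PROFILES[queue]
    cmd = [
        "sbatch",
        "--nodes", str(profile["nodes"]),
        "--time", profile["walltime"],
        "--job-name", f"compass_{queue}",  # hypothetical job name
        batch_script,                       # hypothetical script path
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()

# Example: submit("reconstruction", "run_reconstruction.sh")
```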
4. PanDA server tuning

Usually, when working on large computing systems, priority is given to larger jobs. In the case of Frontera, jobs of 50 nodes are considered small and can spend up to a week waiting in the queue before reaching compute nodes. Such behavior is inconsistent with the concept of high-throughput computing, within which all components of the distributed computing infrastructure, such as PanDA, were developed. In the grid, tight connectivity with advanced monitoring tools is practiced: each job reports its status every 30 minutes. This is done to minimize the wait time for computing resources and to ensure a stable number of running jobs. When a submission waits in the queue for a week, such high connectivity becomes not only unnecessary, it becomes a generator of excess load on all components of the computing infrastructure.

The PanDA server is normally restarted once a day to reset the logger handlers and to download the certificate revocation list. However, even the fastest server restart under high load results in the loss of monitoring packets sent by the Harvester services to the PanDA server. A missed job status message leads to the job being considered lost: the PanDA server sends a message to Harvester to remove such a job from the queue. In a grid environment, each job is processed individually, and the loss of several jobs does not lead to noticeable consequences. In the case of high-performance systems, deleting one job results in the deletion of the entire submission of 2,800 jobs, which is completely unacceptable.

To eliminate this behavior, the PanDA server was reconfigured: the logging level was set to CRITICAL, the fetch-crl service was disabled, and the daily server restart was disabled. A comparative profile of I/O operations before and after the reconfiguration is shown in Figure 3.

Figure 3. I/O operations profile on the PanDA server

Reconfiguring the PanDA server resolved the missed packets issue. Now the PanDA server can work for several months without restarting.
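The PanDA server is a Python application, so raising the logging threshold is a standard change. The snippet below is only a generic illustration of the idea, not the actual PanDA server configuration: with the level raised to CRITICAL, routine per-job messages are suppressed and no longer generate log I/O.

```python
import logging

# Generic illustration: with the threshold raised to CRITICAL, frequent
# INFO/DEBUG records produced while tracking thousands of jobs are
# dropped before they cause any disk I/O.
logging.basicConfig(level=logging.CRITICAL)
logger = logging.getLogger("panda_server")  # hypothetical logger name

logger.info("job %s heartbeat received", 12345)  # suppressed
logger.critical("database connection lost")      # still logged
```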
5. Summary

The use of high-performance systems places increased demands on all components of the data processing infrastructure. This is due to the abrupt load profile characteristic of computing on such systems: when working on a grid farm, jobs are delivered to computational nodes and launched one at a time, while on a supercomputer several thousand jobs are grouped into one submission and, after waiting in the queue, start almost simultaneously. In addition, submissions for a larger number of nodes receive a higher priority in the queue on such systems, so it is necessary to keep as many large submissions in the queue as possible.

A new implementation of the pilot application was developed for work on Frontera. A diagram of the data processing organization on the Frontera HPC is shown in Figure 4.

Figure 4. Data processing on Frontera

Submissions of 50 computational nodes, each of which corresponds to 2,800 individual jobs, were used for processing. The supercomputer was used to carry out the entire chain of data reconstruction tasks, including the aggregation of results. The management of tasks and jobs on this resource was fully integrated into the production system of the COMPASS experiment and was performed in the same way as data processing management on grid clusters [9, 10].

6. Acknowledgements

The study was supported by the Russian Science Foundation grant (project No. 19-71-30008).

References

[1] P. Abbon et al., The COMPASS experiment at CERN, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Vol. 577, pp. 455-518, 2007
[2] Texas Advanced Computing Center, the University of Texas at Austin, available at https://www.tacc.utexas.edu/ (accessed 11.08.2021)
[3] The Top500 list, available at https://www.top500.org/lists/top500/2019/06/ (accessed 9.09.2021)
[4] Slurm workload manager, available at https://slurm.schedmd.com/ (accessed 9.09.2021)
[5] A. Petrosyan, COMPASS Production System: Processing on HPC, CEUR Workshop Proceedings, Vol. 2267, pp. 139-144, 2018
[6] Harvester, available at https://github.com/HSF/harvester/wiki (accessed 10.09.2021)
[7] F.H. Barreiro Megino et al., PanDA for ATLAS distributed computing in the next decade, Journal of Physics: Conference Series, Vol. 898, 2017
[8] T. Maeno, Harvester: an edge service harvesting heterogeneous resources for ATLAS, EPJ Web of Conferences, Vol. 214, 2019
[9] A. Petrosyan, COMPASS Production System Overview, EPJ Web of Conferences, Vol. 214, 2019
[10] A. Petrosyan, D. Malevanniy, Distributed data processing of the COMPASS experiment, CEUR Workshop Proceedings, Vol. 2507, 2019