=Paper=
{{Paper
|id=Vol-2267/139-144-paper-25
|storemode=property
|title=COMPASS production system: processing on HPC
|pdfUrl=https://ceur-ws.org/Vol-2267/139-144-paper-25.pdf
|volume=Vol-2267
|authors=Artem Sh. Petrosyan
}}
==COMPASS production system: processing on HPC==
Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10-14, 2018

COMPASS PRODUCTION SYSTEM: PROCESSING ON HPC

A.Sh. Petrosyan

Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna, Moscow region, 141980, Russia

E-mail: artem.petrosyan@jinr.ru

Since the fall of 2017, COMPASS has processed its data in a heterogeneous computing environment which includes computing resources at CERN and JINR. The computing sites of the infrastructure work under the management of the workload management system PanDA (Production and Distributed Analysis System). At the end of December 2017, integration of the Blue Waters HPC to run COMPASS production jobs began. Unlike an ordinary computing site, each HPC has many specific features which make it unique: hardware, batch system type, job submission and user policies, et cetera. That is why there is no ready out-of-the-box solution for any HPC; development and adaptation are needed in each particular case. PanDA Pilot has a version for processing on HPCs, called Multi-Job Pilot, which was originally prepared to run ATLAS simulation jobs on the Titan HPC. To run COMPASS production jobs, the Multi-Job Pilot was extended. The COMPASS Production System was also extended to allow defining, managing and monitoring jobs running on Blue Waters.

Keywords: COMPASS, PanDA, workload management system, distributed data management, production system, HPC

© 2018 Artem Sh. Petrosyan

1. Introduction

Processing on High Performance Computing (HPC) machines has several features which distinguish it from processing on Grid sites, traditionally used for processing in High Energy Physics (HEP). These differences are driven by the architecture of such machines and must be taken into account in order to organize processing smoothly. They can be summarized as follows:
- unique hardware setup: each HPC is a unique machine;
- two-factor authentication, the usual authentication method on HPCs, reduces the accessibility of the resource to automatic processing systems;
- a shared file system requires careful management of I/O operations in order, at the very least, not to disturb other users and the system as a whole;
- batch system: each HPC has its own installation of one of the versions of a scheduler, such as PBS or PBS Pro;
- a unique set of software packages, managed by the system administrators;
- absence of Grid middleware, such as CVMFS, VOMS clients and data management tools;
- absence of an Internet connection on the worker nodes;
- user and project policies usually require strict usage of the resource during a limited period of time; storage is also limited and available only for the lifetime of the project.

So, unlike on a Grid site, processing on HPC puts much higher demands on each link of the processing chain in order to be safe and effective. Having an allocation on an HPC does not mean that jobs will be executed faster than the jobs of other users and projects. On a traditional Grid site a user's job runs in an isolated environment, while on an HPC careful management of the CPU and I/O signatures of running processes plays a crucial role in keeping the system and the processes of other users safe.

In 2016 COMPASS [1] received its first allocation on the Blue Waters HPC [2], located at the University of Illinois at Urbana-Champaign.
In late 2017 the project of adapting the COMPASS Production System [3-5] to work with Blue Waters started.

2. Blue Waters overview

Blue Waters is a Cray [6] hybrid machine composed of AMD 6276 "Interlagos" processors (nominal clock speed of at least 2.3 GHz) and NVIDIA GK110 (K20X) "Kepler" accelerators, all connected by the Cray Gemini torus interconnect. Since COMPASS software, like most HEP software, is not yet able to run on the Kepler nodes, only the Interlagos nodes are used for data processing. There are 22,640 CPU-only compute nodes, 96 of which have 128 GB of memory while the rest have 64 GB, and 4,228 GPU-accelerated compute nodes, 96 of which have 64 GB while the rest have 32 GB. In total, 26.5 PB of storage managed by Lustre are available, as well as a 250+ PB tape storage.

Blue Waters is comprised of a robust and capable Local Area Network coupled with a redundant Wide Area Network to provide leadership-class data transfer capabilities and resiliency. Through active monitoring and data collection this network is kept at optimal performance. At the same time, network access from the worker nodes is not recommended, and the results of processed jobs must be stored on the shared file system in the project or scratch directory, in order to be staged out later or written to tape for long-term storage. It is recommended to use Globus Online (GO) [7] for file transfers to and from Blue Waters. Blue Waters has dedicated import/export resources to provide superior I/O access to the file systems.

The batch environment is Torque/MOAB from Adaptive Computing, which talks to Cray's Application Level Placement Scheduler (ALPS) to obtain resource information. The aprun utility is used to start jobs on the compute nodes. Its closest analogues are mpirun or mpiexec as found on many commodity clusters. Unlike on clusters, the use of aprun is mandatory on Blue Waters, which is not a Linux cluster but a massively parallel processing (MPP) system, in order to start any job, including non-MPI ones that run on a single node. If the PBS [8] script does not use aprun to start the application, the latter will start on a service node, which is a shared resource, and that is a violation of the usage policy. Aprun supports options that are unique to the Cray: there are flags to set process affinity by node, control thread placement, and set the memory policy between the nodes of a job.
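To make the role of aprun concrete, below is a minimal Python sketch, roughly in the spirit of what a submission layer such as the Multi-Job Pilot has to do: write a Torque/PBS script in which even a single-node, non-MPI payload is started through aprun, and hand it to qsub. The queue name, resource request, walltime and the payload script compass_payload.sh are illustrative assumptions, not the actual COMPASS configuration.

<pre>
# Sketch only: generate a PBS script that launches a single-node payload
# through aprun and submit it with qsub.  All values are illustrative.
import subprocess
import tempfile

PBS_TEMPLATE = """#!/bin/bash
#PBS -q normal
#PBS -l nodes=1:ppn=32:xe
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
# -n 1 : one application instance; -d 32 : reserve 32 cores for it.
# Without aprun the payload would run on a shared service node.
aprun -n 1 -d 32 ./compass_payload.sh
"""

def submit_pbs_job():
    """Write the PBS script to a temporary file and submit it with qsub."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(PBS_TEMPLATE)
        script_path = f.name
    # qsub prints the identifier of the created batch job
    job_id = subprocess.check_output(["qsub", script_path]).decode().strip()
    return job_id

if __name__ == "__main__":
    print("Submitted job", submit_pbs_job())
</pre>

On a commodity cluster the aprun line would simply be the payload command itself; on Blue Waters omitting it would place the payload on a shared service node, in violation of the usage policy.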
3. Production system adaptation for processing on Blue Waters

The COMPASS production system automates all steps of data processing, including job generation, retries, consistency checks, stage-in and archiving of data to the tape storage system. In the case of Blue Waters, however, raw data is delivered manually by the project manager via a Globus Online endpoint, so the steps which cover stage-in and stage-out may be turned off. To enable processing on Blue Waters, the task definition and job execution components of the production system had to be changed. The full list of changes is the following:
- task definition: site selection and the raw data location on Blue Waters were added to the user interface;
- data management components: stage-in and stage-out are turned off for Blue Waters tasks;
- job execution on the site: the PanDA Multi-Job Pilot [9-11] is used.

More details on each item are presented below.

3.1. Task definition

Since automatic data delivery to Blue Waters is not yet available, manual task assignment is used in the production system. The production manager selects the site (Blue Waters) on which a set of jobs will be executed. The task definition page is presented in Figure 1.

Figure 1. Task definition user interface

3.2. Data management

In the case of a Blue Waters task, data stage-in from Castor tapes to Castor disks is not needed. This step was turned off, together with the stage-out step which moves data from EOS to Castor; at the moment both steps are performed by the production manager. All other automated steps of task definition and job generation work in the same manner as for regular tasks.

3.3. Jobs execution on Blue Waters

For the two previous steps only a few changes were needed to enable processing on Blue Waters. Many more changes had to be made in the job execution layer.

Blue Waters requires two-factor authentication to log in. This means that, in order to submit jobs to the local batch system, the user must be logged in; credentials cannot be generated somewhere outside, as is usually done on regular Grid sites, where authentication is based on the X509 user proxy of the submitter. On HPCs access via an X509 user proxy is not available either; moreover, a proxy cannot be generated on the machine due to the absence of the VOMS client packages. But, since Internet access is enabled on the login nodes, job definitions may be requested from the outside. On HPC systems, once the user is logged in, he can run management software which requests jobs from a remote source, or simply submit jobs manually. For the PanDA job execution layer the following scheme is used (Figure 2):

Figure 2. Running jobs via Multi-Job Pilot on HPC

- COMPASS software is deployed by the Blue Waters production manager in the group directory, which is also defined in the task definition interface of the production system;
- data is delivered by the COMPASS group leader on Blue Waters to the group project directory, also defined in the task definition interface of the production system;
- a Python daemon is used to organize pilot rotation on the service node (the edge node in Figure 2): it launches and keeps the desired number of pilot processes and also delivers the X509 user proxy;
- the PanDA Multi-Job Pilot is used to run COMPASS payloads.

The Pilot works as a management process which requests and runs the desired number of jobs on the HPC. The difference between the standard Grid Pilot and the Multi-Job Pilot is the following: the Grid Pilot is executed on a Grid worker node by the local resource manager, while the Multi-Job Pilot is also management software: it prepares a submission, sends it to the local batch system and provides monitoring. Each Pilot performs one PBS submission. The size of the submission is defined in the configuration; on Blue Waters each Pilot runs 512 jobs on 16 nodes. If not enough such jobs are available on the PanDA server, the Pilot submits a smaller bunch.
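The pilot rotation on the edge node can be illustrated with the following minimal Python sketch, assuming a hypothetical pilot entry point multi_job_pilot.py, a fixed number of pilot slots and a proxy path; the actual daemon used for COMPASS may be organized differently.

<pre>
# Sketch only: keep a fixed number of pilot processes alive on the edge node,
# starting a new one whenever a pilot finishes, and hand them the X509 proxy.
import os
import subprocess
import time

PILOT_CMD = ["python", "multi_job_pilot.py"]   # hypothetical pilot entry point
N_PILOTS = 4                                   # desired number of concurrent pilots
PROXY = "/path/to/x509_user_proxy"             # proxy delivered to the pilots

def run_forever():
    pilots = []
    while True:
        # drop pilots that have terminated
        pilots = [p for p in pilots if p.poll() is None]
        # top up to the desired number of pilot processes
        while len(pilots) < N_PILOTS:
            env = dict(os.environ, X509_USER_PROXY=PROXY)
            pilots.append(subprocess.Popen(PILOT_CMD, env=env))
        time.sleep(60)

if __name__ == "__main__":
    run_forever()
</pre>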
3.4. Calibration database

Each COMPASS reconstruction job reads a file from the calibration database. At CERN this database runs on a dedicated MySQL node and jobs access it directly via the Internet. In the case of Blue Waters two new aspects appear:
- on the worker nodes Internet connectivity is limited, and the HPC administration recommends not to use the Internet connection at all;
- the number of simultaneously running COMPASS jobs on Blue Waters may reach 200 000, which is more than 10 times higher than on the resources at CERN, and the database server simply cannot handle such a load.

It is clear that, in a situation with unpredictable load jumps and low throughput, the database must be deployed locally on Blue Waters. But even running one or two database instances cannot solve the problem, and a decision was made to run one database on each computing node. This approach has several pro and contra arguments:
- contra:
  - too many database instances: if submissions occupy 1000 nodes, 1000 database instances will be running;
- pro:
  - the database runs alongside one of the 32 jobs on the node; the first job which comes to the node starts a database instance, so the CPU occupation of the node is 100% and no computing resources are lost;
  - if some jobs fail because the database has not started, only 32 jobs are lost;
  - no communication between jobs running on different nodes is needed: since the database runs on the local node, each job simply connects to localhost;
  - the PanDA Pilot takes care of the running processes and, when all jobs are finished, removes all zombie and any other remaining processes before leaving the computing node.

So, each first job on a node starts a database instance before executing the COMPASS payload (a sketch of this logic is shown below). Such an approach, in combination with the submission size of 512 jobs on 16 nodes, solves the problem of calibration database overload by too many clients: the number of instances grows with the number of running jobs, and access to each database is limited to the jobs running on a single computing node.
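The per-node database start-up can be sketched as follows, assuming a node-local flag file and an illustrative mysqld command; the actual COMPASS job wrapper may differ. The first job to create the flag file launches the database, and every job then waits until the server on localhost accepts connections.

<pre>
# Sketch only: elect the first job on the node to start the calibration DB,
# then have all 32 jobs on the node connect to it via localhost.
import os
import socket
import subprocess
import time

FLAG = "/tmp/compass_calib_db.started"          # node-local flag file (assumption)
DB_CMD = ["mysqld", "--datadir=/tmp/calib_db"]  # illustrative database command

def ensure_local_db():
    """Start the calibration DB if this is the first job on the node."""
    try:
        # O_EXCL guarantees that only one job on the node creates the flag
        fd = os.open(FLAG, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        subprocess.Popen(DB_CMD)        # first job: launch the database
    except FileExistsError:
        pass                            # another job already started it
    wait_until_db_accepts_connections("127.0.0.1", 3306)

def wait_until_db_accepts_connections(host, port, timeout=120):
    """Poll the local database port until it accepts connections."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.create_connection((host, port), timeout=5).close()
            return
        except OSError:
            time.sleep(2)
    raise RuntimeError("calibration database did not start in time")
</pre>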
3.5. Submission tuning

On HPC resources the expected execution time of a job must be defined before the job is submitted. Moreover, the execution time must be defined for a whole bunch of jobs, in the COMPASS case for 512 jobs. It is therefore very important to prepare submissions of jobs with expected execution times as close to each other as possible: this saves resources and guarantees a uniform load of the CPUs on the computing nodes. In the COMPASS Production System the number of events in each raw file is available and is used to select and submit jobs ordered by the number of events.

On an HPC resource, jobs have to wait in the queue after being submitted. Each job has a priority which is calculated from many parameters: the load of the machine, the size of the submission, the priority of the project and the user, the requested execution time and the requested queue, etc. In order to make jobs start faster, several techniques, described below, are used.

Being ordinary Grid jobs, COMPASS jobs are independent and may run on one multicore machine without interfering with each other. Since no single job communicates with the others while running, there is no communication between nodes either. There is an option (flags=commtransparent) which may be set to identify a set of independently running jobs; it allows a submission to start faster.

All steps of processing are performed on Blue Waters: reconstruction, merging of reconstruction job results, merging of histograms and merging of event dumps (filtered events are selected and stored in event dumps for fast analysis). Jobs of each processing type take a different time to run. The longest ones are reconstruction jobs, which run up to 18 hours. The fastest are event dump merging jobs, which usually run for 30 minutes; merging of histograms takes up to 1 hour. On an HPC resource a shorter submission usually starts faster than a longer one. In order to optimize submissions, a combination of logical queues is used in PanDA and on Blue Waters: via the long queue in PanDA, reconstruction jobs are submitted to the normal queue on Blue Waters; via the short queue, all merging jobs are submitted to the short queue on Blue Waters with different requested execution times. Using these methods allows jobs to spend less time in the queue. A sketch of grouping jobs into uniform bunches is shown below.
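The grouping of jobs into bunches of similar length can be sketched as follows; the per-event time and the safety factor are illustrative assumptions, not measured COMPASS values.

<pre>
# Sketch only: sort jobs by the number of events in the raw file and cut them
# into 512-job bunches, so that each PBS submission contains jobs of similar
# length and a common walltime can be requested for the whole bunch.
SECONDS_PER_EVENT = 0.5     # assumed average reconstruction time per event
SAFETY_FACTOR = 1.2         # headroom added to the requested walltime
BUNCH_SIZE = 512            # jobs per PBS submission (16 nodes x 32 cores)

def make_bunches(jobs):
    """jobs: list of (job_id, n_events); returns (bunch, walltime_sec) pairs."""
    ordered = sorted(jobs, key=lambda j: j[1])          # order by event count
    bunches = []
    for i in range(0, len(ordered), BUNCH_SIZE):
        bunch = ordered[i:i + BUNCH_SIZE]
        # the slowest job in the bunch defines the requested walltime
        longest = max(n for _, n in bunch)
        walltime = int(longest * SECONDS_PER_EVENT * SAFETY_FACTOR)
        bunches.append((bunch, walltime))
    return bunches

if __name__ == "__main__":
    demo = [("job%d" % i, 1000 + 10 * i) for i in range(1200)]
    for bunch, wt in make_bunches(demo):
        print(len(bunch), "jobs, requested walltime", wt, "s")
</pre>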
4. Conclusion

Software services developed to process the data of the experiments at the Large Hadron Collider are turning from unique systems into software products. Projects such as PanDA, Rucio and AGIS, which were initially developed in the interest of one collaboration, are now used in various areas, not only in HEP experiments, helping to organize heterogeneous computing environments. One of these products, PanDA, has been used to process COMPASS experiment data at CERN since 2017 and is now prepared to run jobs on the Blue Waters HPC.

The COMPASS Production System was designed to run jobs predominantly at CERN, where the COMPASS allocation usually limits the number of simultaneously running jobs to 15 000; when the Condor computing element is not overloaded by other experiments, the number of running jobs may reach 20 000. At the same time, there are more than 20 000 computing nodes available on Blue Waters, each of them with 32 CPU cores. The target number of jobs to be executed on Blue Waters is 150 000, which corresponds to almost 5 000 occupied nodes. Processing during August 2018 has shown that a setup with one single PanDA server, the Multi-Job Pilot on Blue Waters and the standard Pilot on CERN Condor can handle 50 000 simultaneously running jobs under the management of one Production System. Such a load was achieved on a setup with the PanDA database, server and Auto Pilot Factory running together on the same physical machine, deployed at the JINR Cloud Service [12]. But, to reach the target reliably, the computing infrastructure has to be changed: the PanDA database, server and Auto Pilot Factory must run on separate nodes. In case of very high load, more PanDA servers can be added. Such a setup is flexible and reliable enough to handle the target number of running jobs. Migration to PanDA Harvester is also being considered.

References

[1] Abbon P. et al. The COMPASS experiment at CERN // Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment. – 2007. – Vol. 577, Issue 3. – P. 455-518.

[2] Blue Waters overview. – https://bluewaters.ncsa.illinois.edu/blue-waters-overview

[3] Petrosyan A.Sh. PanDA for COMPASS at JINR // Physics of Particles and Nuclei Letters. – 2016. – Vol. 13, Issue 5. – P. 708-710. – https://link.springer.com/article/10.1134/S1547477116050393

[4] Petrosyan A.Sh., Zemlyanichkina E.V. PanDA for COMPASS: processing data via Grid // CEUR Workshop Proceedings. – Vol. 1787. – P. 385-388. – http://ceur-ws.org/Vol-1787/385-388-paper-67.pdf

[5] Petrosyan A.Sh. COMPASS Grid Production System // CEUR Workshop Proceedings. – Vol. 2023. – P. 234-238. – http://ceur-ws.org/Vol-2023/234-238-paper-37.pdf

[6] Cray. – https://www.cray.com/

[7] Globus Online. – https://www.globus.org/

[8] Adaptive Computing. – http://www.adaptivecomputing.com/

[9] Maeno T. et al. Evolution of the ATLAS PanDA workload management system for exascale computational science // Journal of Physics: Conference Series. – 2014. – Vol. 513. – http://inspirehep.net/record/1302031/

[10] De K., Klimentov A., Oleynik D., Panitkin S., Petrosyan A., Schovancova J., Vaniachine A., Wenaus T. on behalf of the ATLAS Collaboration. Integration of PanDA workload management system with Titan supercomputer at OLCF // Journal of Physics: Conference Series. – 2015. – Vol. 664. – http://iopscience.iop.org/article/10.1088/1742-6596/664/9/092020

[11] Klimentov A. et al. Next generation workload management system for big data on heterogeneous distributed computing // Journal of Physics: Conference Series. – 2015. – Vol. 608. – http://inspirehep.net/record/1372988/

[12] Baranov A.V., Balashov N.A., Kutovskiy N.A., Semenov R.N. JINR cloud infrastructure evolution // Physics of Particles and Nuclei Letters. – 2016. – Vol. 13, Issue 5. – P. 672-675.