=Paper=
{{Paper
|id=Vol-3041/285-290-paper-53
|storemode=property
|title=JINR WLCG Tier1 & Tier2/CICC Accounting System
|pdfUrl=https://ceur-ws.org/Vol-3041/285-290-paper-53.pdf
|volume=Vol-3041
|authors=Ivan Kashunin,Valery Mitsyn,Tatiana Strizh
}}
==JINR WLCG Tier1 & Tier2/CICC Accounting System==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

I.A. Kashunin, V.V. Mitsyn, T.A. Strizh

Meshcheryakov Laboratory of Information Technologies, Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna, Moscow region, 141980, Russia

E-mail: miramir@jinr.ru

The problem of evaluating the efficiency of the JINR MLIT grid sites has always been topical. At the beginning of 2021, a new accounting system was created that fully covers the functionality of the previous system and further extends it. This article provides detailed information on the implemented accounting system.

Keywords: Accounting, WLCG, Grid, Monitoring

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1. Original accounting system===

At the beginning of 2001, MLIT JINR started creating a distributed system for processing, storing and analyzing experimental data from the experiments at the Large Hadron Collider (LHC) using grid technologies [1]. In 2003, the Russian segment of the LCG global infrastructure was organized under the Russian Data Intensive Grid consortium (RDIG) [2], and the Tier2 grid site for data processing within the distributed computing infrastructure began functioning at JINR. The main software of the computing cluster comprised the Linux operating system, the Torque batch processing system [3], the Maui task scheduler [4] and the software stack that ensures consistent work within the distributed grid infrastructure.

The Torque and Maui systems were further developed at MLIT and, by default, included a built-in script for collecting statistics from the batch system. The script was run with various parameters at scheduled times and generated text files (Fig. 1) containing information about the operation of the system. The collected data was used to evaluate the efficiency of the jobs executed on the grid site, to account for them by user and to compile various reports. Jobs were grouped by their affiliation with the experiments and virtual organizations included in the RDIG whose data were processed on the site (lalice for the ALICE experiment at the LHC, lcms for the CMS experiment at the LHC, etc.).

Figure 1. Original accounting system

===2. Prerequisites for the development of a new system===

In mid-2020, due to the obsolescence of and lack of support for Torque and Maui, it was decided to switch to SLURM [5], a cluster management and job scheduling system for large and small Linux clusters. Unlike Torque and Maui, SLURM does not display a number of parameters by default:

● CPUclock – the CPU time spent on job execution over a given period of time;
● Wallclock – the total astronomical (wall-clock) time spent by jobs over a given period of time;
● the efficiency with which jobs use the cluster computing resources.

These parameters can, however, be calculated from the database (DB) that the SLURM system populates during its operation. To do this, a special script had to be written and an in-house accounting system created that provides the necessary parameters on request.

===3. Development of an accounting system===

At the end of 2020, the SLURM system was put into operation, and the development of an accounting system started at the same time. The initial task was to develop a console version that would display data for a given period of time in a tabular format on the screen. As with the original accounting system, this data can be stored in the form of reports. The main parameters that the accounting system should display for a given period of time are:

● the number of jobs grouped by affiliation with an experiment/group;
● the CPUclock of the grouped jobs;
● the Wallclock of the grouped jobs;
● the average number of cores used by a job in a group;
● the efficiency with which a job group uses the cluster computing resources.

CPUclock and Wallclock have direct analogs in SLURM, namely total_cpu_time and elapsed_time, respectively. SLURM provides a special command, sacct, for displaying various system parameters on the screen. Its output is in a tabular format, where the required parameters are listed for each job. Owing to this, the CPUclock of a job group can be calculated as the sum of total_cpu_time over its jobs for a given period of time. Wallclock is calculated in the same way, only from elapsed_time. The efficiency with which a job group uses the cluster computing resources can be calculated as the percentage ratio of its CPUclock to the core-time reserved by its jobs, i.e. to the sum over the jobs of the allocated core count multiplied by elapsed_time. This parameter is especially useful for detecting cases when a job reserves cores without using all the processor power; thus, poorly optimized code can be identified. Batch queues can also comprise jobs that require different numbers of cores for computing, and the script of the accounting system was developed taking this feature into account, as the sketch below illustrates.
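The following minimal sketch (not the authors' production script) illustrates this calculation in Python. It assumes sacct's standard output columns Account, AllocCPUS, Elapsed and TotalCPU as the concrete counterparts of the values discussed above and aggregates completed jobs per account:

<syntaxhighlight lang="python">
#!/usr/bin/env python3
"""Sketch of a per-account SLURM accounting report (illustrative only)."""
import subprocess
from collections import defaultdict


def to_seconds(t: str) -> float:
    """Parse sacct time strings of the form [DD-][HH:]MM:SS[.fff]."""
    days, _, clock = t.rpartition("-")
    parts = [float(p) for p in clock.split(":")]
    while len(parts) < 3:                    # pad MM:SS forms up to H:M:S
        parts.insert(0, 0.0)
    h, m, s = parts
    return (float(days or 0) * 24 + h) * 3600 + m * 60 + s


def collect_stats(start: str, end: str) -> dict:
    """Aggregate completed job allocations per account for [start, end)."""
    out = subprocess.run(
        ["sacct", "-a", "-X", "--noheader", "--parsable2",
         "--state=COMPLETED", "-S", start, "-E", end,
         "--format=Account,AllocCPUS,Elapsed,TotalCPU"],
        capture_output=True, text=True, check=True).stdout
    stats = defaultdict(lambda: {"jobs": 0, "cpu": 0.0,
                                 "wall": 0.0, "core_wall": 0.0})
    for line in out.splitlines():
        account, cores, elapsed, total_cpu = line.split("|")
        g = stats[account]
        wall = to_seconds(elapsed)
        g["jobs"] += 1
        g["cpu"] += to_seconds(total_cpu)    # CPUclock contribution
        g["wall"] += wall                    # Wallclock contribution
        g["core_wall"] += int(cores) * wall  # reserved core-seconds
    return stats


if __name__ == "__main__":
    print(f"{'account':12} {'jobs':>7} {'CPUclock,h':>11} "
          f"{'Wallclock,h':>12} {'avg cores':>10} {'eff,%':>6}")
    for account, g in sorted(collect_stats("2021-07-01", "2021-07-09").items()):
        eff = 100 * g["cpu"] / g["core_wall"] if g["core_wall"] else 0.0
        avg_cores = g["core_wall"] / g["wall"] if g["wall"] else 0.0
        print(f"{account:12} {g['jobs']:7d} {g['cpu'] / 3600:11.1f} "
              f"{g['wall'] / 3600:12.1f} {avg_cores:10.1f} {eff:6.1f}")
</syntaxhighlight>

Here the average core count is weighted by elapsed time and efficiency is CPUclock divided by the reserved core-time; both follow the definitions in the text rather than any published implementation detail.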
As a result, an accounting system script was created that displays the main parameters on the screen (Fig. 2).

Figure 2. Console view of the accounting system

The next task was to develop a visualization system. Various software options for data visualization are currently available. Since 2014, a monitoring system [6] has been operating at the Multifunctional Information and Computing Complex (MICC) [7], and one of its components is the Grafana visualization system [8]. It was decided to build the visualization system on this basis, which simplified its integration into the existing monitoring system. To display statistics, Grafana receives data from a specific resource, i.e. a backend (a database, software, a socket, etc.); one such backend is the MariaDB database [9]. To connect the accounting system with Grafana, support for writing data to the database was added to the accounting system script. The script is launched by the operating system through the standard cron service [10] and writes the parameters for the day, week, month and year to the database; a sketch of this step follows.
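A minimal sketch of this writer step is given below. The crontab entry, table layout (job_stats) and connector (pymysql) are assumptions for illustration and are not taken from the paper; collect_stats is the aggregation sketch above, assumed to be saved as slurm_accounting.py:

<syntaxhighlight lang="python">
#!/usr/bin/env python3
"""Sketch of the cron-driven writer that feeds Grafana via MariaDB.

Assumed crontab entry (runs nightly for the daily window):
    5 0 * * * /opt/accounting/write_stats.py day
"""
import sys
import datetime

import pymysql  # assumed connector; MariaDB Connector/Python would also work

from slurm_accounting import collect_stats  # the sacct sketch above

WINDOW_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def write_window(period: str) -> None:
    end = datetime.date.today()
    start = end - datetime.timedelta(days=WINDOW_DAYS[period])
    stats = collect_stats(str(start), str(end))
    conn = pymysql.connect(host="localhost", user="accounting",
                           password="***", database="accounting")
    try:
        with conn.cursor() as cur:
            for account, g in stats.items():
                eff = 100 * g["cpu"] / g["core_wall"] if g["core_wall"] else 0
                # job_stats is a hypothetical table read by the Grafana panels
                cur.execute(
                    "REPLACE INTO job_stats (period, stat_date, account, jobs,"
                    " cpuclock_h, wallclock_h, efficiency)"
                    " VALUES (%s, %s, %s, %s, %s, %s, %s)",
                    (period, end, account, g["jobs"],
                     g["cpu"] / 3600, g["wall"] / 3600, eff))
        conn.commit()
    finally:
        conn.close()


if __name__ == "__main__":
    write_window(sys.argv[1] if len(sys.argv) > 1 else "day")
</syntaxhighlight>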
===4. Visualization system===

An interactive visualization system was created on the basis of Grafana; it provides the user with up-to-date analytical data:

● graphs and pie charts of CPUclock and Wallclock for a given period of time;
● graphs of the use of the cluster computing resources by jobs;
● a graph of the number of completed jobs;
● a tabular version of the displayed data.

As a result, an information display (Fig. 3) was created.

Figure 3. Graphical representation of the accounting system

To display all the information in the visualization system, panels that generate special menus were added.

● Account. Activating the panel displays a drop-down list of the names of the user accounts/user groups on behalf of which jobs are launched. When a specific account is selected, all information on it is displayed in accordance with the specified parameter. The account value can be set to 'total' to display general information for the entire computing cluster.

● Time_period. All calculations of the accounting system take into account only completed jobs. The longer the time interval over which completed jobs are accounted, the more accurate the report data.

● Clock_type. It switches the type of the displayed parameters between Wallclock, CPUclock and their versions translated to HEPSPEC06 [11] (see the sketch after this list). After changing this parameter, the top-most graph and the pie charts change according to the selection.

● Data – time interval. The dates of both the beginning and the end of the data display can be set. Changing the date affects the display of all graphs and tables.

Thus, by changing various parameters, the set of data required for display can be configured.
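The HEPSPEC06 normalization behind the Clock_type switch can be illustrated with the short sketch below; the per-core rating is a placeholder, since the actual benchmark figures of the Tier1/Tier2 worker nodes are not given in the paper:

<syntaxhighlight lang="python">
# Illustration of the Clock_type normalization: core-hours are weighted
# by the per-core HEPSPEC06 rating of the worker nodes [11].
HS06_PER_CORE = 10.0  # placeholder rating, not a JINR benchmark figure


def hs06_hours(core_hours: float) -> float:
    """Convert CPUclock or Wallclock core-hours into HS06-hours."""
    return core_hours * HS06_PER_CORE


# e.g. 1200 Wallclock core-hours -> 1200 * 10.0 = 12000 HS06-hours
print(hs06_hours(1200.0))
</syntaxhighlight>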
===5. Conclusion===

In February 2021, various tests and the verification of the functionality of the accounting system, including data cross-checks with the EGI Federation accounting portal [12], were performed. The test results showed that the difference between the calculations of the two systems was 0.4%, which can be attributed to different time frames and calculation algorithms. In March 2021, the accounting system was integrated into the general monitoring system of the MICC. Access to the pages of the accounting system for the WLCG (Worldwide LHC Computing Grid) [13] sites Tier1 and Tier2 of MLIT JINR is provided via the main page of the Litmon monitoring system (Fig. 4). As a result, it became possible to back up the readings of the other accounting systems, as well as to display JINR-specific tasks that are performed on the sites, including tasks within the NICA project.

The Tier1/Tier2 accounting system was put into operation. It completely covers the functionality of the original system and significantly expands it owing to the flexible configuration of the visualization system. The data collected by the original accounting system since 2018 was imported into the visualization system.

Figure 4. Main page of the Litmon monitoring system

===References===

[1] V. Korenkov, E. Tikhonenko. GRID Concept and Computer Technologies in the LHC Era // Physics of Elementary Particles and Atomic Nuclei (PEPAN), vol. 32, p. 6, 2001, in Russian.

[2] A. Soldatov, V. Korenkov, V. Ilyin. Russian Segment of the LCG Global Infrastructure // Open Systems, N1, 2003, in Russian.

[3] Torque. Available at: http://dipc.ehu.es/cc/computing_resources/jobs/batch_systems/torque/ (accessed 15.07.2021).

[4] Maui. Available at: http://dipc.ehu.es/cc/computing_resources/jobs/batch_systems/torque/ (accessed 15.07.2021).

[5] SLURM. Available at: https://slurm.schedmd.com/documentation.html (accessed 15.07.2021).

[6] I. Kashunin, V. Mitsyn, V. Trofimov, A. Dolbilov. Integration of the Cluster Monitoring System Based on Icinga2 at JINR LIT MICC // PEPAN Letters, 2020. Available at: http://www1.jinr.ru/Pepan_letters/panl_2020_3/14_kashunin.pdf.

[7] A. Dolbilov et al. Multifunctional Information and Computing Complex of JINR: Status and Perspectives // CEUR Workshop Proc. 2019. V. 2507. P. 16-22.

[8] Grafana. Available at: https://grafana.com/ (accessed 15.07.2021).

[9] MariaDB. Available at: https://mariadb.org/ (accessed 15.07.2021).

[10] Cron. Available at: https://ru.wikipedia.org/wiki/Cron (accessed 15.07.2021).

[11] HEPSPEC06. Available at: https://www.gridpp.ac.uk/wiki/HEPSPEC06 (accessed 15.07.2021).

[12] EGI. Available at: https://accounting.egi.eu/ (accessed 15.07.2021).

[13] Worldwide LHC Computing Grid. Available at: https://wlcg.web.cern.ch/ (accessed 15.07.2021).