=Paper=
{{Paper
|id=Vol-3041/49-52-paper-8
|storemode=property
|title=Participation of Russian Institutes in the Processing and Storage of ALICE Data
|pdfUrl=https://ceur-ws.org/Vol-3041/49-52-paper-8.pdf
|volume=Vol-3041
|authors=Andrey Kiryanov,Andrey Zarochentsev
}}
==Participation of Russian Institutes in the Processing and Storage of ALICE Data==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

A.K. Kiryanov¹, A.K. Zarochentsev²

¹ Petersburg Nuclear Physics Institute of NRC "Kurchatov Institute", 1 Orlova Roscha, Gatchina, 188300, Russia
² Saint Petersburg State University, 7-9 Universitetskaya emb., Saint Petersburg, 199034, Russia

E-mail: andrey.zar@gmail.com

This report presents the results of the work of Russian institutes in the processing of ALICE experiment data during the last three years of operation of the Large Hadron Collider (LHC), including the end of LHC Run 2 and the first year of the COVID-19 pandemic. The main problems and tasks facing both ALICE Grid Computing and its Russian segment before LHC Run 3, including the problems of support and modernization of existing resources, are considered. Plans for the preparation for LHC operation in the High Luminosity (HL) mode are also presented.

Keywords: GRID, ALICE, CERN, LHC, WLCG

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1. Challenges===

In 2018 the second stage of the LHC, Run 2, was completed and the Long Shutdown began: a long break and preparation for the third stage, Run 3 [1]. In addition to upgrading the experimental installations themselves, significant improvements to the data storage and data processing systems were also required. Many central services and systems have undergone changes, and in this paper we discuss how these changes affected the Russian Tier 1 and Tier 2 sites for ALICE.

By the end of 2018, RDIG (Russian Data Intensive Grid) [10] provided the following sites for ALICE [fig. 1]:

● Tier 1: RRC-KI-T1
● Tier 2: RRC-KI-T2, JINR, IHEP, ITEP, PNPI, SPbSU, Troitsk, SARFTI (the SARFTI site was commissioned for ALICE production earlier in 2018, but was not yet officially registered in WLCG)

Figure 1. ALICE RDIG sites in 2018 (from http://alimonitor.cern.ch/map.jsp)

At the end of 2018 the coordinators of the Russian sites were facing the following major challenges, which needed to be addressed in order to continue supporting the ALICE experiment in Russia:

● Complete the commissioning of the SARFTI site for WLCG
● Ensure the connection of Russian sites to the LHCONE network [4] via the IPv6 protocol
● Return the MEPhI site back to production
● Move away from the unsupported CentOS 6 and CREAM CE [2][3], replacing them with CentOS 7 and one of the supported CE flavours (ARC CE [6] or HTCondor CE [5])

The last item in this list required a complete reinstallation and reconfiguration of the majority of the sites' computing resources.

===2. Work done===

To address the last task, the change of the computing resource manager, an analysis of the available options for the new manager was conducted, specifically a comparison between ARC 6 and HTCondor. After evaluating both options, ARC 6 was selected as the recommended option for RDIG sites because of its flexibility and its ability to work with various cluster-level resource managers, including Torque [7], which is very popular among RDIG sites. In 2020 the first ARC 6 test installation was carried out on the SPbSU site.
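As an illustration of the kind of functional test used to validate such an installation, the sketch below submits a trivial job through the ARC client. This is a hypothetical example, not part of the original report: the CE endpoint is a placeholder, the XRSL is a minimal job description, and the `-C` (computing element) flag of `arcsub` is assumed here and may differ between ARC client versions.

```python
import subprocess
import tempfile

# Minimal XRSL job description: run /bin/hostname on a worker node and
# request 2000 MB of memory. Attribute values are illustrative only.
XRSL = """&(executable="/bin/hostname")
 (jobname="arc6-functional-test")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
 (memory="2000")"""

def submit_test_job(ce_endpoint: str) -> int:
    """Submit a trivial job with arcsub and return its exit code."""
    with tempfile.NamedTemporaryFile("w", suffix=".xrsl", delete=False) as f:
        f.write(XRSL)
        job_file = f.name
    # The "-C" target flag is an assumption about the ARC 6 client CLI;
    # consult the site installation documentation [8] for exact usage.
    result = subprocess.run(["arcsub", "-C", ce_endpoint, job_file],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)
    return result.returncode

if __name__ == "__main__":
    submit_test_job("ce.example.org")  # placeholder CE endpoint
```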
The installation was registered and tested for the OPS virtual organization and later, after successful functional testing, for the ALICE virtual organization as well. Documentation covering in detail the installation and configuration procedures for ALICE Grid sites was created [8], describing a number of common problems and proposed solutions, including the code changes necessary to support new versions of the Torque resource manager [9]. This patch was integrated into the ARC 6 codebase in October 2021. Another problem, identified only on production workloads, was an incorrect interpretation of the virtual memory requirement from the XRSL script, which was translated into a physical memory requirement in the Torque script, causing submission errors due to an apparent lack of resources. The corresponding patch was integrated into the ARC 6 codebase earlier in 2021. Some other patches, related to the publishing of resource and queue information to the LDAP information system, are still being tested on RDIG sites.

The transition of RDIG computing resources to the new operating system did not cause substantial technical problems, but rather organizational ones: some sites could not start the process for a long time. On a number of sites the OS upgrade on storage systems was combined with equipment upgrades, in some cases due to obsolescence of the old equipment, in other cases because of necessary maintenance such as hard disk replacement. In this situation it was necessary to plan and act carefully to avoid any damage to or loss of the precious experiment data on the storage systems. Initially, one of the disk servers containing the data files was drained, with all of its files moved to the other servers. After the drain the server was replaced with a new one, with the old disks replaced and the OS upgraded. The same procedure was then repeated with the other servers. In the case of the JINR site, with 8 servers filled to 65%, draining a single server took up to several days and the overall migration took about a month. In the case of the SARFTI site, with 4 servers, draining a single server took several times longer because of hardware problems. Given that the drained servers already had failing disks and a damaged RAID array, it was decided that the time required for migration did not justify saving the existing files. At the same time there were still sites (e.g. PNPI) with an outdated pure XRootD storage system, whose migration to a new storage system will require additional work in the future.

In 2021 the SPbSU site, the last of the Russian sites working for ALICE, was finally connected to the LHCONE network with IPv6 protocol support. This required both the installation of new network equipment and the reconfiguration of existing equipment, because up to that point the SPbSU network did not support the IPv6 protocol. As a result, all ALICE RDIG sites are now connected to the LHCONE network with IPv6 support.
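The IPv6 requirement can be verified with a simple connectivity probe. The sketch below is an illustration added for clarity, not part of the original toolchain; the host name is a placeholder and 1094 is the default XRootD port.

```python
import socket

def check_ipv6_reachability(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if host resolves to an IPv6 address that accepts TCP connections."""
    try:
        # Restrict resolution to AAAA records (IPv6 only).
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record: the service is not published over IPv6
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True  # at least one IPv6 address accepts connections
        except OSError:
            continue
    return False

if __name__ == "__main__":
    # Placeholder storage endpoint, not an actual RDIG host name.
    print(check_ipv6_reachability("se.example.org", 1094))
```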
As already mentioned, the plans for 2018 included the return of the MEPhI site to production, but the COVID-19 pandemic that began in 2020 strongly changed many work plans. In a number of institutions, physical access to the servers was limited for a long time, as was the number of employees able to carry out the necessary work. As a direct consequence, the MEPhI site was not returned to production; moreover, one of the sites that was working in 2018, RRC-KI-T2, did not manage to promptly upgrade its worker nodes to CentOS 7 despite the ALICE requirements and was excluded from the processing of ALICE experimental data. As a result, the resources pledged by Russia for 2022 decreased in comparison with previous years, despite the fact that some sites, including JINR, managed to increase their resources for ALICE [tab. 1]. Also, as a consequence of the pandemic and the inability to service resources in a timely manner, the availability and reliability of RDIG sites decreased as well [fig. 2].

Figure 2. Availability/reliability of RDIG sites, June-October 2021 (from https://argo.egi.eu/)

===3. Conclusions===

Looking at the results of the last three years, of which the last year and a half passed under the COVID-19 pandemic, it can be said that most of the RDIG sites fulfilled their obligations to the ALICE experiment in terms of computing and storage resources and software updates, but there is still quite a lot of room for improvement.

Table 1. Pledged RDIG resources for ALICE, years 2020-2022

{| class="wikitable"
|-
! rowspan="2" | Site !! colspan="2" | 2020 !! colspan="2" | 2021 !! colspan="2" | 2022
|-
! Disk (TB) !! CPU (HEPSPEC) !! Disk (TB) !! CPU (HEPSPEC) !! Disk (TB) !! CPU (HEPSPEC)
|-
| JINR || 1200 || 12000 || 2000 || 13500 || 2300 || 15525
|-
| NRC KI || 316 || 4488 || 316 || 4488 || 0 || 0
|-
| IHEP || 297 || 2631 || 314 || 3017 || 314 || 3017
|-
| ITEP || 180 || 2700 || 180 || 2700 || 180 || 2700
|-
| PNPI || 168 || 2640 || 168 || 2640 || 168 || 2640
|-
| INR RAS || 113 || 641 || 113 || 641 || 113 || 641
|-
| SPbSU || 158 || 3696 || 158 || 3696 || 158 || 3696
|-
| SARFTI || 210 || 7466 || 210 || 7466 || 210 || 7466
|-
| SUM || 2642 || 36262 || 3459 || 38148 || 3443 || 35685
|}

===4. Acknowledgements===

This work is supported by SPbSU grant ID 75252518.

===References===

[1] https://lhc-commissioning.web.cern.ch/schedule/LHC-long-term.htm
[2] https://wlcg-ops.web.cern.ch/cream-ce
[3] https://twiki.cern.ch/twiki/bin/view/LCG/CreamMigrationTaskForce
[4] https://lhcone.web.cern.ch/
[5] https://research.cs.wisc.edu/htcondor/
[6] http://www.nordugrid.org/arc/arc6/
[7] https://support.adaptivecomputing.com/torque-resource-manager-documentation/
[8] https://twiki.cern.ch/twiki/bin/view/Main/ARCSiteInstallationForALICEAndWLCG
[9] https://github.com/nordugrid/arc/pull/9
[10] http://grid.jinr.ru/?page_id=79