=Paper=
{{Paper
|id=Vol-3041/49-52-paper-8
|storemode=property
|title=Participation of Russian Institutes in the Processing and Storage of ALICE Data
|pdfUrl=https://ceur-ws.org/Vol-3041/49-52-paper-8.pdf
|volume=Vol-3041
|authors=Andrey Kiryanov,Andrey Zarochentsev
}}
==Participation of Russian Institutes in the Processing and Storage of ALICE Data==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

A.K. Kiryanov¹, A.K. Zarochentsev²

¹ Petersburg Nuclear Physics Institute of NRC "Kurchatov Institute", 1 Orlova Roscha, Gatchina, 188300, Russia
² Saint Petersburg State University, 7-9 Universitetskaya emb., Saint Petersburg, 199034, Russia

E-mail: andrey.zar@gmail.com

This report presents the results of the work of Russian institutes in the processing of ALICE experiment data during the last three years of operation of the Large Hadron Collider (LHC), including the end of LHC Run 2 and the first year of the COVID-19 pandemic. The main problems and tasks facing both ALICE Grid Computing and its Russian segment before LHC Run 3, including the problems of support and modernization of existing resources, are considered. Plans for the preparation for LHC operation in the High Luminosity (HL) mode are also presented.

Keywords: GRID, ALICE, CERN, LHC, WLCG

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1. Challenges===

In 2018 the second stage of the LHC, Run 2, was completed and the Long Shutdown began: a long break and preparation for the third stage, Run 3 [1]. In addition to upgrading the experimental installations themselves, significant improvements to the data storage and data processing systems were also required. Many central services and systems have undergone changes, and in this paper we discuss how these changes affected the Russian Tier 1 and Tier 2 sites for ALICE.

By the end of 2018, RDIG (Russian Data Intensive Grid) [10] provided the following sites for ALICE [fig. 1]:

● Tier 1: RRC-KI-T1
● Tier 2: RRC-KI-T2, JINR, IHEP, ITEP, PNPI, SPbSU, Troitsk, SARFTI (the SARFTI site was commissioned for ALICE production earlier in 2018, but was not yet officially registered in WLCG)

Figure 1. ALICE RDIG sites in 2018 (from http://alimonitor.cern.ch/map.jsp)

At the end of 2018 the coordinators of the Russian sites were facing the following major challenges, which needed to be addressed in order to continue supporting the ALICE experiment in Russia:

● Complete the commissioning of the SARFTI site for WLCG
● Ensure the connection of Russian sites to the LHCONE network [4] via the IPv6 protocol
● Return the MEPhI site back to production
● Move away from the unsupported CentOS 6 and CREAM CE [2][3], replacing them with CentOS 7 and one of the supported CE flavours (ARC CE [6] or HTCondor CE [5])

The last item in this list required a complete reinstallation and reconfiguration of the majority of the sites' computing resources.

===2. Work done===

To address the last task, the change of the computing resource manager, an analysis of the available options for the new manager was conducted, specifically a comparison between ARC 6 and HTCondor. After evaluating both options, ARC 6 was selected as the recommended option for RDIG sites because of its flexibility and its ability to work with various cluster-level resource managers, including Torque [7], which is very popular among RDIG sites. In 2020 the first ARC 6 test installation was carried out on the SPbSU site.
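As an illustration of the kind of functional test used to validate such an installation, the sketch below submits a trivial job through the ARC client. This is a hypothetical example, not part of the original report: the CE endpoint is a placeholder, the XRSL is a minimal job description, and the `-C` (computing element) flag of `arcsub` is assumed here and may differ between ARC client versions.

```python
import subprocess
import tempfile

# Minimal XRSL job description: run /bin/hostname on a worker node and
# request 2000 MB of memory. Attribute values are illustrative only.
XRSL = """&(executable="/bin/hostname")
 (jobname="arc6-functional-test")
 (stdout="stdout.txt")
 (stderr="stderr.txt")
 (memory="2000")"""

def submit_test_job(ce_endpoint: str) -> int:
    """Submit a trivial job with arcsub and return its exit code."""
    with tempfile.NamedTemporaryFile("w", suffix=".xrsl", delete=False) as f:
        f.write(XRSL)
        job_file = f.name
    # The "-C" target flag is an assumption about the ARC 6 client CLI;
    # consult the site installation documentation [8] for exact usage.
    result = subprocess.run(["arcsub", "-C", ce_endpoint, job_file],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)
    return result.returncode

if __name__ == "__main__":
    submit_test_job("ce.example.org")  # placeholder CE endpoint
```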
The installation was registered and tested for the OPS virtual organization and later, after successful functional testing, for the ALICE virtual organization as well. Documentation covering in detail the installation and configuration procedures for ALICE Grid sites was created [8], describing a number of common problems and proposed solutions, including the code changes necessary to support new versions of the Torque resource manager [9]. This patch was integrated into the ARC 6 codebase in October 2021. Another problem, identified only on production workloads, was an incorrect interpretation of the virtual memory requirement from the XRSL script, which was translated into a physical memory requirement in the Torque script, causing submission errors due to an apparent lack of resources. The corresponding patch was integrated into the ARC 6 codebase earlier in 2021. Some other patches, related to the publishing of resource and queue information to the LDAP information system, are still being tested on RDIG sites.

The transition of RDIG computing resources to the new operating system did not cause substantial technical problems, but rather organizational ones: some sites could not start the process for a long time. On a number of sites the OS upgrade on storage systems was combined with equipment upgrades, in some cases due to obsolescence of the old equipment, in other cases because of necessary maintenance such as hard disk replacement. In this situation it was necessary to plan and act carefully to avoid any damage to or loss of the precious experiment data on the storage systems. Initially, one of the disk servers containing the data files was drained, with all of its files moved to the other servers. After the drain the server was replaced with a new one, with the old disks replaced and the OS upgraded. The same procedure was then repeated with the other servers. In the case of the JINR site, with 8 servers filled to 65%, draining a single server took up to several days and the overall migration took about a month. In the case of the SARFTI site, with 4 servers, draining a single server took several times longer because of hardware problems. Given that the drained servers already had failing disks and a damaged RAID array, it was decided that the time required for migration did not justify saving the existing files. At the same time there were still sites (e.g. PNPI) with an outdated pure XRootD storage system, whose migration to a new storage system will require additional work in the future.

In 2021 the SPbSU site, the last of the Russian sites working for ALICE, was finally connected to the LHCONE network with IPv6 protocol support. This required both the installation of new network equipment and the reconfiguration of existing equipment, because up to that point the SPbSU network did not support the IPv6 protocol. As a result, all ALICE RDIG sites are now connected to the LHCONE network with IPv6 support.
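The IPv6 requirement can be verified with a simple connectivity probe. The sketch below is an illustration added for clarity, not part of the original toolchain; the host name is a placeholder and 1094 is the default XRootD port.

```python
import socket

def check_ipv6_reachability(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if host resolves to an IPv6 address that accepts TCP connections."""
    try:
        # Restrict resolution to AAAA records (IPv6 only).
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record: the service is not published over IPv6
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True  # at least one IPv6 address accepts connections
        except OSError:
            continue
    return False

if __name__ == "__main__":
    # Placeholder storage endpoint, not an actual RDIG host name.
    print(check_ipv6_reachability("se.example.org", 1094))
```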
As already mentioned, the plans for 2018 included the return of the MEPhI site to production, but the COVID-19 pandemic that began in 2020 strongly changed many work plans. In a number of institutions, physical access to the servers was limited for a long time, as was the number of employees able to carry out the necessary work. As a direct consequence, the MEPhI site was not returned to production; moreover, one of the sites that was working in 2018, RRC-KI-T2, did not manage to promptly upgrade its worker nodes to CentOS 7 despite the ALICE requirements and was excluded from the processing of ALICE experimental data. As a result, the resources pledged by Russia for 2022 decreased in comparison with previous years, despite the fact that some sites, including JINR, managed to increase their resources for ALICE [tab. 1]. Also, as a consequence of the pandemic and the inability to service resources in a timely manner, the availability and reliability of RDIG sites decreased as well [fig. 2].

Figure 2. Availability/reliability of RDIG sites, June-October 2021 (from https://argo.egi.eu/)

===3. Conclusions===

Looking at the results of the last three years, of which the last year and a half passed under the COVID-19 pandemic, it can be said that most of the RDIG sites fulfilled their obligations to the ALICE experiment in terms of computing and storage resources and software updates, but there is still quite a lot of room for improvement.

Table 1. Pledged RDIG resources for ALICE, years 2020-2022

{| class="wikitable"
|-
! rowspan="2" | Site !! colspan="2" | 2020 !! colspan="2" | 2021 !! colspan="2" | 2022
|-
! Disk (TB) !! CPU (HEPSPEC) !! Disk (TB) !! CPU (HEPSPEC) !! Disk (TB) !! CPU (HEPSPEC)
|-
| JINR || 1200 || 12000 || 2000 || 13500 || 2300 || 15525
|-
| NRC KI || 316 || 4488 || 316 || 4488 || 0 || 0
|-
| IHEP || 297 || 2631 || 314 || 3017 || 314 || 3017
|-
| ITEP || 180 || 2700 || 180 || 2700 || 180 || 2700
|-
| PNPI || 168 || 2640 || 168 || 2640 || 168 || 2640
|-
| INR RAS || 113 || 641 || 113 || 641 || 113 || 641
|-
| SPbSU || 158 || 3696 || 158 || 3696 || 158 || 3696
|-
| SARFTI || 210 || 7466 || 210 || 7466 || 210 || 7466
|-
| SUM || 2642 || 36262 || 3459 || 38148 || 3443 || 35685
|}

===4. Acknowledgements===

This work is supported by SPbSU grant ID 75252518.

===References===

[1] https://lhc-commissioning.web.cern.ch/schedule/LHC-long-term.htm
[2] https://wlcg-ops.web.cern.ch/cream-ce
[3] https://twiki.cern.ch/twiki/bin/view/LCG/CreamMigrationTaskForce
[4] https://lhcone.web.cern.ch/
[5] https://research.cs.wisc.edu/htcondor/
[6] http://www.nordugrid.org/arc/arc6/
[7] https://support.adaptivecomputing.com/torque-resource-manager-documentation/
[8] https://twiki.cern.ch/twiki/bin/view/Main/ARCSiteInstallationForALICEAndWLCG
[9] https://github.com/nordugrid/arc/pull/9
[10] http://grid.jinr.ru/?page_id=79