Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021



        IHEP TIER-2 COMPUTING CENTER: STATUS AND
                       OPERATION
                          V. Kotliar a, V. Gusev, V. Kukhtenkov
 Institute for High Energy Physics named by A.A. Logunov of National Research Center “Kurchatov
                 Institute”, Nauki Square 1, Protvino, Moscow region, Russia, 142281

                                    E-mail: a viktor.kotliar@ihep.ru


The RU-Protvino-IHEP site is one of the three largest WLCG Tier-2 centers in Russia. Its computing
infrastructure serves the "big four" LHC experiments ATLAS, ALICE, CMS and LHCb, as well as
local IHEP experiments such as OKA, BEC and the radiobiology stands. This work presents the
current status of the computing capacities, the networking and engineering infrastructure, and the
contribution of the grid site to the collaboration experiments.

Keywords: WLCG, GRID, Tier-2



                                                     Viktor Kotliar, Victor Gusev, Vladimir Kukhtenkov



                                                             Copyright © 2021 for this paper by its authors.
                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).








1. Introduction
         The RU-Protvino-IHEP site has participated in the Worldwide LHC Computing Grid [1] for
almost two decades. Starting from a few Pentium III computers running several grid infrastructure
services, IHEP took part in the EGEE I-III projects and became one of the core sites of the Russian
Data Intensive Grid (RDIG). After the network bandwidth was increased to 2x10 Gb/s and the site was
connected to the LHCONE network, the grid site in Protvino became one of the largest Tier-2 sites in
Russia after JINR, with more than 150 computing nodes (~3k CPU cores or ~26k HEP-SPEC06) and
2.5 PB of disk space. The site now serves the big four LHC experiments (ATLAS, ALICE, CMS,
LHCb) and many physics experiments inside the Institute. Sharing the CPUs between grid and
non-grid experiments has allowed the site to achieve effective resource usage in 24x7 mode over the
years.


2. Grid site overview
         The IHEP cluster shares its CPU resources between grid and internal experiments [2]. At the
core of the cluster is a Kerberos 5 single sign-on system, which local users use to access the cluster
resources directly, without any grid services. This means that the batch system on the cluster supports
Kerberos tickets and forwards them to all worker nodes. Since the Andrew File System (AFS) holds
the users' home directories, the cluster must also provide AFS tokens, which are derived from the
Kerberos tickets. The next major service is the Lustre parallel cluster file system, which is shared
across all worker nodes and lets local users perform data analysis as fast as possible. For long-term
storage and for RAW data we use CASTOR2 [3] (the CERN Advanced STORage manager), an HSM
system that has been in use at IHEP for fifteen years; the data are stored on LTO4 and LTO5 tapes in
tape libraries and standalone tape drives. At the core of the cluster a multicomponent cluster
management system (CMS) [4] is used [fig. 1].
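         As a rough illustration of the ticket forwarding described above, the following minimal
Python sketch shows what a job wrapper could check on a worker node before starting the user
payload: that a valid forwarded Kerberos ticket is present and that an AFS token can be derived from
it. This is an assumption about how such a wrapper could look, not the actual IHEP implementation;
it relies only on the standard klist, aklog and tokens command-line tools.

    # Sketch of a worker-node pre-job check (illustrative, not IHEP's wrapper).
    import subprocess
    import sys

    def have_kerberos_ticket() -> bool:
        # 'klist -s' exits with 0 only if a valid (non-expired) ticket cache exists.
        return subprocess.run(["klist", "-s"]).returncode == 0

    def obtain_afs_token() -> bool:
        # 'aklog' derives an AFS token from the current Kerberos ticket;
        # 'tokens' lists the tokens held by the OpenAFS cache manager.
        if subprocess.run(["aklog"]).returncode != 0:
            return False
        out = subprocess.run(["tokens"], capture_output=True, text=True).stdout
        return "afs" in out.lower()

    if __name__ == "__main__":
        if not have_kerberos_ticket():
            sys.exit("no forwarded Kerberos ticket: AFS home directory not reachable")
        if not obtain_afs_token():
            sys.exit("aklog failed: no AFS token")
        print("Kerberos ticket and AFS token in place, user job can start")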




                                   Figure 1. IHEP cluster management system

        All grid services, including ARC CE, BDII, APEL and VO BOX, and all components of the
cluster management system run as virtual machines under Xen hypervisors on a high-availability
clustered system. This technique allows allocating only as many resources as the current setup needs,
while keeping the possibility to adjust them dynamically in the future and ensuring availability.
Several physical servers provide services for the site infrastructure:
       •   network gateway servers that provide NAT (network address translation, or masquerading)
           for the site internal network;
       •   DNS (Domain Name System) servers for the site internal network and for caching DNS
           queries on the cluster;
       •   Squid HTTP proxy servers for caching CVMFS [5] (CERN Virtual Machine File System)
           requests, and dedicated Frontier proxies for the LHC experiments that cache Oracle
           database queries to the central repositories (see the sketch below).
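        The caching chain for software distribution can be verified from any worker node with a short
check like the one below. This is a minimal sketch: the proxy host name is a placeholder rather than
the real IHEP proxy, and the URL simply follows the standard CVMFS layout, in which the signed
manifest .cvmfspublished sits at the root of every repository on a stratum-1 server.

    # Check that a CVMFS repository manifest is reachable through the site squid
    # proxy, much as the CVMFS client itself would fetch it (illustrative only).
    import urllib.request

    PROXY = "http://squid.example.ihep.su:3128"          # placeholder proxy host
    MANIFEST = ("http://cvmfs-stratum-one.cern.ch"
                "/cvmfs/atlas.cern.ch/.cvmfspublished")  # example stratum-1 URL

    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": PROXY}))
    with opener.open(MANIFEST, timeout=10) as resp:
        print("HTTP", resp.status, "-", len(resp.read()), "bytes fetched via site proxy")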
        The grid middleware is UMD4 [6], and the base OS (operating system) on the cluster is 64-bit
CentOS 7. For GPUs we use CUDA 7.5. Ansys 19.0 and Mathematica 8.0 are installed on several
cluster nodes, and 64-bit Intel® Fortran Composer XE is installed on the UI (User Interface). All this
software allows the computing farm at IHEP to be used more flexibly as a multi-purpose computing
cluster.
3. Available resources
     The grid center at IHEP declares storage and computing resources for usage by the WLCG
infrastructure. At the moment these are:
       •   3044 CPU cores, or 26875 HEP-SPEC06;
       •   2183 TB of disk space, split by experiment: ATLAS 1185 TB, CMS 395 TB, ALICE 314 TB,
           LHCb 289 TB;
       •   2x10 Gb/s Internet channels (LHCONE, shared with the Russian Data Intensive Grid);
       •   manpower of three system administrators.
         Each of the four LHC experiments has its own disk resources, while the CPU resources are
shared among them on a fair-share basis. The main storage technologies used on site are dCache and
xrootd, which allow building distributed storage systems for grid computing. Access to the outside
world goes through two 10 Gb/s links connecting IHEP to the LHCONE network for scientific
research. The site is supported by three system administrators with a weekly administration rota; such
a scheme leaves the administrators time for research tasks, programming and learning new
techniques. The site ranks as the third largest grid site in Russia [fig. 2].
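         For orientation, the declared figures correspond to roughly 8.8 HEP-SPEC06 per declared
CPU and to the disk split shown below; the short Python check only re-derives these fractions from
the numbers quoted above.

    # Back-of-the-envelope check of the resources declared in Section 3.
    cpus, hepspec = 3044, 26875                    # CPU cores and total HEP-SPEC06
    disk_tb = {"ATLAS": 1185, "CMS": 395, "ALICE": 314, "LHCb": 289}

    print(f"average power per core: {hepspec / cpus:.1f} HEP-SPEC06")
    total_tb = sum(disk_tb.values())               # 2183 TB in total
    for vo, tb in disk_tb.items():
        print(f"{vo:6s} {tb:5d} TB  ({100 * tb / total_tb:.0f}% of the disk space)")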




             Figure 2. History of the CPU-time contribution of the largest Russian Grid sites


4. Site usage by virtual organizations
         Since the computing cluster at IHEP is shared among the four grid experiments, the following
fair-share scheme is implemented for the CPU resources: ATLAS 52%, CMS 30%, ALICE 12%,
LHCb 6%. It has to be taken into account that the CPUs are also shared with local IHEP experiments
such as BEC and OKA, so the real usage over the last year was ATLAS 40%, CMS 17%, ALICE 13%,
LHCb 30%. The main consumer on the site is the ATLAS experiment, for which RU-Protvino-IHEP
has the status of a Tier-2D alpha site, i.e. a Tier-2 site with availability and accessibility above 95%.
The site contributes 20% of all ATLAS jobs in the RU cloud, and 10% of all RU-cloud data are stored
at IHEP. For CMS the fraction of jobs computed on the site is around 6% of the Russian segment; for
ALICE it is 7% and for LHCb 10%. For LHCb, IHEP also has the Tier-2D (with data) status and is the
only Tier-2 site with such a status in Russia. The IHEP site shows very good stability, with a CPU
usage efficiency of about 94% [fig. 3], which is above the average CPU efficiency of grid sites in
WLCG.
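        The fair-share figures quoted at the beginning of this section can be compared with the
observed usage in a few lines; the sketch below is illustrative arithmetic only, not a scheduler
configuration.

    # Nominal fair shares vs. real usage over the last year (numbers from Section 4).
    nominal  = {"ATLAS": 52, "CMS": 30, "ALICE": 12, "LHCb": 6}   # configured, %
    observed = {"ATLAS": 40, "CMS": 17, "ALICE": 13, "LHCb": 30}  # measured,   %

    for vo in nominal:
        delta = observed[vo] - nominal[vo]
        print(f"{vo:6s} nominal {nominal[vo]:2d}%  used {observed[vo]:2d}%  delta {delta:+3d}%")
    # The deviation is expected: the same CPUs also serve local IHEP experiments
    # (OKA, BEC, ...), so the realised split drifts from the nominal grid-only shares.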




                                 Figure 3. Running jobs and efficiency per year
        To achieve these parameters and to run without downtimes, the following work was performed:
       •   the power connections to the IHEP data center were modified to enable automatic switching
           between input power sources in case of power cuts;
       •   system and software upgrades are made constantly and seamlessly with the help of the
           cluster management system;
       •   the main effort is focused on keeping the current infrastructure available, reliable and
           efficient.
5. Conclusion
         The WLCG grid site at IHEP shows high stability with a large pledge of resources to the LHC
experiments. The work performed and the experience gained in grid computing make the site ready
for LHC Run 3, which is planned for next year.
         IHEP uses leading computing technologies for clusters with batch systems to run jobs. Many
research works and tests are ongoing inside the site infrastructure. Self-healing systems, containers
and high-availability systems and technologies are expected to increase the stability, availability and
interoperability of the computing infrastructure at IHEP, making it ready for further growth of
resources and system complexity to achieve new goals in HEP computing.
References
[1] The Worldwide LHC Computing Grid project [WLCG]. Available at: https://wlcg.web.cern.ch
(accessed 22.09.2021)
[2] Kotliar V. IHEP cluster for Grid and distributed computing // CEUR Workshop Proceedings.
February 2017. Vol. 1787. Pp. 312-316
[3] The CERN Advanced STORage manager [CASTOR]. Available at:
https://castor.web.cern.ch/castor/ (accessed 22.09.2021)
[4] Ezhova V., Kotliar A., Kotliar V., Popova E. Multicomponent cluster management system for the
computing center at IHEP // CEUR Workshop Proceedings. 2018. Vol. 2267. Pp. 37-43
[5] The CernVM File System [CVMFS]. Available at: https://cernvm.cern.ch/fs/ (accessed
22.09.2021)
[6] The Unified Middleware Distribution [UMD]. Available at: https://wiki.egi.eu/wiki/Middleware
(accessed 22.09.2021)


