Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

CURRENT STATUS OF THE MICC: AN OVERVIEW

A. Baginyan, A. Balandin, N. Balashov, A. Dolbilov, A. Gavrish, A. Golunov, N. Gromova, I. Kashunin, V. Korenkov, N. Kutovskiy, V. Mitsyn, I. Pelevanyuk, D. Podgainy, O. Streltsova, T. Strizha, V. Trofimov, A. Vorontsov, N. Voytishin, M. Zuev

Meshcheryakov Laboratory of Information Technologies, Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna, Moscow region, 141980, Russia

E-mail: strizh@jinr.ru

The Multifunctional Information and Computing Complex (MICC) of the Joint Institute for Nuclear Research (JINR) comprises the Tier1 center, which supports the NICA experiments and the CMS experiment at the LHC; the Tier2 center, which supports all LHC experiments, the NICA experiments and other HEP experiments with JINR's participation; cloud computing for neutrino physics experiments (Baikal-GVD, JUNO, DANSS, etc.) and for the JINR Member States' clouds; and the "Govorun" supercomputer, which serves all JINR programs, including the NICA experiments and research in the field of machine learning and quantum computing. This activity is aimed at the further development of the network, information and computing infrastructure of JINR for the research and production activities of the Institute and its Member States on the basis of state-of-the-art information technologies, in accordance with the JINR Seven-Year Plan of development for 2017-2023. This paper describes the current state of the MICC.

Keywords: MICC, NICA, WLCG, grid, Tier1, Tier2, cloud, HPC, distributed computing

Andrey Baginyan, Anton Balandin, Nikita Balashov, Andrey Dolbilov, Andrey Gavrish, Alexey Golunov, Natalia Gromova, Ivan Kashunin, Vladimir Korenkov, Nikolay Kutovskiy, Valery Mitsyn, Igor Pelevanyuk, Dmitry Podgainy, Oxana Streltsova, Tatiana Strizh, Vladimir Trofimov, Alexey Vorontsov, Nikolay Voytishin, Maxim Zuev

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Starting from 2017, the computing facilities of the JINR Meshcheryakov Laboratory of Information Technologies (MLIT) have operated within the project "Multifunctional Information and Computing Complex" (MICC) [1]. The main aim of the project is the further development of the network, information and computing infrastructure of JINR for the research activities of the Institute and its Member States on the basis of state-of-the-art information technologies, in accordance with the JINR Seven-Year Plan of development for 2017-2023. The MICC is a unique basic facility of JINR and plays a decisive role in scientific research that requires modern computing power and storage systems. The uniqueness of the MICC is ensured by the combination of all modern information technologies: a network infrastructure with a bandwidth of 2x100 Gbit/s to 4x100 Gbit/s, a distributed data processing and storage system based on grid technologies and cloud computing, and a hyperconverged liquid-cooled high-performance computing infrastructure for supercomputer applications.
Multifunctionality, high reliability and availability in a 24x7x365 mode, scalability and high performance, a reliable data storage system, information security and an advanced software environment are the main requirements that the MICC meets. The JINR computing infrastructure includes (fig. 1):
● the IT ecosystem for the experiments of the NICA [2] project (BM@N, MPD, SPD), which, thanks to grid technologies (DIRAC Interware [3]), embraces all computing components and storage systems;
● the Tier1 grid site for the CMS experiment at the LHC [4];
● the Tier2 grid site, which supports the experiments at the LHC (ATLAS, ALICE, CMS), FAIR (CBM, PANDA) and other large-scale experiments, as well as users of the JINR Laboratories and participating countries;
● the integrated cloud environment of the participating countries to support users and experiments (NICA, ALICE, BESIII, NOvA, Baikal-GVD, JUNO, etc.);
● the HybriLIT [5] platform with the "Govorun" supercomputer as the main HPC resource.

Figure 1. Diagram of the MICC structure

2. Engineering infrastructure

The MICC computing facilities at MLIT JINR are hosted in a single computing hall of 900 m² of floor-space on the 2nd floor of the MLIT building. The hall was built in the late 1970s for hosting mainframe computers. After a number of refurbishments throughout the years, it currently consists of eight separate IT equipment modules (fig. 2) with 2 MW of power and slightly different features:
● Module 1 and Module 2: 22.55 m² of floor-space each, 33 server racks and 20 kW of power per rack;
● Tier1 module: 29.33 m² of floor-space, 16 server racks and 35 kW of power per rack;
● Tape library space: 13 m² of floor-space, two installations, IBM TS3500 and IBM TS4500, with a total capacity of 51 PB;
● "Govorun" supercomputer: 1.97 m² of floor-space, 4 racks and 100 kW of power per rack;
● Small module that hosts critical services of a standard business computing type (administrative systems, databases, etc.);
● Module 4: 36.12 m² of floor-space, 20 server racks and 35 kW of power per rack;
● Network equipment module that hosts the main network services for the MICC and the JINR local and wide area networks.

Six of these modules mainly host the "physics processing" of different experiments and employ different computing technologies such as grid, cloud and HPC.

Figure 2. Layout of the MICC hall equipment

All power feeds are UPS-backed with an approximate autonomy of 10-15 minutes. In addition, there are two diesel generator backups for critical services. The majority of racks are now equipped with intelligent (switched and metered) power distribution units (PDUs), which enables fine-grained monitoring of power consumption. The Tier1 module and Module 4 are air-cooled, with in-row cooling units arranged between the server racks and cold-aisle containment. Modules 1 and 2 are also air-cooled; the cold air is blown through large ducts underneath the false floor, where it diffuses into cold aisles through perforated floor tiles. The "Govorun" supercomputer is fully "hot" water-cooled, which allows for a power density of 100 kW per rack and a PUE of 1.06. All engineering equipment that provides both the guaranteed energy supply to the MICC and the cooling system is located on the first and basement floors of the building. Only the chillers and diesel generators are located on the territory adjacent to the MLIT building.
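Since the metered PDUs expose per-rack power readings, a figure such as the PUE quoted above can be tracked continuously. The short Python sketch below only illustrates the PUE formula (total facility power divided by IT equipment power); the readings in it are invented example values, not MICC measurements.

```python
# Illustration of the PUE formula only; the power readings below are
# invented example values, not actual MICC measurements.
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

if __name__ == "__main__":
    # Example: an IT load of 400 kW with 24 kW of cooling/distribution overhead
    # gives a PUE close to the 1.06 reported for the water-cooled "Govorun" racks.
    print(round(pue(total_facility_kw=424.0, it_equipment_kw=400.0), 2))  # -> 1.06
```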
3. Networking

One of the most important components of the MICC, providing access to its resources and enabling experimental data processing and computing, is the network infrastructure. Within this work, it is necessary to ensure the reliable and fault-tolerant operation of all network components of the infrastructure: the external telecommunication channels, the JINR backbone network with the multisite cluster network, and the local MICC network. At present, the external telecommunication channels comprise the JINR-Moscow optical link with a capacity of 3x100 Gbit/s; the direct JINR-CERN 100 Gbit/s link for the LHCOPN, which connects all WLCG Tier1 centers with the CERN Tier0 and with each other; the JINR-Amsterdam 100 Gbit/s link for the LHCOPN, LHCONE and GEANT networks; and direct channels of up to 100 Gbit/s, based on RU-VRF technology, for communication with the RUHEP collaboration (Gatchina, NRC Kurchatov Institute, Protvino) and with the RUNNet and RASnet networks. IPv6 routing for the Tier1 and Tier2 sites is implemented.

The local area network (LAN) comprises the 2x100 Gbit/s JINR backbone and the distributed multisite cluster network between the DLNP and VBLHEP sites (4x100 Gbit/s). Within the internal MICC network, the Tier1 segment is built on a Brocade fabric with a throughput of 80 Gbit/s, while the EOS data storage, Tier2, cloud environment and "Govorun" supercomputer segments are built on Dell and Cisco equipment. Server components are connected to access-level switches via ports of up to 10 Gbit/s and 100 Gbit/s; the MICC network core is built on Cisco Nexus 9504 and Nexus 9336C switches with N x 100 Gbit/s port bandwidth.

The internal network of the "Govorun" supercomputer consists of three main parts: a communication and transport network, a control and monitoring network, and a job control network. The communication and transport network uses Intel Omni-Path 100 Gbit/s technology and is built on a fat-tree topology based on 48-port Intel Omni-Path Edge 100 Series switches with full liquid cooling. The control and monitoring network unifies all compute nodes and the control node into a single Fast Ethernet network built on HP 2530-48 switches. The job control network connects all compute nodes and the control node into a single Gigabit Ethernet network built on HPE Aruba 2530 48G switches.

4. Grid infrastructure

The first-level grid resource center (T1_RU_JINR) is now used to process and store data for the CMS experiment and to perform simulations for the NICA project. At present, it provides 16,096 cores with a total performance of 253,135.18 HEP-SPEC06, i.e. an average of 15.73 HEP-SPEC06 per core. The software and compilers used are CentOS/Scientific Linux release 7.9, gcc (GCC) 4.4.7, g++ (GCC) 4.4.7 and GNU Fortran (GCC) 4.4.7, with dCache 5.2 for disk storage, Enstore 6.3 for the tape libraries, and FTS for file transfers. At the end of 2020, the processing resources of all JINR grid sites were migrated from the CREAM-CE and Torque/Maui services to the ARC Compute Element (ARC-CE) [6] and the SLURM [7] batch system (adapted to Kerberos and AFS). This involved the migration of more than 20,000 CPU cores, as well as new policies for both the CE and the batch system.
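After the migration, the batch resources of the grid sites are visible through the standard SLURM client tools. The following Python sketch is given only as an illustration: it assumes a host with the SLURM clients installed, and the partition layout and policies are site-specific and not described in this paper.

```python
# A minimal sketch (not the production setup) of inspecting a SLURM batch
# system such as the one the JINR grid sites migrated to; it only assumes
# that the standard SLURM client tools are available on the host.
import subprocess

def partition_summary():
    """Return (partition, availability, node count, CPU states) tuples."""
    # %P partition, %a availability, %D node count, %C CPU states as alloc/idle/other/total
    out = subprocess.run(
        ["sinfo", "-h", "-o", "%P %a %D %C"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(line.split()) for line in out.splitlines() if line]

if __name__ == "__main__":
    for part, avail, nodes, cpus in partition_summary():
        alloc, idle, other, total = cpus.split("/")
        print(f"{part:<14} {avail:<6} nodes={nodes:<6} "
              f"CPUs: {alloc} allocated / {idle} idle / {total} total")
```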
To support NICA computing, FairSoft, FairRoot and MpdRoot were installed. In terms of performance (number of processed events, jobs per site, etc.), the JINR Tier1 ranks second among the WLCG (Worldwide LHC Computing Grid) [8] Tier1 centers serving the CMS experiment (FNAL, JINR, CCIN2P3, KIT, CNAF, RAL, PIC), accounting for about 22% of the total CPU work. One of the main functions of Tier1-level centers is to provide data exchange with all global sites that run CMS jobs. Since the beginning of the year, 8.5 PB of data from more than 180 grid sites has been transferred to the JINR Tier1, and more than 10 PB of data has been downloaded from it.

The JINR Tier2 site is the most productive one in the Russian Data Intensive Grid (RDIG) consortium [9]: over 61% of the total CPU time in RDIG is delivered by the JINR Tier2. It provides ~7,700 cores with a total performance of 121,076.99 HEP-SPEC06. The Tier2 resources are used by all four LHC experiments, as well as by the NICA experiments, ILC, NOvA, BES, JUNO, etc. The software and the batch system are the same as for Tier1.

5. Cloud computing

The task of the JINR cloud [10] is twofold: firstly, to provide JINR users with cloud services for conducting research, and secondly, to ensure the functioning and expansion of the distributed computing environment based on the cloud resources of the JINR Member States.

At present, the cloud comprises 200+ physical servers, 5,000+ non-hyperthreaded CPU cores and 60+ TB of RAM, with KVM as the only hypervisor. Cloud monitoring relies on a custom OpenNebula collector for the Prometheus time-series database, the Prometheus Alertmanager and Grafana dashboards. Three Ceph-based storage systems are in operation:
● a general-purpose storage for VMs (RADOS block devices, CephFS and object storage) – 1.1 PiB of raw disk space with 3x replication;
● a storage dedicated to the NOvA experiment – 1.5 PiB of raw disk space with 3x replication;
● an SSD-based storage for VMs with intensive disk I/O – 419 TiB of raw disk space with 3x replication.
All three Ceph-based storages are monitored using the Prometheus plugin, the Prometheus Alertmanager and Grafana dashboards.

One of the most important trends in cloud technologies is the development of methods for integrating various cloud infrastructures. In order to join the cloud resources of partner organizations from the JINR Member States for solving common tasks, as well as to distribute peak loads across them, the DIRAC Interware was put into operation [11]; it integrates the JINR cloud with the clouds of the partner organizations. The geography of the organizations that share part of their resources within the distributed cloud infrastructure is illustrated in Figure 3.

Figure 3. JINR Member States' cloud environment: participants

The main users of the cloud are the Baikal-GVD, BESIII, DayaBay, JUNO, NOvA and DUNE experiments. At the end of 2020, Baikal-GVD started using the clouds via DIRAC for Monte Carlo simulation. The cloud resources of the Institute and of the organizations of its participating countries that are free from the main scientific computing workload are successfully involved in COVID-19 research within the Folding@home platform.
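The DIRAC-based integration described above means that the partner clouds receive ordinary DIRAC jobs. As a purely illustrative sketch, a job could be submitted through the standard DIRAC Python API as shown below; the job name, payload script and the commented-out site tag are hypothetical placeholders, not the actual JINR DIRAC configuration.

```python
# A minimal, hypothetical sketch of submitting a job through the DIRAC
# Python API (the actual JINR workflows, site tags and executables differ).
from DIRAC.Core.Base.Script import Script
Script.parseCommandLine()  # initialise the DIRAC client environment

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("mc_simulation_example")      # hypothetical job name
job.setExecutable("run_simulation.sh")    # hypothetical payload script
job.setCPUTime(3600)                      # requested CPU time in seconds
# A destination site could be pinned explicitly; the tag here is a placeholder.
# job.setDestination("CLOUD.Example.org")

result = Dirac().submitJob(job)
print(result)   # {'OK': True, 'Value': <job id>} on success
```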
6. HybriLIT platform and the "Govorun" supercomputer

The unified software and information environment of the HybriLIT platform (fig. 4) can be used as an educational platform and as a testbed aimed at exploring the possibilities of novel computing architectures and IT solutions and at developing and debugging user applications before running them on the supercomputer, which enables the efficient use of the supercomputer resources. In the HybriLIT environment, the latest versions of over 20 software packages, in particular GSL, FairSoft, FairRoot, PyROOT with add-ons for BmnRoot and MpdRoot, SMASH, Valgrind, ABINIT, Wien2k, Amber, AmberTools, DIRAC, ELPA, FLUKA, LAMMPS, FreeSurfer, FSL, MRIConvert, GROMACS, FORM, SMILEI, COMSOL, Maple, Mathematica, etc., are installed and supported at the request of user groups.

Figure 4. Unified software and information environment of the HybriLIT platform

The main HPC resource of the MICC is the "Govorun" supercomputer. The second stage of its upgrade took place in 2019; its total peak performance is now 1.7 PFlops in single precision and 860 TFlops in double precision, and its 288 TB storage system provides an I/O speed of over 300 GB/s. Like most supercomputers, "Govorun" comprises several components:
● a GPU component: 5 NVIDIA DGX-1 servers, each with 8 NVIDIA Tesla V100 (Volta) GPUs, 40,960 CUDA cores per server and the NVLink 2.0 interconnect (a bandwidth of up to 300 GB/s);
● a CPU component: 21 RSC Tornado nodes based on Intel® Xeon Phi™ (Intel® Xeon Phi™ 7290 processors with 72 cores, Intel® Server Board S7200AP, Intel® SSD DC S3520 (SATA, M.2), 96 GB of DDR4-2400 RAM, an Intel® Omni-Path 100 Gbit/s adapter) and 88 RSC Tornado nodes based on 2nd generation Intel® Xeon® Scalable processors (Intel® Xeon® Platinum 8268 processors with 24 cores, Intel® Server Board S2600BP, Intel® SSD DC S4510 (SATA, M.2), 2x 2 TB Intel® SSD DC P4511 (NVMe, M.2), 192 GB of DDR4-2933 RAM, an Intel® Omni-Path 100 Gbit/s adapter);
● 14 storage modules of the fast scalable parallel file systems (Lustre, EOS, etc.) based on 2 TB Intel® SSD DC P4511 (NVMe, M.2) drives, with a total capacity of 288 TB;
● 4 special nodes, each with 12 high-speed, low-latency Intel® Optane™ SSD DC P4801X 375 GB M.2 drives with Intel® Memory Drive Technology (IMDT), which provides 4.2 TB of storage for very hot data per server.

The CPU component of "Govorun" is a hyperconverged software-defined system; its DAOS segment (10 nodes) ranks 16th in the current edition of the IO500 list (July 2021). The "Govorun" system now offers unique flexibility in customizing user jobs, ensuring the most efficient use of the supercomputer's computing resources.

The resources of the "Govorun" supercomputer are used by scientific groups from all the Laboratories of the Institute within 25 themes of the JINR Topical Plan for solving a wide range of tasks in the field of theoretical physics, as well as for the modeling and processing of experimental data. First of all, the resources are used to study the properties of quantum chromodynamics (QCD) and of Dirac semimetals within a tight-binding model under extreme external conditions using lattice modeling. This study entails the inversion of large matrices, which is performed on GPUs, as well as massively parallel CPU calculations implementing the quantum Monte Carlo method. Other major uses of the resources are storing, processing and analyzing data within the NICA megascience project [12].
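To illustrate how a user job is typically customized on a batch-scheduled HPC system of this kind, the sketch below generates and submits a SLURM batch script from Python. It is a generic example under explicit assumptions: the use of SLURM on "Govorun", as well as the partition name, resource requests and executable, are placeholders not taken from this paper.

```python
# A generic, hypothetical sketch of submitting a batch job to a SLURM-managed
# HPC cluster from Python; partition name, resource requests and the
# executable are placeholders, not the actual "Govorun" settings.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
#SBATCH --job-name=lattice_test      # hypothetical job name
#SBATCH --partition=cpu              # placeholder partition
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00

srun ./my_solver input.cfg           # placeholder MPI executable
"""

def submit(script_text: str) -> str:
    """Write the batch script to a temporary file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script_text)
        path = f.name
    out = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return out.stdout.strip()   # e.g. "Submitted batch job 123456"

if __name__ == "__main__":
    print(submit(JOB_SCRIPT))
```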
To speed up the simulation and reconstruction of events for the NICA MPD experiment, a hierarchical data processing scheme was implemented on the "Govorun" supercomputer. MPD events are simulated and reconstructed on the ultrafast data storage system managed by the Lustre file system and are subsequently transferred to semi-cold storage and to the tape library for long-term storage. About 2 million events have been generated for the MPD experiment using this hierarchical scheme. The calculations on the upgraded supercomputer are 1.45 times faster than on the previous configuration.

The HybriLIT platform is widely used for investigations based on machine and deep learning technologies. Examples are the use of computer vision algorithms to accelerate experimental data processing by automating the morphological classification of neural cells, track interpretation and reconstruction, etc. Research on the development of applied quantum algorithms using a quantum simulator has started within the project "Superheavy nuclei and atoms: the limits of the masses of nuclei and the boundaries of D.V. Mendeleev's Periodic Table". The pie charts in Figure 5 show the distribution of the "Govorun" resources among groups of users.

Figure 5. Distribution of the "Govorun" resources by user groups: the NICA experiments BM@N and MPD; BLTP – Bogoliubov Laboratory of Theoretical Physics; LRB – Laboratory of Radiation Biology; FLNR – Flerov Laboratory of Nuclear Reactions; LIT – Meshcheryakov Laboratory of Information Technologies

7. Data storage

The MICC data storage operates on two systems: the IBM TS3500 and IBM TS4500 tape libraries, with a total capacity of 51 PB, used for long-term and archival data storage, and the ~7 PB EOS [13] disk storage intended for storing and accessing large amounts of information, including distributed collective data generation, raw data storage, data conversion and analysis. EOS is shared by the Tier1, Tier2, cloud and HybriLIT platform users.

8. Monitoring and accounting

The integrated monitoring system of the MICC collects information from various components of the computing complex (engineering infrastructure, network, compute nodes, batch systems, data storage elements, grid services), which helps guarantee a high level of reliability of the entire MICC. The Litmon monitoring system is modular and distributed [14]; the role of this multi-level monitoring system is to combine the existing systems and to provide high-level information about the computing complex and its services. Recently, an accounting system for the JINR Tier1 and Tier2 grid sites [15] was developed on top of Grafana dashboards and integrated into the Litmon monitoring system.
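For illustration only, the following Python sketch shows how host states could be pulled from an Icinga2-based monitoring instance (the technology Litmon builds on, see [14]) through the standard Icinga2 REST API. The endpoint URL, credentials and certificate path are placeholders, and the actual Litmon integration is considerably more elaborate.

```python
# An illustrative sketch (not the actual Litmon code) of reading host states
# from an Icinga2 instance via its standard REST API; URL, credentials and
# certificate handling are placeholders.
import requests

ICINGA_URL = "https://icinga.example.org:5665/v1/objects/hosts"  # placeholder
AUTH = ("api_user", "api_password")                              # placeholder

def host_states():
    """Return a list of (host name, state) pairs; for hosts, 0 = UP, 1 = DOWN."""
    resp = requests.get(
        ICINGA_URL,
        auth=AUTH,
        headers={"Accept": "application/json"},
        verify="/etc/icinga2/ca.crt",   # placeholder path to the API CA certificate
    )
    resp.raise_for_status()
    return [(r["attrs"]["name"], r["attrs"]["state"]) for r in resp.json()["results"]]

if __name__ == "__main__":
    for name, state in host_states():
        print(f"{name}: {'UP' if state == 0 else 'DOWN'}")
```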
9. Conclusion

The development of the JINR distributed environment is aimed at creating a technological frame that enables scientific research at JINR to be conducted in a unified information and computing environment incorporating a multitude of technological solutions, concepts and practices. Such an environment has to combine supercomputer (software-defined hyperconverged server) solutions, grid technologies, cloud computing and other systems to provide the best approaches for solving different kinds of scientific and applied tasks. The essential requirements for this environment are scalability, interoperability and adaptability to new technical solutions. The transition to distributed experimental data processing and storage based on modern technologies is a necessary condition for the successful participation of physicists of JINR and of the JINR Member States' institutes in the NICA project at JINR, as well as in other worldwide experiments and applied studies performed in collaboration with JINR scientists. It should be mentioned that HPC is also needed for theoretical investigations.

Over the last few years, the renovation of the engineering infrastructure (power supply and cooling systems) has been in progress. We have gradually modernized the local area and MICC networks and configured the 4x100 Gbit/s multisite cluster network for the IT ecosystem of the NICA project. We are constantly enhancing the performance of the JINR Tier1 and Tier2 grid sites in accordance with the requirements of the experiments at the LHC, as well as increasing the HDD-based data storage capacity to 7.35 PB and the tape robot capacity to 51 PB. The "Govorun" supercomputer currently has a total performance of 0.86 PFlops in double precision, and its performance is projected to grow from year to year.

The MICC is a dynamically evolving IT platform that responds to the rapidly developing IT world. The promising directions of modern information technologies are Artificial Intelligence and Robotics, as well as Quantum Technologies and Big Data Analytics. The development of the scientific IT ecosystem will depend on novel technologies for acquiring, analyzing and sharing data. Thus, such a system must be very flexible and open to new computing methods, such as quantum and cognitive computing, machine learning and data mining, as well as to the development of new algorithmic bases.

References

[1] A.G. Dolbilov, I.A. Kashunin, V.V. Korenkov et al. Multifunctional Information and Computing Complex of JINR: Status and Perspectives. CEUR Workshop Proceedings. 2019. V. 2507. P. 16-22. Available at: http://ceur-ws.org/Vol-2507/16-22-paper-3.pdf
[2] NICA (Nuclotron-based Ion Collider fAcility): http://nica.jinr.ru/
[3] DIRAC Interware: https://dirac.readthedocs.io/en/latest/index.html
[4] A.S. Baginyan, A.I. Balandin, A.G. Dolbilov et al. Grid at JINR. CEUR Workshop Proceedings. 2019. V. 2507. P. 321-325. Available at: http://ceur-ws.org/Vol-2507/321-325-paper-58.pdf
[5] HybriLIT Platform: https://micc.jinr.ru/?id=30
[6] ARC Compute Element (CE): https://www.nordugrid.org/arc/ce/
[7] SLURM documentation: https://slurm.schedmd.com/documentation.html (accessed 15.07.2021)
[8] The Worldwide LHC Computing Grid (WLCG): http://wlcg.web.cern.ch/LCG
[9] A. Soldatov, V. Korenkov, V. Ilyin. Russian Segment of the LCG Global Infrastructure. Open Systems, N1, 2003 (in Russian)
[10] N.A. Balashov, A.V. Baranov, N.A. Kutovskiy et al. Present status and main directions of the JINR cloud development. CEUR Workshop Proceedings. 2019. V. 2507. P. 185-189. Available at: http://ceur-ws.org/Vol-2507/185-189-paper-32.pdf
[11] N.A. Balashov, R.I. Kuchumov, N.A. Kutovskiy et al. Cloud integration within the DIRAC Interware. CEUR Workshop Proceedings. 2019. V. 2507. P. 256-260. Available at: http://ceur-ws.org/Vol-2507/256-260-paper-45.pdf
[12] D.V. Belyakov, A.G. Dolbilov, A.N. Moshkin et al. Using the "Govorun" supercomputer for the NICA megaproject. CEUR Workshop Proceedings. 2019. V. 2507. P. 16-22. Available at: http://ceur-ws.org/Vol-2507/16-22-paper-3.pdf
[13] EOS Open Storage: http://eos.web.cern.ch/
[14] I. Kashunin, V. Mitsyn, V. Trofimov, A. Dolbilov. Integration of the Cluster Monitoring System Based on Icinga2 at JINR LIT MICC. PEPAN Letters. 2020. V. 17, No. 3(228). P. 345-352. Available at: http://www1.jinr.ru/Pepan_letters/panl_2020_3/14_kashunin.pdf
[15] I.A. Kashunin, V.V. Mitsyn, T.A. Strizh. JINR WLCG Tier1 & Tier2/CICC accounting system. Ibid.