=Paper=
{{Paper
|id=Vol-3041/275-279-paper-51
|storemode=property
|title=Quantitative and Qualitative Changes in the JINR Cloud Infrastructure
|pdfUrl=https://ceur-ws.org/Vol-3041/275-279-paper-51.pdf
|volume=Vol-3041
|authors=Nikita Balashov,Igor Kuprikov,Nikolay Kutovskiy,Alexandr Makhalkin,Yelena Mazhitova,Roman Semenov
}}
==Quantitative and Qualitative Changes in the JINR Cloud Infrastructure==
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

N.A. Balashov 1, I.S. Kuprikov 2, N.A. Kutovskiy 1,a, A.N. Makhalkin 1, Ye. Mazhitova 1,3, R.N. Semenov 1,4

1 Meshcheryakov Laboratory of Information Technologies, Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna, Moscow region, 141980, Russia
2 Dubna State University, 19 Universitetskaya str., Dubna, Moscow region, 141980, Russia
3 Institute of Nuclear Physics, 050032, 1 Ibragimova str., Almaty, Kazakhstan
4 Plekhanov Russian University of Economics, 36 Stremyanny per., Moscow, 117997, Russia

E-mail: a kut@jinr.ru

The high demand for JINR cloud resources has driven its substantial growth. This growth, in turn, triggered changes needed to overcome the problems encountered and to maintain the quality of service for users: the main part of the computational resources was reorganized into pre-deployed worker nodes of the HTCondor-based computing element to decrease the load on OpenNebula services during mass job submission; a new SSD-based ceph pool was created for RBD VM disks with strict disk I/O requirements; a dedicated ceph-based storage was deployed for the NOvA experiment; the "Infrastructure-as-a-Code" approach was rebuilt from scratch around a role and profile model implemented with the help of Foreman and Puppet; resource monitoring and accounting were migrated to a Prometheus-based software stack; and a number of other changes were made.

Keywords: cloud computing, OpenNebula, clouds integration, cloud bursting, DIRAC, ceph

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The JINR cloud is one of the components of the Multifunctional Information and Computing Complex (MICC) [1] hosted at the Meshcheryakov Laboratory of Information Technologies. The JINR cloud has been actively developed over the last few years, and the amount of its resources has grown substantially. In this regard, some changes in the service architecture were made, and the set of services hosted in the JINR cloud was expanded.

2. Changes in the virtualization part

Over the past few years (since the GRID'2018 conference), the amount of JINR cloud [2] resources has increased considerably. There are now 176 servers for virtual machines (VMs) in total (+96 servers since GRID'2018, contributed by the JUNO and NOvA experiments: 90 and 6 servers respectively), 5,044 non-hyperthreaded CPU cores (+3,400) and 60 TB of RAM (+52 TB). The amount of RAM per CPU core varies from 5.3 GB up to 16 GB.

Thousands of jobs running simultaneously on the increased amount of cloud resources led to the saturation of the 10 Gbps network link that connected the cloud to the JINR backbone (Fig. 1) and resulted in the misbehavior of services deployed in the JINR cloud. Measures were taken, including a switch to faster network equipment with higher bandwidth.

Figure 1. 10 Gbps network link load from 6 to 17 June 2021

Due to the lack of manpower to support the OpenVZ driver for the OpenNebula platform, on which the JINR cloud is based, it was decided to drop support for this driver in the JINR cloud. All users' OpenVZ containers were migrated to KVM-based VMs. Most of the KVM VMs have their disks as block devices in the ceph-based software-defined storage.

3. Changes in the storage part

In addition to the general-purpose ceph-based storage with a total raw capacity of 1.1 PiB, two new storage elements (SEs) were deployed: the first serves the needs of the NOvA experiment only, and the second is a pure SSD-based ceph storage for a set of production services and users with high demands on disk I/O. The main parameters of all these cloud SEs are listed in Table 1.

Table 1. Cloud storage elements characteristics

{| class="wikitable"
|-
! Name !! Disk type !! Consumers !! Ceph version !! Raw capacity, PiB !! Replication !! Connectivity
|-
| Regular cloud storage || HDD || all || 14.2.21 || 1.1 || 3x || 2x10GBase-T
|-
| NOvA storage || HDD || NOvA exp. || 15.2.11 || 1.5 || 3x || 2x10GBase-T
|-
| Fast cloud storage || SSD || High disk I/O users || 15.2.13 || 0.419 || 3x || 4x10GBase-T (bonding) + 2x100Gbps (to be connected)
|}
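The replication factor directly determines how much of the raw capacity in Table 1 is actually available to users. A minimal sketch of this arithmetic, assuming plain 3x replication and ignoring ceph overheads such as near-full ratios and metadata (the helper function and pool dictionary below are illustrative only, not part of the JINR tooling):

```python
# Illustrative only: usable capacity of a replicated ceph pool is roughly
# the raw capacity divided by the replication factor.

def usable_capacity_pib(raw_pib: float, replication: int) -> float:
    """Approximate usable capacity of a replicated pool, in PiB."""
    return raw_pib / replication

# Figures taken from Table 1: (raw capacity in PiB, replication factor).
pools = {
    "Regular cloud storage": (1.1, 3),
    "NOvA storage": (1.5, 3),
    "Fast cloud storage": (0.419, 3),
}

for name, (raw, repl) in pools.items():
    print(f"{name}: ~{usable_capacity_pib(raw, repl):.2f} PiB usable "
          f"out of {raw} PiB raw ({repl}x replication)")
```

Under these assumptions the 1.1 PiB general-purpose pool, for example, yields roughly 0.37 PiB of usable space.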
4. Monitoring, alerting and accounting

JINR cloud servers and some of its services are monitored with the Nagios software [3]. Apart from that, JINR cloud metrics are gathered by a custom collector developed several years ago at JINR, which stores the collected data in the InfluxDB time series database (TSDB). Moreover, metrics from all cloud ceph servers are aggregated using the ceph prometheus plugins and the Prometheus TSDB [4].

As one can see, the set of software components used for collecting JINR cloud metrics is quite wide, and keeping it consistent and up to date takes some effort. To reduce this effort, it was decided to switch to a Prometheus-centric software stack: node_exporter instances were deployed on all cloud servers to provide Prometheus scrapers with server state data, and the OpenNebula collector was modified so that it can provide metrics to the Prometheus TSDB in addition to InfluxDB. Alerting is implemented at the Prometheus level, and Grafana is used for data visualization.

JINR cloud accounting is based on the data gathered by the OpenNebula collector. Its visualization is performed using Grafana dashboards (an example is given in Fig. 2).

Figure 2. Grafana dashboard with JINR cloud metrics

Another source of information about the JINR cloud is the OpenNebula log files. They are collected with the help of filebeat [5] and sent to ElasticSearch via logstash. Data visualization is carried out in Kibana [6].
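The paper does not show the internals of the modified OpenNebula collector; the sketch below only illustrates the general pattern described above, a custom collector exposing a scrape endpoint for Prometheus, using the prometheus_client Python library. The metric names, label, port and values are hypothetical.

```python
# Minimal sketch of a custom metrics exporter for Prometheus scrapers.
# The gauges and their values are placeholders; a real collector would
# query the cloud platform's API instead of hard-coding numbers.
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical gauges a cloud collector might expose.
running_vms = Gauge("cloud_running_vms", "Number of running VMs", ["cluster"])
used_cpu_cores = Gauge("cloud_used_cpu_cores", "CPU cores allocated to VMs", ["cluster"])

def collect_once() -> None:
    """Refresh the exported metrics (placeholder values for illustration)."""
    running_vms.labels(cluster="main").set(1234)
    used_cpu_cores.labels(cluster="main").set(4100)

if __name__ == "__main__":
    start_http_server(8000)   # expose /metrics for Prometheus to scrape
    while True:
        collect_once()
        time.sleep(60)        # refresh the metrics every minute
```

Prometheus would then be pointed at such an endpoint via a scrape job, with alerting rules and Grafana dashboards built on top of the stored series.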
5. Infrastructure management and hardware inventory

The JINR cloud is managed following the "Infrastructure-as-a-Code" approach, whereby host provisioning and management are done through machine-readable definition files. For this, the Foreman [7] and Puppet [8] software are used. All host definition files (called "manifests" in Puppet terms) implement the role and profile method. Manifests are stored in the JINR git version control system [9]. Sensitive information (RSA/DSA keys, passwords, etc.) is kept in HashiCorp Vault [10].

The hardware inventory of the JINR cloud is performed on the basis of the iTop software [11], which implements Information Technology Service Management (ITSM) and IT Infrastructure Library (ITIL) practices. All data on the JINR cloud hardware (server vendors, models, locations, statuses, RAID controllers, installed network cards, spare parts, used and free IP addresses, and much more) is kept in the iTop-based service.

6. Resource utilization

JINR cloud resources are used to solve a wide range of tasks: various services for the NOvA/DUNE, Baikal-GVD, JUNO and Daya Bay experiments; COMPASS production system [12] components; the UNECE ICP Vegetation data management system; a service for the detection of diseases of agricultural crops using advanced machine learning approaches; a service for scientific and engineering computations; a Grafana-based service for data visualization; a JupyterHub head node and its execute nodes; GitLab and its runners, as well as a few others. Brief information about some of them is provided below.

6.1. Neutrino computing platform (NCP)

Within the cooperation of the Dzhelepov Laboratory of Nuclear Problems (DLNP) and the Meshcheryakov Laboratory of Information Technologies (MLIT), a computing platform for neutrino experiments (NCP) was created. It consists of a set of HTCondor-based services (submit nodes, a cluster manager, worker nodes, computing elements) for the NOvA, DUNE and JUNO experiments, several general-purpose interactive virtual machines, and a forum for the Baikal-GVD experiment. At the time of writing, 2,000 CPU cores for JUNO users and 1,020 CPU cores for NOvA and DUNE users are exposed via HTCondor-CE.

To optimize the utilization of NCP resources, it was proposed to share them among the most resource-consuming DLNP neutrino experiments (NOvA, DUNE, JUNO and Baikal-GVD). The Cloud Meta-Scheduler (CMSched) [13] is intended to implement such sharing by dynamically scaling the HTCondor cluster on demand. The CMSched prototype has been deployed and is now under testing.
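The actual scaling logic of CMSched is described in [13]. Purely as an illustration of the kind of signal on-demand scaling of an HTCondor cluster can react to, the sketch below counts idle jobs via the HTCondor Python bindings; the threshold and the provisioning hook are hypothetical and are not part of CMSched.

```python
# Illustrative sketch only: counting idle HTCondor jobs as a trigger for
# scaling worker nodes up. This is not the CMSched implementation.
import htcondor

IDLE_THRESHOLD = 100  # hypothetical: scale up when this many jobs are idle

def count_idle_jobs() -> int:
    """Return the number of idle jobs in the local schedd's queue."""
    schedd = htcondor.Schedd()
    # JobStatus == 1 means "Idle" in HTCondor's job ClassAds.
    idle = schedd.query(constraint="JobStatus == 1", projection=["ClusterId"])
    return len(idle)

def scale_up(n_nodes: int) -> None:
    # Placeholder: a real implementation would ask the cloud (e.g. via the
    # OpenNebula API) to start additional pre-configured worker-node VMs.
    print(f"Would request {n_nodes} extra worker nodes")

if __name__ == "__main__":
    idle = count_idle_jobs()
    if idle > IDLE_THRESHOLD:
        scale_up(idle // IDLE_THRESHOLD)
```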
6.2. Service for scientific and engineering computations

A service for scientific and engineering computations [14] was developed to simplify the usage of JINR MICC resources by providing scientists with an intuitive web interface for running computational jobs. The list of supported applications has been extended considerably and now includes the following:
1) Long Josephson junction (JJ) stack simulation;
2) Superconductor-Ferromagnetic-Superconductor Josephson junction simulation;
3) Annular array of JJs average;
4) Long Josephson junction coupled with a ferromagnetic thin film;
5) Stack of short JJs;
6) Stack of short JJs with LC shunting.

7. Conclusion

JINR cloud resources are growing, as is the number of its users. Most of the hardware contribution to the JINR cloud is made by the neutrino experiments. These quantitative changes entail changes in the architecture: splitting the ceph storage into several instances, a ceph instance with SSD disks for VMs sensitive to disk I/O, and a network upgrade. The migration from Nagios-based monitoring to a Prometheus-based one is in progress. Work is ongoing to increase the degree of automation of the provisioning and management of JINR cloud servers, as well as of the deployed services, by adding more profiles and roles to the Foreman and Puppet systems.

8. Acknowledgment

The work on the storage for the NOvA experiment is supported by the Russian Science Foundation grant, project #18-12-00271. The work on the development of the service for scientific and engineering computations is supported by the Russian Science Foundation grant, project #18-71-10095.

References

[1] A.G. Dolbilov et al. Multifunctional Information and Computing Complex of JINR: Status and Perspectives // Proceedings of the 27th International Symposium NEC'2019, Budva, Montenegro, vol. 2507, 2019, pp. 16-22
[2] N.A. Balashov et al. Present Status and Main Directions of the JINR Cloud Development // Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC'2019), CEUR Workshop Proceedings, ISSN: 1613-0073, vol. 2507, 2019, pp. 185-189
[3] Nagios monitoring software official web portal. Available at: https://www.nagios.org (accessed 03.09.2021)
[4] Prometheus web portal. Available at: https://prometheus.io (accessed 03.09.2021)
[5] Home page of the filebeat component of the ElasticSearch software stack. Available at: https://www.elastic.co/beats/filebeat (accessed 06.09.2021)
[6] N.A. Balashov et al. Using ELK Stack for Event Log Acquisition and Analysis // Modern Information Technologies and IT Education, vol. 17, no. 1, 2021, ISSN: 2411-1473, pp. 125-134. DOI: 10.25559/SITITO.17.202101.731
[7] Foreman software web portal. Available at: https://theforeman.org (accessed 06.09.2021)
[8] Puppet software web portal. Available at: https://puppet.com (accessed 06.09.2021)
[9] JINR git portal. Available at: https://git.jinr.ru (accessed 06.09.2021)
[10] HashiCorp Vault web portal. Available at: https://www.vaultproject.io (accessed 06.09.2021)
[11] iTop software web portal. Available at: https://www.combodo.com/itop-193 (accessed 06.09.2021)
[12] A. Petrosyan. COMPASS Production System Overview // EPJ Web of Conferences, vol. 214, 2019. DOI: 10.1051/epjconf/201921403039
[13] N. Balashov, N. Kutovskiy, N. Tsegelnik. Resource Management in Private Multi-Service Cloud Environments // to appear in the Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021
[14] N. Balashov et al. JINR Cloud Service for Scientific and Engineering Computations // Modern Information Technologies and IT Education, vol. 14, no. 1, 2018, pp. 57-68. DOI: 10.25559/SITITO.14.201801.061-072