Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                            Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


RESOURCE MANAGEMENT IN PRIVATE MULTI-SERVICE
            CLOUD ENVIRONMENTS
                          N. Balashov1,a, N. Kutovskiy1, N.Tsegelnik2
   1
       Meshcheryakov Laboratory of Information Technologies, Joint Institute for Nuclear Research,
                             6 Jolio-Curie st., Dubna, 141980, Russia
       2
           Bogoliubov Laboratory of Theoretical Physics, Joint Institute for Nuclear Research,
                              6 Jolio-Curie st., Dubna, 141980, Russia

                                          E-mail: a balashov@jinr.ru


The JINR cloud infrastructure hosts a number of cloud services to facilitate scientific workflows of
individual researchers and research groups. Although batch processing systems are still the major
compute power consumers of the cloud, new auxiliary cloud services and tools are being adopted by
researchers and are gradually changing the landscape of the cloud environment. While such services,
in general, are not so demanding in terms of computational capacity, they can have spikes of demand
and can dynamically scale to keep the service availability at a reasonable level. Moreover, these
services may need to compete for resources due to the limited capacity of the underlying
infrastructure. This paper discusses how resource distribution can be managed in such a dynamic
environment with the help of a cloud meta-scheduler.

Keywords: cloud computing, virtualization, distributed computing


                                                     Nikita Balashov, Nikolay Kutovskiy, Nikita Tsegelnik


                                                                Copyright © 2021 for this paper by its authors.
                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                                       64
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                            Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


1. Introduction
        The JINR cloud [1] is built on the OpenNebula platform, which implements the Infrastructure-
as-a-Service model, and is used to provide virtual machines on an individual basis to users (who are
mainly researchers and engineers from JINR and partner organizations), as well as to host some multi-
user systems and provide them as cloud services. Examples of such cloud services include GitLab with
its Continuous Integration tooling, the HTCondor batch cluster and the JupyterHub virtual cluster. The
services consist of a number of virtual machines playing different roles in these systems, and their
structure is shown in Figure 1.
        The cloud provides two types of resources: shared resources, which are in common use by all
JINR participants, and resources of the so-called Neutrino Platform, which are owned by several
neutrino experiments JINR participates in.


                      Figure 1. JINR cloud and example of the services structure

         The abovementioned cloud services are sometimes underutilized [2] for different reasons,
partially due to their different usage models. When services have a fixed amount of resources
provided, the underutilization of resources in one service results in the underutilization of the
underlying hardware, even though idle resources can be utilized by other services in such cases. In the
following sections, we will describe a possible approach to dealing with cloud services resource
underutilization using dynamic resource redistribution with the help of the Cloud Meta-scheduler we
are developing.


2. Background on resources underutilization
        As mentioned in the introductory part, the underutilization of cloud resources can occur for
various reasons. For example, interactive services (like JupyterHub or interactive nodes of the
HTCondor cluster) are usually underutilized at night or during holidays. Figure 2 illustrates a typical
CPU usage profile of an interactive machine. It is clearly seen that this machine was not used at all at
night and in the morning, then the load increased after lunch and dropped by the end of the working
day. Sometimes people can work at night too.


                                                    65
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                            Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


                    Figure 2. Typical CPU usage profile of an interactive machine

        On the contrary, batch clusters are usually better loaded in terms of hardware usage since
batch jobs do not need sleep and can run for hours or even days without a stop. However, resources
dedicated to individual projects or experiments can also sometimes encounter periods of inactivity (for
example, when data production stops during detector maintenance periods), and these periods can be
quite long, up to a few weeks (Fig. 3).


                 Figure 3. Grid jobs rate on the HTCondor instance of the JINR cloud

         Thus, in most cases, underutilization can be considered normal (and expected) because the
system efficiency can be defined in different ways depending on the purpose of the system in question.
For example, with batch systems we usually want to maximize hardware utilization, while in the case
of interactive systems like JupyterHub, we try to keep the system responsive and for this reason it is
fine to keep a reasonable amount of resources idle and ready to serve incoming users.
        In certain cases, hardware utilization can be easily improved by redistributing resources
between different cloud services. For instance, at night, most of the interactive nodes can be stopped in
favor of additional batch cluster worker nodes reverting everything back in the morning. The same
applies to the owned resources in batch systems: when the experiment knows that its resources will not
be used for a long period of time, these resources can be shared with other experiments using the same
technique, i.e. scaling in unused resources and scaling out systems that need them. Nevertheless,
standard cloud tools do not give us convenient control over the cloud services scaling, taking into
account the interests of all services running in the environment, as well as the interests of different
working groups that own some fraction of cloud resources. To deal with the issue, we started the
Cloud Meta-scheduler project.


3. Cloud Meta-scheduler
         The project goal is to provide resource managers and users of JINR’s cloud services with
convenient tools for managing and monitoring resource distribution between the services and resource
owners, as well as for creating and approving resource lease requests, with a scheduler component in
its core, which handles the actual scaling of the services.


                                                    66
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                            Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


                             Figure 4. System context diagram of the project

        Although the general idea of such a system seems simple, the implementation of the actual
scaling of cloud services encounters intricate details, as different services (or even their different parts)
can be scaled for different reasons. For example, three different components of the HTCondor cluster
may be scaled for the following reasons:
    •   schedulers (virtual machines that operate the job queue) – to maximize the job submission
        rate;
    •   interactive nodes – to keep them responsive;
    •   worker nodes – to improve the throughput of the cluster.
        For this reason, we started the development with the prototype of a meta-scheduler component
to study possible technical solutions, to discover some potential pitfalls and better understand the
requirements for the system under development.
         Python was chosen as the primary development language because of its rapidly growing
popularity in data science, which makes it possible to involve data science students in the development
of the project with the potential to apply data analytics for incorporating more complex scheduling
schemes [2-5]. To implement the microservices approach [6] in the prototype architecture, the Pyro
library [7] was used. It wraps Python objects and allows using them in a distributed system as regular
Python objects, while Pyro takes care of all the network communication. The main components of the
developed prototype (Fig. 5) include:
    •   Scheduler daemon – runs the scheduling loop;
    •   HTCondor API microservice – implements communication with the HTCondor cluster;
    •   Cloud API microservice – implements communication with the JINR cloud.
        The microservices approach gives the system additional flexibility that may be needed for
large-scale deployments. For instance, the HTCondor API implemented as a service can be run on the
same machine as the scheduler and can communicate HTCondor via SSH; however, it can also be run
on the HTCondor scheduler machine, directly executing shell commands to talk to HTCondor and
then communicate back the information to the scheduler over the network using the specialized Pyro
wire protocol.


                                                     67
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                            Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


                        Figure 5. Meta-scheduler prototype components scheme

        The current early version of the prototype implements only the simple automatic scaling of
HTCondor worker nodes, depending on the job queue size: when there are idle jobs in the queue, more
worker nodes are created (if there are common resources available), and once the jobs are completed,
the nodes are deleted.
       The further development of the prototype is planned in the following stages:
   •    Add multi-service support to the scheduler;
   •    Add multi-role services support;
   •    Develop a web interface (most likely based on the Django framework [8]) for users and
        resource managers.


4. Conclusion
        The IT industry and data science are rapidly evolving, new technologies are emerging and
becoming popular, and the changing IT landscape sets new challenges in computing infrastructures
management. To keep up with the evolution of computing models and environments, we need to
develop novel tools to help us efficiently handle their growing complexity. In this paper, we have
described the idea and development course of one such tool designed for the dynamic load-balancing
of multi-service cloud environments to improve computing resources utilization in cloud
environments.


                                                    68
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                            Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


References
[1] N. Balashov, A. Baranov, N. Kutovskiy, A, Makhalkin, Ye. Mazhitova, I, Pelevanyuk, R.
Semenov Present Status and Main Directions of the JINR Cloud Development //Proc. of 27th
International Symposium NEC-2019, Budva, Montenegro. 2019. Vol. 2507. P. 185–189.
[2] M. Armbrust et al. Above the clouds: a Berkeley view of cloud computing //Electrical
engineering and computer sciences, Technical Report No. UCB/EECS-2009-28, University of
California at Berkeley, February 2009.
[3] Jain N., Raghu B., Khanaa V. Probabilistic Model for Resource Demand Prediction in Cloud
//Turkish Journal of Computer and Mathematics Education (TURCOMAT). 2021. – Vol. 12 (6). P.
1766-1771.
[4] Golshani E., Ashtiani M. Proactive auto-scaling for cloud environments using temporal
convolutional neural networks //Journal of Parallel and Distributed Computing. 2021. Vol. 154. P.
119-141.
[5] Nwe K. M., Oo M. K., Htay M. M. Efficient resource management for virtual machine allocation
in cloud data centers //2018 IEEE 7th Global Conference on Consumer Electronics (GCCE). 2018. P.
419-420.
[6] Larrucea X. et al. Microservices //IEEE Software. 2018. Vol. 35 (3). P. 96-100.
[7] Pyro - Python Remote Objects. Available at: https://pyro5.readthedocs.io (accessed 20.08.2021).
[8] Django: The web framework for perfectionists                     with    deadlines.    Available    at:
https://www.djangoproject.com (accessed 20.08.2021).


                                                    69