Smart Cloud Scheduler

N.A. Balashov 1,a, A.V. Baranov 1, I.S. Kadochnikov 1, V.V. Korenkov 1,2, N.A. Kutovskiy 1,2, A.V. Nechaevskiy 1, I.S. Pelevanyuk 1

1 Joint Institute for Nuclear Research, 6 Joliot-Curie street, Dubna, Moscow region, 141980, Russia
2 Plekhanov Russian University of Economics, 36 Stremyanny per., Moscow, 117997, Russia

E-mail: a balashov@jinr.ru

The rapid development of cloud technologies has led to a partial shift of scientific computations to the cloud computing model. At present, many scientific centers establish their own private virtualized datacenters. A cloud infrastructure deployed at the Laboratory of Information Technologies of the Joint Institute for Nuclear Research (JINR) and based on the Infrastructure as a Service (IaaS) model has been extensively used by JINR engineers and researchers for the last few years. The authors of this paper analyzed the experience gained in the usage and management of the JINR cloud service and revealed some weak spots in its functioning. We show that one of the key problems is inefficient utilization of cloud resources, a problem that may also be inherent to other scientific clouds. One possible way to cope with it is dynamic reallocation and consolidation of virtual machines, and we give an overview of the smart cloud scheduling method proposed by the authors earlier. This method allows one to release underloaded cloud resources and to put them back into the pool of free resources. This work reviews several possible strategies for re-using the freed cloud resources and proposes a new strategy based on the integration of the cloud service with a batch system (e.g. HTCondor). The novel strategy makes the re-use of free but allocated resources more stable by means of dynamically deployed additional virtual batch system worker nodes, which can be safely destroyed on demand, releasing the re-used cloud resources.

Keywords: cloud computing, virtualization, self-organization, intelligent control, datacenters, VM consolidation

The work was supported by the RFBR grant 15-29-07027.

© 2016 Nikita N. Balashov, Alexander V. Baranov, Ivan S. Kadochnikov, Vladimir V. Korenkov, Nikolay A. Kutovskiy, Andrey V. Nechaevskiy, Igor S. Pelevanyuk

1. Introduction

Modern cloud platforms are characterized by high elasticity and scalability, and in some cases by universality, which has led to their wide adoption in the commercial sector. In science, cloud technologies continue to gain popularity: they are used to build complex distributed computing systems for processing large volumes of data [Timm et al., 2015] as well as to support the functioning of information systems (IS) and to provide a universal computing resource to end users, giving them simpler access to the computing resources needed for their everyday tasks.

One example of such a scientific cloud computing environment is the cloud service of the Joint Institute for Nuclear Research (JINR) [Baranov, Balashov, Kutovskiy, Semenov, 2016]. The experience accumulated over the last several years of operating this environment was analyzed by JINR experts and led to the conclusion that the computing resources of the JINR cloud service are not used efficiently enough. As a result, the task was set to analyze the causes of the low utilization efficiency and to develop methods and strategies for increasing it.
2. Problem description and its cause (analysis of cloud usage)

The complexity of modern IS and applications, ranging from long-running scientific simulations to transactional applications, makes the problem of estimating sufficient resources almost unsolvable. For this reason most users follow the same strategy when requesting computing resources: request as many resources as possible. This strategy inevitably leads to inefficient use of computing resources from the point of view of the resource provider, because the provided computing capacities remain underutilized.

The JINR cloud service is a universal computing resource whose users can be conditionally split into four groups:

• Software developers;
• System administrators;
• Physicists;
• Automated systems.

The computational load profile of the virtual machines (VMs) of software developers is characterized by fairly high resource utilization during a short period of time. This is generally the result of setting up the system environment and installing the additional software and libraries required for further work. This period usually lasts from one to several days and directly depends on the complexity of the project under development. In the following period the VM shows low resource utilization with periodic short-term load spikes, sometimes reaching peak load. The increased load during this period arises mostly from builds, compilation and testing of the software under development, and the frequency of these spikes directly depends on the intensity of development of the particular project. Towards the end of the main development cycle the frequency of the load spikes generally decreases in proportion to the decreasing intensity of the development process. Such virtual machines often live throughout the whole life cycle of the project and are deleted only after development is finished or when the owner of the VM leaves the project's development team.

The VMs of the first two user groups have one thing in common: outside working hours their load drops significantly.

The load profiles of the VMs of the third group, physicists, are characterized by rather long (sometimes up to a month) intervals of peak load alternating with intervals of near-minimal load (CPU utilization may drop to 0%).

The VMs of the fourth group of JINR cloud users are created automatically by various IS and workload management systems (WMS) of physics experiments. Currently these are the WMS of the BES-III, ALICE and NOvA experiments.

We should also mention the virtual machines hosting production information systems, whose potential users include all JINR employees; the workload of such VMs directly depends on the popularity of the particular IS among its users.

It is clear that such a usage pattern leads to a highly inefficient distribution of cloud resources, and Figures 1 and 2 demonstrate that the problem persists as the JINR cloud service grows.

Fig. 1. JINR cloud CPU load

Fig. 2. JINR cloud memory load

3. Method of dynamic cloud resource reallocation

To solve the above-mentioned problem of inefficient utilization of JINR cloud resources, the authors proposed a method of dynamic reallocation of virtual machines in a cloud environment based on the analysis of historical records of the virtual machines' resource consumption [Balashov, Baranov, Korenkov, 2016]. A minimal sketch of such an analysis is given below.
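The following Python sketch illustrates the general idea of such a historical analysis; it is not the authors' implementation. The 10% threshold, the 30-day window and the caller-supplied fetch_cpu_history function are illustrative assumptions.

    from statistics import mean

    # Illustrative assumptions, not values taken from the paper:
    UNDERUTILIZATION_THRESHOLD = 0.10  # flag VMs below 10% average CPU load
    HISTORY_WINDOW_DAYS = 30           # length of the history window

    def consolidation_candidates(vm_ids, fetch_cpu_history):
        """Return IDs of VMs whose average historical CPU utilization is
        low enough to make them candidates for consolidation through
        overcommitment.

        fetch_cpu_history(vm_id, days) is a caller-supplied function that
        returns CPU utilization samples (fractions in [0, 1]) for the VM,
        e.g. read from the cloud monitoring database."""
        candidates = []
        for vm_id in vm_ids:
            history = fetch_cpu_history(vm_id, HISTORY_WINDOW_DAYS)
            if history and mean(history) < UNDERUTILIZATION_THRESHOLD:
                candidates.append(vm_id)
        return candidates

    # Toy usage with synthetic monitoring data:
    samples = {"vm-1": [0.02, 0.05, 0.03], "vm-2": [0.70, 0.90, 0.60]}
    print(consolidation_candidates(samples, lambda vm, days: samples[vm]))
    # -> ['vm-1']

A real scheduler would also take memory consumption and the short-term load spikes described above into account, as well as the rank system introduced below.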
The method makes it possible to release part of the allocated but underutilized computing resources by consolidating virtual machines with the help of overcommitment technologies. The resources released in this way are supposed to be returned to the pool of free resources, and the workload on this pool is automatically reallocated by the built-in cloud scheduler according to its own algorithms. This approach makes it unnecessary to disable or modify the built-in scheduler of the deployed cloud platform in any way; the only requirement on some platforms (e.g. OpenStack) is that the built-in overcommitment management be disabled.

Re-using the released resources in this way decreases the reliability of virtual machine operation, because a larger amount of virtual resources is deliberately allocated than the physical resources the cloud really has. To keep this reliability under control, the method uses a system of ranks of real and virtual resources, which is essentially a management tool for controlling the overcommitment rates of particular servers and virtual machines.

4. Re-use of computing resources

Applying the suggested method gives rise to a new problem: the proper and safe re-use of the released resources. One way is to return them to the pool of free resources, as suggested above, but it is not the only possible strategy. As an alternative, servers that have no running virtual machines can be put into a low-power mode [Beloglazov, Buyya, 2015], which significantly decreases power consumption.

One more solution, suitable for scientific clouds similar to the JINR cloud service, is to integrate the cloud with batch processing systems, which allows the freed cloud resources to be re-used safely and minimizes the drop in reliability of virtual machine operation when the VMs are highly consolidated on the servers. Some batch processing systems already have the ability to launch computing jobs on IaaS-based cloud infrastructures, e.g. the DIRAC workload management system [McNab, Stagni, Luzzi, 2015]. At the same time, complementary systems are being developed to provide such functionality for batch systems that lack it by default, e.g. HTCondor; examples of such developments include VCondor [IHEP-CC, 2016] and Vcycle [McNab, Love, MacMahon, 2015]. At present a virtual computing cluster based on HTCondor is being deployed at JINR to process the computing jobs of the NOvA neutrino experiment; it will be able to re-use spare resources of the JINR cloud service using the approach described here.

HTCondor is a batch processing system in which computing jobs are distributed among separate computing nodes. A virtual machine can act as a computing node in this system, so a VM configured appropriately in advance can join the HTCondor pool as a computing node at any moment. By default HTCondor has no ability to request additional nodes from a cloud service, but a properly configured virtual machine is able to contact and join the HTCondor computing infrastructure by itself; it is therefore sufficient for the cloud scheduler (when spare resources are available) to launch such a virtual machine, and it will automatically become a computing node of the HTCondor infrastructure, as the sketch below illustrates.
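The following sketch shows how a cloud scheduler could launch, and later destroy, such pre-configured worker VMs. The cloud API used here (instantiate, list_vms, terminate), the template name and the label are hypothetical stand-ins for the actual cloud platform interface; the label anticipates the VM labeling system suggested in the next paragraphs.

    # Hypothetical sketch: launching and destroying transient HTCondor
    # worker VMs from the cloud scheduler. The cloud API, template name
    # and label are illustrative assumptions, not a real platform interface.

    WORKER_TEMPLATE = "htcondor-worker"   # assumed VM template whose contextualization
                                          # starts the condor daemons, which register
                                          # with the pool's central manager on boot
    WORKER_LABEL = "htcondor-transient"   # marks VMs the cloud scheduler may destroy

    def launch_workers(cloud, spare_slots):
        """Launch one worker VM per spare slot; each VM joins the HTCondor
        pool on its own, so no further coordination is needed."""
        return [cloud.instantiate(WORKER_TEMPLATE, labels=[WORKER_LABEL])
                for _ in range(spare_slots)]

    def release_workers(cloud, slots_to_free):
        """Destroy labeled worker VMs to give their resources back to the
        cloud; HTCondor reschedules the affected jobs on remaining nodes."""
        for vm_id in cloud.list_vms(label=WORKER_LABEL)[:slots_to_free]:
            cloud.terminate(vm_id)

    class FakeCloud:
        """Minimal in-memory stand-in for the cloud API, for illustration only."""
        def __init__(self):
            self._vms, self._next_id = {}, 0
        def instantiate(self, template, labels):
            self._next_id += 1
            self._vms[self._next_id] = (template, labels)
            return self._next_id
        def list_vms(self, label):
            return [vid for vid, (_, labels) in self._vms.items() if label in labels]
        def terminate(self, vm_id):
            del self._vms[vm_id]

    cloud = FakeCloud()
    launch_workers(cloud, spare_slots=3)
    release_workers(cloud, slots_to_free=2)
    print(cloud.list_vms(label=WORKER_LABEL))  # one worker VM left

Because the worker nodes carry no persistent state of their own, destroying them on demand is safe: HTCondor simply re-queues the jobs that were running on a destroyed node.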
To release the occupied resources, the scheduler can at any time stop the required number of virtual HTCondor computing nodes.

The scheme described above is the simplest one: it does not take into account the possible absence of jobs in the HTCondor queues. In that case virtual HTCondor computing nodes still get launched, leading to a wasteful consumption of computing power. At the same time, implementing a queue-tracking mechanism in the cloud scheduler itself would cost the scheduler its universality. As a solution to this problem, the authors suggest introducing into the cloud scheduling algorithm a labeling system for the VMs that serve as temporary HTCondor computing nodes, so that the cloud scheduler can stop them whenever resources need to be released, while delegating the management of the virtual nodes and the tracking of the batch queue to a third-party system (e.g. VCondor).

5. Conclusion

Modern cloud platforms are widespread in both commercial and scientific areas. Given the limited amount of resources comprising cloud infrastructures, the problem of their efficient usage inevitably arises. As a possible solution to the task of increasing the efficiency of cloud resource utilization, the authors proposed a new method of smart cloud resource scheduling that exploits dynamic resource reallocation. The method makes it possible to free some amount of the allocated but underutilized resources, which can then be re-used in one of the suggested ways according to the chosen strategy, potentially increasing the efficiency of cloud resource utilization significantly.

Acknowledgements

The work was supported by the RFBR grant 15-29-07027.

References

Timm S. et al. Cloud Services for the Fermilab Scientific Stakeholders // J. Phys.: Conf. Ser. — 2015. — Vol. 664, No. 2.

Baranov A.V., Balashov N.A., Kutovskiy N.A., Semenov R.N. JINR cloud infrastructure evolution // Physics of Particles and Nuclei Letters. — 2016. — Vol. 13, No. 5. — P. 672–675.

Balashov N., Baranov A., Korenkov V. Optimization of over-provisioned clouds // Physics of Particles and Nuclei Letters. — 2016. — Vol. 13, No. 5. — P. 609–612.

Beloglazov A., Buyya R. OpenStack Neat: A Framework for Dynamic and Energy-Efficient Consolidation of Virtual Machines in OpenStack Clouds // Concurrency and Computation: Practice and Experience (CCPE). — 2015. — Vol. 27, No. 5. — P. 1310–1333.

McNab A., Stagni F., Luzzi C. LHCb experience with running jobs in virtual machines // J. Phys.: Conf. Ser. — 2015. — Vol. 664.

Computing Center of the Institute of High Energy Physics (IHEP-CC). VCondor: virtual computing resource pool manager based on HTCondor [online]. URL: https://github.com/hep-gnu/VCondor (accessed: 08.11.2016)

McNab A., Love P., MacMahon E. Managing virtual machines with Vac and Vcycle // J. Phys.: Conf. Ser. — 2015. — Vol. 664.