Analysis of the Cluster Operating Time With the Migration of Virtual Machines? Vladimir Bogatyrev1,2[0000−0003−0213−0223] and Aleksey Derkach2[0000−0002−0108−319X] 1 Saint-Petersburg State University of Aerospace Instrumentation, Saint-Petersburg, Russia vladimir.bogatyrev@gmail.com http://new.guap.ru 2 ITMO University, Saint-Petersburg, Russia alexitmo1@gmail.com http://www.ifmo.ru/ru/ Abstract. A Markov model of reliability of a fault-tolerant cluster has been considered, using virtualization technologies that ensure the conti- nuity of the computational process in the event of a failure of the servers‘ physical resources and the impossibility of recovering from the interrup- tion of the computational process. The probability of maintaining the system’s operability under the condition of ensuring the continuity of the computing process for different service organization options was anilized. The mean time time to failure of such sistems was found. The purpose of the work is to increase the functional reliability of computing systems of a cluster architecture while increasing the time to failure, taking into account the requirements for ensuring the continuity of the computing process. A fault tolerance is considered as an object of study. A vir- tual machine is running on the cluster. The system involves launching a shadow copy of the VM on the backup server, which al-lows after the fail- ure of the primary server to continue its implementation on the backup server.The proposed models can be used to assess the level of system re- liability and are important in choosing a system configuration for certain conditions. Assessing the migration of virtual machines in the event of a failure of physical servers will allow you to calculate and evaluate the possible damage when using various models. Keywords: Virtualization · Cluster · Migration · Virtual machines · Mean time to failure. 1 Introduction For cluster computing systems, especially real-time, the key is to ensure reli- ability and fault tolerance while maintaining the continuity of the computing ? Copyright c 2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0) 2 V. Bogatyrev et al. process. The achievement of high and stable performance indicators, reliability, fault tolerance [1-3] and security [4] of computer systems is facilitated by the use of technologies for consolidation of clustering and virtualization resources [5-6], accompanied by replication and migration of virtual machines between physical servers. Migration and replication of virtual machines speeds up the reconfigu- ration process after failures of physical resources and contributes to supporting the continuity of the computing process required for managing cyber-physical systems and real-time technological processes. 2 Cluster fault tolerance technology with continuous computing One of the effective ways to achieve fault tolerance of computing systems and pro- cesses is the migration of virtual resources between the physical nodes (servers) of a computing system of a cluster architecture. In a cluster with replication of virtual machines (VMs) on different physical nodes, they can migrate be- tween cluster nodes in the event of failure of physical resources without stopping calculations on servers [7-8, 17-19]. Virtualization allows to optimize the use of computing resources, increases the scalability, fault tolerance and extensibility of the infrastructure, due to the rapid redistribution of the virtual resource [9-10]. Recovery time during VM migration after failures depends on the structure of the data storage.With shared storage for all physical nodes of the cluster, only RAM, virtual processor registers, and VM virtual device states are transferred during migration [5-6]. Information is transferred from hard disks in case of the data storage is localized for each node of the cluster. Fault tolerance ensures the continuity of the computing process (service) in the cluster after the failure of one of the physical servers with the support of two copies of the VM, which, in RAM, are located on different physical servers, so that in case of failure of one of them, continue to work on the second. During the functioning of the VM on the main servers, the backup copy must support the actual copy of the RAM [12-14] of the active VMs. In this case, the virtual disk images of the VM should be stored on a dedicated or distributed data storage with synchronous data replication. VMware Fault Tolerance, Kemari for Xen and KVM [11, 12] software products support fault tolerance technology. The purpose of the work is to increase the functional reliability of computing systems of a cluster architecture while increasing the time to failure, taking into account the requirements for ensuring the continuity of the computing process. By functional reliability, we mean the ability of systems to perform the re- quired functions, taking into account not only the operability of the resources required for their implementation, but also ensuring the necessary conditions for their implementation. Requirements to ensure the continuity of the computa- tional process in the inadmissibility of interruptions of the reservation system at the time of recovery are proposed as conditions of operation. Thus, in the systems under consideration, recovery is possible only if it is combined with the Title Suppressed Due to Excessive Length 3 implementation of the required functions by non-failed nodes. The system enters a non-recoverable state if it is impossible to reconfigure with the activation of the required number of operability resources. For such systems, the reliability indicator is the time to failure, including taking into account violations of the continuity of the computing process, provided that the permissible reconfigura- tion time, including the costs of migration of virtual machines, is exceeded. 3 Cluster organization and options for its recovery The cluster architecture computer system contains servers (Fig.1). Each server is connected directly to one local storage device (local server storage device). In the system to ensure automatic reconfiguration, aimed at supporting the continuity of computing processes based on dynamic migration, pairs of physical servers of the primary and secondary are allocated in the cluster. The main server performs the required tasks critical to the continuity of the computing process. The backup server is designed to perform dynamic reconfiguration with ensuring the continuity of the computing process in case of possible failures of the primary server. The backup server, in addition to implementing dynamic system reconfiguration, performs some background tasks that are not critical to the continuity of the computational process and to the time of query execution). If the backup server fails, the background tasks that it performs may be lost or redistributed to the main server if they are performed non-priority in the background. With the simplest implementation of a fault tolerance cluster, it is equipped with a pair of servers, one of which is designated as the main, and the second as the backup. Fault tolerance technology involves launching a backup copy of the primary server VM on the backup server and transferring the calculations to the backup server in case of primary server or storage device failure. Consider system options that provide (option A) and do not provide (option B) the restoration of physical nodes for states in which the continuity of the computing process is ensured during reconfiguration of the cluster, which allows us to select a working server and associated storage device for the implementation of the calculations. For the options under consideration, in the event of a transition to a failure state with the impossibility of implementing the required functions at least with a minimal workable configuration, it is considered that the computational process is interrupted for a time exceeding the maximum permissible value, which entails a transition to a state of unrecoverable failure. Let us consider cluster systems while ensuring fault-tolerant functioning with pairwise integration of physical servers into duplicated systems supporting the processes of virtual machine migration and data replication. For each pair of pairs in the cluster interacting to support dynamic reconfiguration of the servers (duplicated system), state and transition diagrams for a variant of organization A and B of a duplicated cluster system with recovery disciplines are shown in 4 V. Bogatyrev et al. Fig. 2 and 3 . The diagram shows the failure and recovery rates of the server λ0 and µ0 ; disk λ1 , µ1 ; commutator λ2 , µ2 . The actual data replica is loaded onto the recovered disk (synchronization of the distributed storage system) with an intensity of 3. The VM startup time on the backup server and the user application loading on it are negligibly small in comparison with the loading of the current data replica, there-fore, in this study, an instant switch between servers is assumed. The system of differential equations in accordance with the state diagram and transitions in Fig. 2 and have the form: P00 (t) = −(2λ0 + λ2 + 2λ1 )P0 (t) + µ3 P4 (t), P10 (t) = −(λ1 + λ0 + µ0 )P1 (t) + 2λ0 P0 (t) + λ0 P4 (t), P20 (t) = −(λ1 + λ0 + µ2 )P2 (t) + λ2 P0 (t) + λ2 P4 (t), P30 (t) = −(λ1 + λ0 + µ1 )P3 (t) + λ1 P4 (t) + 2λ1 P0 (t), P40 (t) = −(2λ0 + λ2 + 2λ1 + µ3 )P4 (t) + µ1 P3 (t) + µ0 P1 (t) + µ2 P2 (t), P50 (t) = −(λ1 + λ0 )(P1 (t) + P2 (t) + P3 (t) + P4 (t)). For option B: P00 (t) = −(2λ0 + λ2 + 2λ1 )P0 (t), P10 (t) = −(λ1 + λ0 )P1 (t) + 2λ0 P0 (t), P20 (t) = −(λ1 + λ0 )P2 (t) + λ2 P0 (t), P30 (t) = −(λ1 + λ0 )P3 (t) + 2λ1 P0 (t), P40 (t) = −(λ1 + λ0 )(P1 (t) + P2 (t) + P3 (t)). 4 Calculation of the probability of operability of duplicated systems The presented systems of differential equations make it possible to determine the dependence of the probabilities of all states from time. The probability of the system working while maintaining the continuity of the computing process for option A and B is defined as: 4 X P (t) = Pi (t), i=0 and for option B is defined as: 3 X P (t) = Pi (t). i=0 Title Suppressed Due to Excessive Length 5 Fig. 1. Cluster model. 6 V. Bogatyrev et al. Fig. 2. State and transition graph of a duplicated system with ensuring the continuity of the computing process for organization option A. Title Suppressed Due to Excessive Length 7 Fig. 3. State and transition graph of a duplicated system with ensuring the continuity of the computing process for organization option B. 8 V. Bogatyrev et al. Fig. 4. The probability of maintaining the system’s operability under the condition of ensuring the continuity of the computing process for ser-vice organization options A and B. The results of calculating the probability of duplicated computer systems‘ operability provided that the computing process is continuous for options A (the curve 1) and B (the curve 2) of the maintaining process‘ organization are presented in Fig. 3. The calculations were performed with the following failure rates λ0 = 1.115 · 10−5 (1/h) , λ1 = 3.425 · 10−6 (1/h), λ2 = 2.3 · 10−6 (1/h), and recovery rates µ0 = 0.33 (1/h) , µ1 = 0.17 (1/h) , µ2 = 0.33 (1/h) , µ3 = 1 (1/h). The presented dependences make it possible to evaluate the effect on the probability of maintaining the operability of a duplicated system, restrictions on the inadmissibility of interruption of the computational process, and the impact of restoration work while maintaining the possibility of continuity of the process of performing the required functions. 5 Calculation of the probability of operability of duplicated systems The mean time between failures and the probability of working without failures are related by the relation [15]: Z∞ T = P (t)dt. 0 Title Suppressed Due to Excessive Length 9 The average operating time to failure in accordance with the methodology of [15] is found as follows. The mean time to failure can be obtained by integrating the system of differential equations for a model with an absorbing state, the initial conditions P1 (0) = 1, . . . Pk (0) = 0, Pn (0) = 0 for a model with n states. For the systems under study, integrating the left and right sides of the sys- tems of equations (1), (2) for the models under consideration. Given that in the presence of an absorbing state, Pi (∞) = 0, we have [16]: −(2λ0 + λ2 + 2λ1 )T0 + µ3 T4 = −1, −(λ1 + λ0 + µ0 )T1 + 2λ0 T0 + λ0 T4 = 0, −(λ1 + λ0 + µ2 )T2 + λ2 T0 + λ2 T4 = 0, −(λ1 + λ0 + µ1 )T3 + λ1 T4 + 2λ1 T0 = 0, −(2λ0 + λ2 + 2λ1 + µ3 )T4 + µ1 T3 + µ0 T1 + µ2 T2 = 0, −(λ1 + λ0 )(T1 + T2 + T3 + T4 ) = 0. Thus, for the options for organizing system A, we have: For option B: −(2λ0 + λ2 + 2λ1 )T0 = −1, −(λ1 + λ0 )T1 + 2λ0 T0 = 0, −(λ1 + λ0 )T2 + λ2 T0 = 0, −(λ1 + λ0 )T3 + 2λ1 T0 = 0, −(λ1 + λ0 )(T1 + T2 + T3 ) = 0. Where Ti is the average time spent in working condition i when starting work from a operable state. Mean time to failure is determined by summing Ti , for all operational states [16]: X T = Ti . For the system under consideration, the time to failure with service discipline A : T1 = 3.891 ∗ 105 hours, and B T2 = 3, 277 ∗ 103 hours. 6 Conclusions The significance of the impact of ensuring the continuity of the computational process on duplicated systems of the cluster architecture is demonstrated. The result of the study was obtained on the basis of Markov models of reliability of a fault-tolerant cluster with the migration of virtual machines when it is impossible to recover after the interruption of the computational process. The time to the first failure of a duplicated system with recovery and without recovery in failure states of nodes that do not violate the continuity of the computational process to perform the required functional tasks critical to the continuity of the computing process is determined. 10 V. Bogatyrev et al. References 1. Kopetz H. Real-Time Systems: Design Principles for Distributed Embedded Appli- cations. Springer, 2011. 2. Sorin D. Fault Tolerant Computer Architecture. Morgan Claypool, Madison, 2009. 3. Dudin, A. N., Sun, B.: A multiserver MAP/PH/N system with controlled broad- casting by unreliable servers. Automatic Control and Computer Sciences V. 5, 32-44 (2009) 4. Zhmylev S., Martynchuk I. G., Kireev V. I., Aliev T.: Analytical methods of non- stationary processes modeling. CEUR Workshop Proceedings V. 2344 (2019) 5. Bogatyrev A. V., Bogatyrev V. A., and Bogatyrev S. V.: Multipath Redundant Transmission with Packet Segmentation. 2019 Wave Electronics and its Appli- cation in Information and Telecommunication Systems (WECONF) (2019) doi: 10.1109/WECONF.2019.8840643 6. Bogatyrev V. A., Bogatyrev S. V., and Bogatyrev A. V.: Model and Interaction Efficiency of Computer Nodes Based on Transfer Reservation at Mul-tipath Routing. 2019 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF) (2019) doi: 10.1109/WECONF.2019.8840647 7. Jin H., Li D., Wu S., Shi X., Pan X.: Live virtual machine migration with adap- tive memory compression. Proc. IEEE International Conf. on Cluster Computing (CLUSTER ’09) Art. 5289170 (2009) doi: 10.1109/CLUSTR.2009.5289170 8. Sahni S., Varma V.: A hybrid approach to live migration of virtual machines. Proc. IEEE Int. Conf. on Cloud Computing for Emerging Markets (CCEM 2012), 12-16 (2012) doi: 10.1109/CCEM.2012.6354587 9. Poymanova E. D. Tatarnikova T. M.: Models and Methods for Studying Network Traffic // 2018 Wave Electronics and its Application in Information and Telecom- munication Systems (WECONF) (2018) doi: 10.1109/WECONF.2018.8604470 10. Kutuzov O., Tatarnikova T., : On the Acceleration of Simulation Modeling. In 2019 XXII International Conference on Soft Computing and Measurements (SCM) doi: 10.1109/SCM.2019.8903785 (2019) 11. Knowledge sharing portal UNIX/Linux-systems, open source systems, networks, and other related things, http://xgu.ru/wiki/Kemari Last accessed 15 Sep 2019 12. Elizarov E Dell Live Volume: virtualize disk space, http://blog.korphome.ru/2016/06/28/dell-live-volume Last accessed 15 Sep 2019 13. Bogatyrev V. A., Aleksankov S. M., Derkach A. N.: Model of Cluster Reliabil- ity with Migration of Virtual Machines and Restoration on Certain Level of Sys- tem Degradation //2018 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF-2018) 92018) 14. Astakhova T., Shamin A., Verzun N., Kolbanev M. A. Astakhova T., Shamin A., Verzun N., Kolbanev M.: Proceedings of the 10th Majorov International Conference on Software Engineering and Computer Systems. CEUR Workshop Proceedings MICSECS 2018 (2019) 15. Victorova V. S., Stepanjanc A. C.: About reliability indicators of the average op- erating time type. Reliability 4(51), 27-36 (2014) 16. Victorova V. S., Stepanjanc A. C.: Models and methods for calculating the relia- bility of technical systems. 2nd edn. URSS LLC Lenand, Moscow (2016) 17. Bogatyrev V. A., Parshutina S. A. : Redundant Distribution of Requests Through the Network by Transferring Them Over Multiple Paths. Communications in Com- puter and Information Science, 601, 199-207 (2016) Title Suppressed Due to Excessive Length 11 18. Zakoldaev D. A., Shukalov A. V., Zharinov I. O., Zharinov O. O.: Workstations Industry 4.0 for instrument engineering products. IOP Conference Series: Materials Science and Engineering, 1 (665), pp. 012014 (2019) 19. Korobeinikov A. G., Fedosovsky M. E., Zharinov I. O., Polyakov V. I., Shukalov A. V., Gurjanov A. V., Arustamov S. A.: Method for Conceptual Presentation of Subject Tasks in Knowledge Engineering for Computer-Aided Design Systems. Advances in Intelligent Systems and Computing, V. 680, 50-56 (2018)