
Cluster fault tolerance model with migration of virtual machines

Andrii V. Riabko

Tetiana A. Vakaliuk

tetianavakaliuk@acnsci.org 0 2 3 5

Oksana V. Zaika

ksuwazaika@gmail.com 4

Roman P. Kukharchuk

Valerii V. Kontsedailo

valerakontsedailo@gmail.com 1

0 Academy of Cognitive and Natural Sciences, 54 Gagarin Ave., Kryvyi Rih, 50086, Ukraine

1 Inner Circle, Nieuwendijk 40, 1012 MB Amsterdam, Netherlands

2 Institute for Digitalisation of Education of the NAES of Ukraine, 9 M. Berlynskoho Str., Kyiv, 04060, Ukraine

3 Kryvyi Rih State Pedagogical University, 54 Gagarin Ave., Kryvyi Rih, 50086, Ukraine

4 Oleksandr Dovzhenko Hlukhiv National Pedagogical University, 24 Kyivska Str., Hlukhiv, 41400, Ukraine

5 Zhytomyr Polytechnic State University, 103 Chudnivsyka Str., Zhytomyr, 10005, Ukraine


High reliability, fault tolerance, and continuity of the computing process in computer systems are achieved by combining computing resources into clusters. This relies on virtualization, which allows virtual resources, services, or applications to be moved between physical servers while maintaining the continuity of computing processes. The object of study is a failover cluster that, in the simplest case, consists of two physical servers (primary and backup) connected through a switch. Each server has a local hard disk. A distributed storage system with synchronous data replication from the primary server to the backup server is deployed on the local disks of the servers. A virtual machine runs on the cluster. A shadow copy of the virtual machine is launched on the backup server so that, if the primary server fails, the computing process can continue on the virtual machine of the backup server. The non-stationary availability factor is taken as the reliability indicator. A Markov model of the reliability of a failover cluster is proposed that takes into account the cost of migrating virtual machines as well as the mechanisms that ensure the continuity of the computing process (service) in the cluster when one physical server fails. As a result of memory migration, two copies of the virtual machine are maintained on different physical servers, so that if one of them fails, work continues on the other. A simplified model of a failover cluster is also built; it neglects the cost of migrating virtual machines when restoring the cluster and gives an upper estimate of reliability. The virtual machine migration process is shown to have a significant impact on the reliability of a failover cluster (estimated by the non-stationary availability factor). The results obtained can be used to justify the choice of technology for ensuring fault tolerance and continuity of the computing process in computer systems with a cluster architecture.

Keywords: virtual machine, virtualization, reliability, fault tolerance, redundancy, clusters, non-stationary availability

1. Introduction

Fault tolerance is a property of a system that allows it to continue to act correctly in the event of an error or failure in some of its parts. Modern systems for processing, storing, and transmitting data for various purposes, including cyber-physical and communication systems, are subject to high requirements for reliability, security, fault tolerance, and low cost of implementation and operation [ 1, 2 ]. The requirements for computer systems largely depend on the applications they perform, their criticality to delays and continuity of service, the features of operation, and their complexity. High reliability, fault tolerance, and readiness of computer systems for critical applications are achieved by consolidating processing and storage resources based on clustering technology, dynamic distribution of requests, and virtualization. In a clustered system with virtualization, in the event of failure or disconnection of physical servers for maintenance or other work, operability is ensured by moving virtual resources, services, or applications between physical servers while maintaining the continuity of computing processes.

Modern virtualization technologies are based on the targeted migration of virtual resources between physical servers to adapt cluster systems to the accumulation of physical server failures [3]. When virtual machines (VMs) are migrated, the cluster nodes can share a data storage that holds the virtual disks of the virtual machines. This speeds up migration, because only the main memory, the virtual processor registers, and the virtual device state of the virtual machines have to be transferred. At the same time, the majority of edge devices, including UAVs (Unmanned Aerial Vehicles), tablets, and cellular phones, are mobile in nature [4]. Therefore, the configuration of the cluster must be flexible enough to adapt dynamically to the evolving network topology of the edge cluster, minimizing the overall communication delay incurred by the edge devices in processing the data received from IoT devices [5, 6]. In a cluster without shared storage, the migration also moves the contents of the virtual disks of the virtual machines, which can be large and therefore slows down the migration process. Moving virtual resources can be slowed down further when the transfer goes across the network. The process of dynamic movement can be divided into two stages: transferring data (virtual machine registers, RAM, disks) to a backup server and activating the virtual machines on it.

The virtualization technology aimed at ensuring high reliability of computer systems includes High Availability Cluster and Fault Tolerance technologies, the first of which supports automatic restart of the virtual machine on healthy cluster nodes, and the second – the continuity of the computing process when it is moved to the virtual cluster servers that have retained performance. High Availability Cluster technology allows you to automatically move a virtual machine from a failed server to a healthy server. Restoring the functionality of the virtual machine may take several minutes, depending on the configuration and loading of the physical server and the properties of user programs. With this technology, to automatically restart the virtual machine, all data must be stored on a shared data storage, which can be implemented as a device connected to all cluster nodes, or a distributed storage system. After any physical server fails, other servers can run virtual machines using virtual disks located on shared storage. In this case, the status of the virtual machine is lost, including data in RAM, registers of virtual processors, and external devices. Therefore, the system takes time to initialize the virtual machine and bring it to a pre-failure state.

For the correct operation of this virtualization mechanism, it is necessary to ensure the isolation of physical servers after a failure to exclude the simultaneous execution of the computing process by two virtual machines after a reboot to prevent data ambiguity in the shared storage.

High Availability Cluster technology assumes that after the failure of any physical server, the virtual machines running on it are automatically distributed among the surviving nodes and restarted on them. The RAM state of all virtual machines that were on the failed node is lost. Fault Tolerance technology ensures the continuity of the computing process (service) in the cluster after the failure of one physical server by maintaining two copies of the virtual machine in RAM on different physical servers, so that if one of them fails, work continues on the other. To organize the computing process while a virtual machine operates on one of the servers, the other server must maintain an up-to-date copy of the RAM of the active virtual machine. In this case, the virtual disk images of the virtual machine must be stored in dedicated or distributed storage with synchronous data replication. Software products that support fault tolerance technology include VMware Fault Tolerance, Kemari for Xen, and KVM.

These virtualization mechanisms affect the reliability of a cluster system, which must be taken into account when substantiating the structure of the system and when organizing the computing processes and the disciplines for restoring and maintaining highly reliable cluster systems. The choice of design solutions for building highly reliable cluster systems should be justified by modeling that assesses the reliability, availability, fault tolerance, and performance of the implementations.

The purpose of this article is to build models of cluster systems that make it possible to assess the impact of the virtualization process on their reliability. The models considered are aimed at substantiating the choice of the structure and of the discipline for servicing and restoring a cluster, taking into account the requirements of the applied tasks and the virtualization mechanisms.

2. Theoretical background

In the era of cloud computing [7], fault tolerance is a crucial technology that enables non-stop, long-running services to achieve high availability. This is typically accomplished through virtualization technology [8]. Fault tolerance is a critical requirement for achieving high performance in cloud computing [9]. Virtualization is a widely used strategy, especially in cloud computing, for making better use of existing computing resources. Nevertheless, ensuring the stability and reliability of virtualization has become a significant research subject [10]. According to Xu et al. [2], fault tolerance has a significant impact on the performance criteria of virtual machine scheduling. With the growing demand for cloud computing infrastructure, availability and reliability have become increasingly crucial as major features of real-time computing systems [11].

Virtualization has become the foundation of cloud computing, enabling the deployment of virtual machines for data dissemination and administration. In modern applications, data is often stored using polyglot persistence, which combines SQL and NoSQL data stores. However, since these services are customized for specific storage requirements, it may be necessary to aggregate them from several heterogeneous clouds or migrate data from one cloud to another. Data migration can be performed offline when the database is independent of the application; otherwise, the application must be taken offline during the migration process [12].

Cloud computing is a groundbreaking model that provides internet-based access to physical and application resources. These resources are virtualized and offered to users as a service through virtualization software. Nevertheless, virtual machine (VM) migration can adversely affect cloud performance, which makes it a major concern. The uneven distribution of VMs during resource allocation and their frequent movement from one server to another can lead to increased energy consumption and network overhead [13, 14].

Cloud computing is now extensively used for both personal and professional purposes [15]. Nevertheless, the widespread adoption and growth of cloud computing resources have raised concerns about cloud service reliability and high energy consumption. In cloud computing, the primary challenges include ensuring data availability, backup replication, data efficiency, and reliability, as failures are frequently encountered during execution. Therefore, a fault tolerance technique is needed that ensures reliability and availability while reducing energy consumption in the cloud. Currently, there are two primary fault tolerance techniques: proactive and reactive [16]. To alleviate the resource burden on specific servers, one or more suitable virtual machines (VMs) must be selected for migration. Sivagami and Easwarakumar [17] introduce an approach called Dynamic Fault Tolerant VM Migration that enforces reliability in cloud data center infrastructure through an advanced recovery mechanism for Virtual Network demand.

Placing virtual machines in highly reliable cloud applications is a challenging and crucial concern. To address it, the K-means clustering algorithm is utilized, and adaptive particle swarm optimization combined with the coyote optimization algorithm is employed to obtain the optimal cluster for virtual machine placement [18, 1]. Zhang et al. [19] establish a model of initial placement for fault-tolerant virtual machines in star topological data centers of cloud systems, taking into account several factors such as the violation rate of service-level agreements, the remaining rate of resources, the rate of power consumption, the rate of failure, and the cost of fault tolerance. Fang et al. [20] developed a multi-factor real-time monitoring fault tolerance (MRMFT) model based on a GPU cluster to facilitate large-scale data processing.

Simultaneously, the continuously increasing demand for cloud resources results in service unavailability, which poses critical challenges such as cloud outages, violations of service-level agreements, and excessive power consumption [ 21 ]. Abdulhamid et al. [ 22 ] suggested a dynamic clustering league championship algorithm (DCLCA) scheduling technique that prioritizes fault tolerance awareness to tackle cloud task execution. This approach considers the currently available resources and minimizes the occurrence of untimely failure of autonomous tasks [ 22 ]. The growth of cloud usage has presented various challenges, including high energy consumption in Cloud Data Centers, security risks to Virtual Machines (VMs) due to co-residency with other risky VMs on the same Physical Machine, and Quality of Service (QoS) degradation caused by resource sharing. To address these issues, researchers have utilized Dynamic VM Consolidation to reduce energy consumption while minimizing QoS degradation. However, there are security concerns during data transmission when migrating VMs in a cloud environment. To solve this problem, Mangalagowri and Venkataraman [ 23 ] propose a Capability and Access Control (CAC) service scheme based on Software Defined Networks (SDN). In cloud data centers, virtual machine replication is useful for achieving fault tolerance, load balancing, and rapid response to user requests [ 24 ].

3. Research methods

Summarizing the considered studies, it should be noted that the theory of reliability studies the patterns of failures of technical objects (which, in particular, include information, computer systems, and networks), methods, and models of reliability analysis and ensuring their stable operation under failure conditions. Reliability is understood as the property of an object to maintain the ability to perform the necessary functions over time under given modes and conditions of use, maintenance, storage, and transportation. In other words, the reliability of an object is its ability to do what is needed in time.

Information, computing, and info-communication systems and networks have the following features. First, the need to take into account the impact of processing, storage, and transmission processes on the ability to perform the necessary functions. These processes create delays that can lead to the failure of functions in the required period and, as a result, failures in the implementation of the necessary functions.

Secondly, the need to take into account the impact of software operation on the reliability of computer systems, since software failures have certain specifics compared with failures in traditional technical systems. This specificity is due to errors in algorithms or programs that manifest themselves during the functioning of the system: errors that were not detected during testing, or that arise because some rare events, potentially possible during the operation of information computer systems as software and hardware complexes, were not taken into account.

Thirdly, a certain dependence of the reliability of the information system on its information security. A violation of information protection can manifest itself in the deterioration of working conditions, an increase in load, or an integrity violation, in particular the loss or distortion of information, which can lead to failure to perform the necessary system functions, to their erroneous performance, or to delays that exceed the permissible time. Functioning under a security breach can, in particular, manifest itself in the initialization of processes not provided for during normal operation. In addition to failure to perform the necessary functions and violation of the stationarity of operating modes, this can lead to an increase in load, overheating of processors and, ultimately, an increase in the failure rate.

One of the main components of reliability is fault tolerance. Fault tolerance is the ability of a system to keep functioning in case of failures. The potential for maintaining the fault tolerance of the system depends on the types, number, combinations, and location of failures. Computing systems must ensure the operability (reliability) not only of their structure as a set of hardware and software resources (including redundant ones) but also of the computing process, in particular if its continuity has to be maintained in the face of failures, faults, and external destructive effects of a random or malicious nature. A feature of an information computer system is the need to consider it not only as a general technical object with requirements for structural and parametric reliability but also as an object that implements information and computing processes with the requirement of functional reliability.

A redundant system has many operable states. Among them, one initial state can be distinguished, characterized by the operability of all elements of the system and, accordingly, by the best quality (efficiency) of functioning. As failures accumulate in a fault-tolerant system, its efficiency and its potential for ensuring reliability usually degrade.

The operational state, in which the current values of the parameters are at such a level that the failure of one element can lead to the failure of the system, is called the pre-failure state. In the sequence of states of a redundant system, between the initial state and the state before failure, there are usually one or more intermediate states. The number of failures of the elements that bring the system from the initial state to the pre-failure state characterizes the redundancy of the system and its resistance to failure. In the general case, systems have a complex combinatorial dependence of the number of failures sustained by the system during its degradation on the relative position of the failed elements.

Fault tolerance indicators should reflect the dynamics of maintaining efficiency in the event of one, two, or more element failures. Deterministic and probabilistic fault tolerance indicators are used. The deterministic fault tolerance indicators are: 1) $d$, the maximum number of element failures under which the system's operability is guaranteed; 2) $m$, the maximum number of element failures at which it is still possible to maintain the system's operability. The indicator $d$ (called $d$-reliability) corresponds to the minimum number of failures under the most unfavorable combination of element failures:

$$d = \min_i d_i, \tag{1}$$

where $d_i$ is the number of elements that failed during the transition from the initial (fully operational) state to the pre-failure state along the $i$-th path. Similarly,

$$m = \max_i m_i, \tag{2}$$

where $m_i$ is the number of element failures during the transition to the state preceding the failure along the $i$-th path (each path can have several states preceding the failure).
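The two deterministic indicators can be illustrated with a small brute-force sketch. It is not taken from the study itself: it assumes that the system is described by a hypothetical predicate `works(failed)`, and it reads the first indicator as the largest number of failures the system survives for every combination of failed elements, and the second as the largest number it survives for at least one combination. The element names and the example (a cluster pair with two servers and two disks) are illustrative assumptions.

```python
from itertools import combinations


def tolerance_indicators(elements, works):
    """Return (guaranteed, possible) numbers of tolerated element failures.

    guaranteed: largest k such that the system works for EVERY set of k failures.
    possible:   largest k such that the system works for AT LEAST ONE set of k failures.
    """
    guaranteed = 0
    possible = 0
    for k in range(1, len(elements) + 1):
        outcomes = [works(set(failed)) for failed in combinations(elements, k)]
        # The guarantee must also hold for every smaller k, so extend it step by step.
        if guaranteed == k - 1 and all(outcomes):
            guaranteed = k
        if any(outcomes):
            possible = k
    return guaranteed, possible


# Hypothetical cluster pair: operable while at least one server and one disk survive.
SERVERS = {"server_a", "server_b"}
DISKS = {"disk_a", "disk_b"}


def pair_works(failed):
    return bool(SERVERS - failed) and bool(DISKS - failed)


d, m = tolerance_indicators(sorted(SERVERS | DISKS), pair_works)
print(d, m)
```

In this example any single failure is tolerated, while two failures are survived only when they hit one server and one disk, which is why the guaranteed and the possible numbers of tolerated failures differ.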

A cluster is understood as a group of interconnected resources (servers, information storage devices, etc.) that is perceived by the user (query source) as a single resource. Clusters are created to achieve high availability, fault tolerance, and system performance based on the consolidation of resources. They can be built from resources of the same type or of different types (in terms of parameters or functionality). In the first case, the cluster is homogeneous, and in the second, heterogeneous. The joint work of cluster nodes is coordinated by exchanging messages through a high-speed dedicated line or through a local network. Clusters are divided into server systems without disk sharing and server systems with disk sharing.

An example of clusters when combining two servers is shown in figure 1.

When clustering a group of servers, there are options for organizing clusters with diferent redundancy. Options for combining servers and storage clusters are shown in figure 2.

In a fully permissive topology (figure 2, a), each storage device (disk array) is connected to only one cluster server. For the topology under consideration, while maintaining the fault tolerance of the configuration after node failure, failure is possible when executing functional queries due to loss of calculation results. The organization of the computational process without losing the functional requests that were executed at the time of failure is, in principle, possible for this topology, but it is associated with a significant slowdown of the computational process when organizing periodic saving of intermediate results via the local network in other nodes.

The N+1 topology (figure 2, b) means that each storage device (disk array) is connected to two cluster nodes, with one redundant server connected to all storage devices. It is used to organize high-availability clusters if one node can be allocated for redundancy. This topology reduces the load on active nodes and ensures that the load of a failed node can be taken over by the standby node without loss of quality. It maintains fault tolerance against the failure of any of the primary nodes with only a single redundant node. In the cluster pair topology (figure 2, c), nodes are grouped in pairs, storage devices are attached to both nodes of the pair, and each node has access to all storage devices (disk arrays) of the pair. Thus, fault tolerance is maintained within cluster pairs. In a full-access topology (figure 2, d), servers and storage devices are connected through switches, so the system can be expanded by adding servers and storage devices to the cluster without changing existing connections. This topology provides fault tolerance for all cluster resources, which is achieved by redistributing the tasks of failed nodes among healthy nodes.

The probability of failure-free operation of the structures depicted in figure 2, with the same number $n$ of servers and storage devices, provided that at least one server and its associated storage device must work in the system, is respectively found as

$$
\begin{cases}
P_1(t) = 1 - \bigl(1 - p_1(t)\,p_2(t)\bigr)^{n},\\
P_2(t) = p_1(t)\,\bigl(1 - p_2(t)\bigr)^{n} + \bigl(1 - p_1(t)\bigr)\Bigl(1 - \bigl(1 - p_1(t)\,p_2(t)\bigr)^{n-1}\Bigr),\\
P_3(t) = \Bigl(1 - \bigl(1 - p_1(t)\bigr)^{2}\bigl(1 - p_2(t)\bigr)^{2}\Bigr)^{n/2},\\
P_4(t) = 1 - \bigl(1 - p_1(t)\bigr)^{n}\bigl(1 - p_2(t)\bigr)^{n}\bigl(1 - p_3(t)\bigr)^{m},
\end{cases} \tag{3}
$$

where $p_1(t)$, $p_2(t)$, and $p_3(t)$ are the probabilities of failure-free operation of servers, storage devices, and switches, and $m$ is the number of switches.

Uninterrupted operation of the infrastructure is possible only if there is an exact copy of the existing server running the same processes and services. If a replica is created only after a hardware failure, creating it takes time, which leads to downtime and interruptions in the provision of services.

Fault tolerance is implemented in hardware and in software. The hardware approach duplicates the host: all components of the system are duplicated, and the calculations are performed on both copies simultaneously. Synchronization is ensured by a special node. The software approach is used more often but has several limitations. For example, its deployment requires a suitable processor, communication between the individual virtual machines, etc.

The software approach to deploying a cluster is considered in our study.

4. Results and discussions

Consider a highly reliable cluster implemented with virtualization technology focused on maintaining the continuity of the service (computing process). A failover cluster in the simplest case consists of two physical servers (primary and backup) with high-speed network interfaces (figure 3). Each server has one local hard disk drive (HDD) connected via SATA or SAS interface. Both servers have a hypervisor, clustering software, and virtualization management installed on the HDD. A distributed storage system with synchronous data replication from the source server to the backup server is deployed on the local disks of the servers. The cluster is running a virtual machine in failover mode.

The system launches a shadow copy of the virtual machine on the backup server, which allows the computing process to continue without interruption on the virtual machine of the backup server after the primary server fails. Maintaining the continuity of the computing process during automatic recovery after a failure (reconfiguration) requires constant synchronization of RAM and disk data. This can be supported by high-speed network adapters and layer-2 switches (for example, 10G Ethernet or InfiniBand) and by either a distributed storage system on the servers that supports synchronous replication of disk data from the primary to the backup server, or a separate server acting as an external storage system.

Let us consider the case in which system resources lost as a result of failures are restored immediately after a failure. This assumes that failures are detected instantaneously by control devices and that personnel are ready to carry out repair work.

For fault-tolerant cluster systems, we take the non-stationary availability factor, that is, the non-stationary availability function $K(t)$, as the reliability indicator. The non-stationary availability factor is the probability that the system is ready to perform the necessary functions (is operable) at a given moment. It characterizes the readiness of the object to perform the necessary function at an arbitrary time $t$ that is close enough to the moment of a fixed change in the state of the system (the start of operation, or the end of preventive maintenance, testing, reconfiguration, or recovery). The non-stationary availability factor is applied when the stationary mode, in which the state probabilities do not depend on time, has not yet been established. In general, these indicators depend on the failure and restoration rates of the system elements, the time of continuous operation, and the type and degree of redundancy.

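$$K(t) \cong \frac{M}{\Lambda + M} + \frac{\Lambda}{\Lambda + M}\,\exp\bigl[-t\,(\Lambda + M)\bigr], \tag{4}$$

where $K(t) = P_0(t)$. The stationary availability factor is

$$K = \lim_{t \to \infty} K(t) = \frac{M}{\Lambda + M}, \tag{5}$$

where $n$ is the number of elements of the non-redundant system, $\lambda_i$ and $\mu_i$ are the failure and restoration rates of the element of the $i$-th type, $i = 1, 2, \ldots, n$, and $\Lambda = \sum_{i=1}^{n} \lambda_i$ is the system failure rate. The system restoration rate is

$$M = \frac{\sum_{i=1}^{n} \lambda_i}{\sum_{i=1}^{n} \lambda_i / \mu_i}. \tag{6}$$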

The above dependencies indicate that the lower the ratio of the system failure rate to the system restoration rate, the higher the availability factor and the availability function.
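A minimal numerical sketch of these relations follows. It assumes the two-state approximation in which the availability function is $K(t) \approx \frac{M}{\Lambda + M} + \frac{\Lambda}{\Lambda + M} e^{-(\Lambda + M)t}$, with the system failure rate $\Lambda = \sum \lambda_i$ and the system restoration rate $M = \Lambda / \sum (\lambda_i / \mu_i)$. The function names are illustrative. The rates are the server, disk, and switch values quoted later in the article, combined here as a simple non-redundant system purely for illustration; this is not the cluster model of the study.

```python
import math


def system_rates(failure_rates, restoration_rates):
    """Aggregate element rates (1/h) into system failure and restoration rates."""
    total_failure = sum(failure_rates)
    weighted_downtime = sum(l / m for l, m in zip(failure_rates, restoration_rates))
    return total_failure, total_failure / weighted_downtime


def availability(t, failure_rate, restoration_rate):
    """Non-stationary availability of a two-state recoverable system at time t (h)."""
    total = failure_rate + restoration_rate
    stationary = restoration_rate / total
    return stationary + (failure_rate / total) * math.exp(-total * t)


# Server, disk, and switch rates (1/h), treated as a series system for illustration.
lam, mu = system_rates([1.115e-5, 3.425e-6, 2.3e-6], [0.33, 0.17, 0.33])

for hours in (0.0, 1.0, 10.0, 100.0):
    print(f"t = {hours:6.1f} h  K(t) = {availability(hours, lam, mu):.9f}")
print(f"stationary K = {mu / (lam + mu):.9f}")
```

The availability starts at 1 and decays toward the stationary value at a rate set by the sum of the two system rates, so a smaller failure-to-restoration ratio gives both a higher curve and a higher limit.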

Dynamic models are used to calculate the fault tolerance characteristics of complex systems. If the behavior of the system can be described by a Markov process, the mathematical model of the reliability of such a system is a system of differential equations. When recoverable systems are studied under Poisson failure and restoration flows (the failure intensity $\lambda$ and the restoration intensity $\mu$ are constants), the mathematical model is a system of ordinary differential equations with constant coefficients, which can be solved analytically or numerically.

Consider the case when the failure rate $\lambda(t)$ is a function of time. Figure 4 shows the Markov graph of a recoverable element, the mathematical model of which is a system of differential equations with time-dependent coefficients.

When Markov fault tolerance models are built for systems consisting of several recoverable elements, the state space of the model grows. The system of differential equations with respect to $P_i(t)$ ($i = 1, 2, \ldots, n$) has the general form

$$
\begin{cases}
P_1'(t) = -P_1(t) \sum_{j} a_{1j}(t) + \sum_{j} P_j(t)\,a_{j1}(t),\\
\quad\vdots\\
P_i'(t) = -P_i(t) \sum_{j} a_{ij}(t) + \sum_{j} P_j(t)\,a_{ji}(t),\\
\quad\vdots\\
P_n'(t) = -P_n(t) \sum_{j} a_{nj}(t) + \sum_{j} P_j(t)\,a_{jn}(t),
\end{cases} \tag{8}
$$

where $a_{ij}(t)$ denotes the intensity of the transition from state $i$ to state $j$. The first sum on the right side of each equation contains the intensities of transitions out of the current state $i$, and the second sum contains the intensities of transitions into state $i$. Transitions corresponding to failures have time-dependent coefficients; transitions corresponding to the restoration of operability have constant coefficients.
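The general form of these equations can be sketched in code as follows. This is an illustrative sketch rather than the implementation used in the study: `rates[i][j]` is a hypothetical table of transition intensities from state `i` to state `j`, given either as a constant (restoration) or as a function of time (failure), and the linearly growing failure rate in the example is an assumption made only to show a time-dependent coefficient.

```python
def kolmogorov_rhs(t, p, rates):
    """Right-hand side of the Kolmogorov equations for the state probabilities p.

    rates[i][j] is the intensity of the transition i -> j: a number for a
    constant intensity, or a callable of t for a time-dependent one.
    """
    n = len(p)

    def intensity(i, j):
        value = rates[i][j]
        return value(t) if callable(value) else value

    derivative = []
    for i in range(n):
        # Probability flow leaving state i and flow arriving from all other states.
        outflow = p[i] * sum(intensity(i, j) for j in range(n) if j != i)
        inflow = sum(p[j] * intensity(j, i) for j in range(n) if j != i)
        derivative.append(inflow - outflow)
    return derivative


# Single recoverable element: state 0 = working, state 1 = under repair.
# Assumed failure rate grows linearly with time; the restoration rate is constant.
element_rates = [
    [0.0, lambda t: 1.0e-5 * (1.0 + 0.01 * t)],
    [0.33, 0.0],
]

print(kolmogorov_rhs(0.0, [1.0, 0.0], element_rates))
```

Because every outflow from one state is an inflow to another, the components of the returned vector sum to zero, which keeps the total probability equal to one during integration.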

In the general case, it is difficult to obtain an analytical solution to such a system of differential equations; therefore, it is advisable to use numerical methods. For example, Mathcad has a built-in function rkfixed, which is considered basic and implements the fourth-order Runge-Kutta method with a fixed step. This function is designed to solve systems of first-order differential equations

$$
\begin{cases}
y_1' = f_1(x, y_1, y_2, \ldots, y_n),\\
y_2' = f_2(x, y_1, y_2, \ldots, y_n),\\
\quad\vdots\\
y_n' = f_n(x, y_1, y_2, \ldots, y_n).
\end{cases}
$$

The function rkfixed(y, x1, x2, npoints, D) returns a matrix of 1 + npoints rows, in which the first column contains the points at which the solution is evaluated, and the other columns contain the solution and its first n − 1 derivatives.

The function arguments are: y is the vector of initial values (n elements); x1 and x2 are the limits of the interval on which the solution is sought; npoints is the number of points inside the interval (x1, x2) at which the solution is sought, chosen so as to obtain the desired accuracy of numerical integration; D is a vector of n elements containing the first derivatives of the desired functions.
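For readers without Mathcad, the same fixed-step fourth-order Runge-Kutta scheme can be sketched in Python. The function name `rk4_fixed` and its argument names are illustrative (they mirror the interface described above but are not the Mathcad function itself). The test problem is a single recoverable element with assumed constant rates, chosen because its availability has a known closed form against which the numerical result can be checked.

```python
import math


def rk4_fixed(y0, x1, x2, npoints, deriv):
    """Integrate y' = deriv(x, y) from x1 to x2 with npoints equal RK4 steps.

    Returns npoints + 1 rows of the form [x, y_1, ..., y_n].
    """
    h = (x2 - x1) / npoints
    x, y = x1, list(y0)
    rows = [[x] + y]
    for step in range(1, npoints + 1):
        k1 = deriv(x, y)
        k2 = deriv(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
        k3 = deriv(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
        k4 = deriv(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        x = x1 + step * h
        rows.append([x] + y)
    return rows


LAM, MU = 0.02, 0.5  # assumed failure and restoration rates, 1/h


def element(x, p):
    """Two-state model: p[0] = working, p[1] = under repair."""
    return [-LAM * p[0] + MU * p[1], LAM * p[0] - MU * p[1]]


solution = rk4_fixed([1.0, 0.0], 0.0, 20.0, 200, element)
t_end, k_numeric = solution[-1][0], solution[-1][1]
k_exact = MU / (LAM + MU) + LAM / (LAM + MU) * math.exp(-(LAM + MU) * t_end)
print(t_end, k_numeric, k_exact)
```

Column 0 of each row holds the time and column 1 the probability of the working state, so the second column read over all rows is the availability function on the chosen grid.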

As an example, consider the solution of a system of differential equations for finding the non-stationary availability factor of a duplicated system (figure 5). In the resulting solution matrix, column 0 corresponds to time, and the subsequent columns contain the state probabilities as functions of time. A plot of the non-stationary availability factor (availability function) versus time is also shown.

Let us build a Markov model of the reliability of a failover cluster with online recovery, taking into account the mechanisms for moving a virtual machine. The state and transition diagram of such a cluster is shown in figure 6. In the figure, the operable states of the cluster are indicated by vertices circled with a solid line, and the fully healthy states (without failed nodes) by vertices circled with a thick solid line. The "VM" mark at a vertex of the graph indicates the server on which the virtual machine with the virtual service is currently running. A node crossed out with two lines has failed; a node crossed out with one line is currently not functioning and, accordingly, cannot fail.

The diagram shows the failure rates ($\lambda_0$, $\lambda_1$, $\lambda_2$) and restoration rates ($\mu_0$, $\mu_1$, $\mu_2$) of the server, disk, and switch, respectively. The intensity of synchronization of the distributed storage system, including the creation of an up-to-date data replica on the restored disk, is $\mu_3$. The intensity of restoring a virtual machine after an automatic restart, which includes starting the virtual machine on the standby server and loading the user program on it, is $\mu_4$.

To find the state probabilities from the given state and transition diagrams, systems of algebraic equations are compiled when estimating the stationary availability factor, or systems of differential equations when estimating the non-stationary availability factor. We write the system of differential equations according to the state and transition diagram (figure 6) as follows:

$$
\begin{cases}
\quad\vdots\\
P_6'(t) = -\mu_0 P_6(t) + \lambda_0 P_1(t) + \lambda_1 P_0(t),\\
P_7'(t) = -\mu_1 P_7(t) + \lambda_1 P_2(t),\\
P_8'(t) = -\mu_0 P_8(t) + \lambda_0 P_2(t),\\
P_9'(t) = -\mu_0 P_9(t) + \lambda_1 P_3(t) + \lambda_1 P_{12}(t),\\
P_{10}'(t) = -\mu_0 P_{10}(t) + \lambda_0 P_3(t) + \lambda_0 P_{12}(t),\\
P_{11}'(t) = -\mu_4 P_{11}(t) + \lambda_0 P_4(t),\\
P_{12}'(t) = -\mu_4 P_{12}(t) + \lambda_1 P_4(t).
\end{cases} \tag{9}
$$

The simplified Markov model of cluster reliability, which takes into account neither the reduction in cluster availability during migration nor the cost of migrating virtual machines, leads to an upper estimate of the system reliability. Its state and transition diagram is presented in figure 7.

The system of differential equations corresponding to the state and transition diagram shown in figure 7 has the form: (10)

The results of calculating the non-stationary availability factors of the cluster for the models corresponding to the diagrams in figures 6 and 7 are shown in figure 8.

In figure 8, curves 1 and 2 correspond to the non-stationary availability factors $K_1(t)$ and $K_2(t)$ based on the diagrams in figure 6 and figure 7, respectively. Curve 3 in figure 8 corresponds to the difference $\Delta = K_2(t) - K_1(t)$ (its value axis is on the right). The calculation was performed with the following failure rates of the server, disk, and switch: $\lambda_0 = 1.115 \times 10^{-5}$ 1/h, $\lambda_1 = 3.425 \times 10^{-6}$ 1/h, $\lambda_2 = 2.3 \times 10^{-6}$ 1/h, and the following recovery rates, respectively: $\mu_0 = 0.33$ 1/h, $\mu_1 = 0.17$ 1/h, $\mu_2 = 0.33$ 1/h.

The intensity of synchronization of the distributed storage system is $\mu_3 = 1$ 1/h, and the intensity of restoring a virtual machine is $\mu_4 = 2$ 1/h. The calculations were performed in the Mathcad computer mathematics system. The graphs show that taking the migration of virtual machines into account has a significant impact on the reliability estimate.
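To give a feel for the scale of these parameters, the following sketch converts each pair of rates into a mean time to failure, a mean recovery time, and the stationary availability of a single element treated as an independent two-state (up/down) unit. This two-state view is an assumption made only for illustration; it is not the cluster model of figures 6 and 7. The dictionary and function names are hypothetical, and the disk recovery rate is read as 0.17 1/h.

```python
# (failure rate, recovery rate) in 1/h, as quoted in the text.
RATES = {
    "server": (1.115e-5, 0.33),
    "disk": (3.425e-6, 0.17),
    "switch": (2.3e-6, 0.33),
}


def element_summary(lam, mu):
    """Summary of one element under an assumed two-state up/down model."""
    return {
        "mttf_h": 1.0 / lam,               # mean time to failure, hours
        "mttr_h": 1.0 / mu,                # mean recovery time, hours
        "availability": mu / (lam + mu),   # stationary availability
    }


SUMMARY = {name: element_summary(lam, mu) for name, (lam, mu) in RATES.items()}

if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print("%-6s MTTF=%.0f h  MTTR=%.2f h  A=%.8f"
              % (name, s["mttf_h"], s["mttr_h"], s["availability"]))
```

Failures are rare compared with recoveries, so each element on its own is almost always available. The difference between the two cluster models therefore comes from the short intervals spent in synchronization and virtual machine restart states.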

Thus, a Markov model of the reliability of a failover cluster is proposed that takes into account the costs of migrating virtual machines. A simplified model of a failover cluster has also been built, which neglects the cost of migrating virtual machines when restoring the cluster. Taking virtualization mechanisms into account, in particular the migration of virtual machines, is shown to have a significant impact on the reliability of a failover cluster as estimated by the non-stationary availability factor.

5. Conclusions

As a result of theoretical analysis, it has been established that a cluster is understood as a group of interconnected resources, perceived by the user as a single resource. Clusters are created to achieve high availability, fault tolerance, and system performance based on the consolidation of resources.

The article considers a software-based method of deploying a fault-tolerant computing cluster consisting of two physical servers (main and backup), each with a local hard disk. The servers are connected via a switch. A distributed storage system with synchronous data replication from the source server to the standby server is deployed on the server disks, and a virtual machine is running on the cluster. A Markov model of the reliability of a failover cluster that accounts for the costs of migrating virtual machines has been proposed, together with a simplified model that neglects the cost of migrating virtual machines when restoring the cluster. The calculations were performed in the Mathcad computer mathematics system. They allow us to conclude that accounting for the migration of virtual machines has a significant impact on reliability.