<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cluster fault tolerance model with migration of virtual machines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrii V. Riabko</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetiana A. Vakaliuk</string-name>
          <email>tetianavakaliuk@acnsci.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oksana V. Zaika</string-name>
          <email>ksuwazaika@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman P. Kukharchuk</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerii V. Kontsedailo</string-name>
          <email>valerakontsedailo@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Academy of Cognitive and Natural Sciences</institution>
          ,
          <addr-line>54 Gagarin Ave., Kryvyi Rih, 50086</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inner Circle</institution>
          ,
          <addr-line>Nieuwendijk 40, 1012 MB Amsterdam</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute for Digitalisation of Education of the NAES of Ukraine</institution>
          ,
          <addr-line>9 M. Berlynskoho Str., Kyiv, 04060</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Kryvyi Rih State Pedagogical University</institution>
          ,
          <addr-line>54 Gagarin Ave., Kryvyi Rih, 50086</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Oleksandr Dovzhenko Hlukhiv National Pedagogical University</institution>
          ,
          <addr-line>24 Kyivska Str., Hlukhiv, 41400</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Zhytomyr Polytechnic State University</institution>
          ,
          <addr-line>103 Chudnivsyka Str., Zhytomyr, 10005</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>23</fpage>
      <lpage>40</lpage>
      <abstract>
        <p>High reliability, fault tolerance, and continuity of the computing process in computer systems are achieved by combining computing resources into clusters. This approach relies on virtualization: virtual resources, services, or applications are moved between physical servers while the continuity of computing processes is maintained. The object of study is a failover cluster that, in the simplest case, consists of two physical servers (primary and backup) connected through a switch. Each server has a local hard disk. A distributed storage system with synchronous data replication from the primary server to the backup server is deployed on the local disks of the servers. A virtual machine runs on the cluster. A shadow copy of the virtual machine is launched on the backup server so that, if the primary server fails, the computing process can continue on the virtual machine of the backup server. The non-stationary availability factor is taken as the reliability indicator. A Markov model of the reliability of a failover cluster is proposed that takes into account the cost of migrating virtual machines as well as the mechanisms that ensure the continuity of the computing process (service) in the cluster when one physical server fails. As a result of memory migration, two copies of the virtual machine are maintained on different physical servers, so that if one server fails, work continues on the other. A simplified model of a failover cluster is also built; it neglects the cost of migrating virtual machines when restoring the cluster and gives an upper bound on reliability. The virtual machine migration process is shown to have a significant impact on the reliability of a failover cluster, as estimated by the non-stationary availability factor. The results obtained can be used to justify the choice of technology for ensuring fault tolerance and continuity of the computing process in computer systems with a cluster architecture.</p>
      </abstract>
      <kwd-group>
        <kwd>virtual machine</kwd>
        <kwd>virtualization</kwd>
        <kwd>reliability</kwd>
        <kwd>fault tolerance</kwd>
        <kwd>redundancy</kwd>
        <kwd>clusters</kwd>
        <kwd>non-stationary availability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Fault tolerance is a property of a system that allows it to continue to act correctly in the event of
an error or failure in some of its parts. Modern systems for processing, storing, and transmitting
data for various purposes, including cyber-physical and communication systems, are subject to
high requirements for reliability, security, fault tolerance, and low cost of implementation and
operation [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The requirements for computer systems largely depend on the applications
they run, their sensitivity to delays and to interruptions of service, the features of operation,
and their complexity. High reliability, fault tolerance, and availability of computer systems for
critical applications are achieved by consolidating processing and storage resources based on
clustering technology, dynamic distribution of requests, and virtualization. In a clustered system
with virtualization, in the event of failure or disconnection of physical servers for maintenance
or other work, operability is ensured by moving virtual resources, services, or applications
between physical servers while maintaining the continuity of computing processes.
      </p>
      <p>
        Modern virtualization technologies are based on the targeted migration of virtual resources
between physical servers to adapt cluster systems to the accumulation of physical server
failures [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When migrating virtual machines (VMs), a cluster can use shared data storage that holds the
virtual disks of the virtual machines, which speeds up migration because only the
main memory, virtual processor registers, and virtual device state of the virtual machines are transferred.
Nevertheless, the majority of edge devices, including UAVs (Unmanned Aerial Vehicles), tablets,
and cellular phones, are mobile in nature [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Therefore, the configuration of the cluster must
be flexible enough to adapt dynamically to the evolving network topology of the edge cluster,
minimizing the overall communication delay incurred by the edge devices in processing the
data received from IoT devices [5, 6]. In a cluster without a shared storage implementation, the
migration also moves the contents of the virtual disks of the virtual machines, which can be
significant in size, and this slows down the migration process. Moving virtual
resources can be slowed down further when the transfer goes across the network. The process of
dynamic migration can be divided into two stages: transferring the data (virtual machine
registers, RAM, disks) to a backup server and activating the virtual
machines on it.
      </p>
      <p>Virtualization technologies aimed at ensuring high reliability of computer systems include
High Availability Cluster and Fault Tolerance. The first supports automatic
restart of the virtual machine on healthy cluster nodes; the second maintains the continuity of the
computing process when it is moved to cluster servers that remain operational.
High Availability Cluster technology allows you to automatically move a virtual machine from
a failed server to a healthy server. Restoring the functionality of the virtual machine may
take several minutes, depending on the configuration and loading of the physical server and
the properties of user programs. With this technology, to automatically restart the virtual
machine, all data must be stored on a shared data storage, which can be implemented as a device
connected to all cluster nodes, or a distributed storage system. After any physical server fails,
other servers can run virtual machines using virtual disks located on shared storage. In this case,
the status of the virtual machine is lost, including data in RAM, registers of virtual processors,
and external devices. Therefore, the system takes time to initialize the virtual machine and
bring it to a pre-failure state.</p>
      <p>For this virtualization mechanism to work correctly, a failed physical server must be
isolated so that, after a reboot, two virtual machines do not execute the same computing
process simultaneously and leave inconsistent data in the shared storage.</p>
      <p>High Availability Cluster technology assumes that after the failure of any physical server,
the virtual machines running on it are automatically distributed among the surviving nodes
and restarted on them. The RAM state of all virtual machines that were on the failed node is
lost. Fault Tolerance technology ensures the continuity of the computing process (service) in
the cluster after the failure of one physical server by maintaining two copies of the virtual
machine in RAM on different physical servers, so that if one of them fails,
work continues on the other. While the
virtual machine runs on one of the servers, the other must maintain an up-to-date copy of the RAM
of the active virtual machine. In this case, virtual disk images of the virtual machine must be
stored in dedicated or distributed storage with synchronous data replication. Software products
that support fault tolerance technology include VMware Fault Tolerance, Kemari for Xen, and
KVM.</p>
      <p>These virtualization mechanisms affect the reliability of a cluster system, which must be
taken into account when justifying the structure of the system, organizing computing
processes, and choosing the recovery and maintenance disciplines of highly reliable cluster systems.
The choice of design solutions for building highly reliable cluster systems should be
based on models that assess the reliability, availability, fault tolerance, and performance of
candidate implementations.</p>
      <p>The purpose of this article is to build models of cluster systems that make it possible to
assess the impact of the virtualization process on their reliability. The models considered are
intended to justify the choice of the structure of a cluster and of its maintenance and recovery
discipline, taking into account the requirements of the applied tasks and the virtualization
mechanisms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Theoretical background</title>
      <p>
        In the era of cloud computing [7], fault tolerance is a crucial technology that enables non-stop
and long-lasting services to achieve high availability. This is typically accomplished through the
use of virtualization technology [8]. Fault tolerance is a critical requirement for achieving
high performance in cloud computing [9]. Virtualization is a widely used strategy, especially
in the field of cloud computing, to make better use of existing computing resources. Nevertheless, ensuring
the stability and reliability of virtualization has become a significant concern [10]. According
to Xu et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], fault tolerance has a significant impact on the performance criteria of virtual
machine scheduling. With the growing demand for cloud computing infrastructure, availability
and reliability have become increasingly crucial due to their importance as major features in
real-time computing systems [11].
      </p>
      <p>Virtualization has become the foundation of cloud computing, enabling the deployment of
virtual machines for data dissemination and administration. In modern applications, data is
often stored using polyglot persistence, which combines SQL and NoSQL data stores. However,
since these services are customized for specific storage requirements, it may be necessary to
aggregate them from several heterogeneous clouds or migrate data from one cloud to another.
Data migration can be performed offline when the database is independent of the application;
otherwise, the application must be taken offline during the migration process [12].</p>
      <p>Cloud computing is a model that provides internet-based access to physical
and application resources. These resources are virtualized and offered to users as a service
through virtualization software. Nevertheless, virtual machine (VM) migration using
virtualization technology can adversely affect cloud performance, making it a major concern. The
uneven distribution of VMs during resource allocation and their frequent movement from one
server to another can lead to increased energy consumption and network overhead [13, 14].</p>
      <p>
        Cloud Computing is now extensively used for both personal and professional purposes
[15]. Nevertheless, the widespread adoption and growth of cloud computing resources due to
technological advancements have raised concerns about cloud service reliability and high energy
consumption. In cloud computing, the primary challenges include ensuring data availability,
backup replication, data efficiency, and reliability, as failures are frequently encountered during
execution. Therefore, developing a fault tolerance technique is necessary to ensure reliability
and availability while reducing energy consumption in the cloud. Currently, two primary
fault-tolerant techniques exist – proactive and reactive fault tolerance [
        <xref ref-type="bibr" rid="ref5">16</xref>
        ]. To alleviate the
resource burden on specific servers, the problem involves selecting one or more suitable virtual
machines (VMs) for migration. Sivagami and Easwarakumar [
        <xref ref-type="bibr" rid="ref6">17</xref>
        ] introduce a new approach
called Dynamic Fault Tolerant VM Migration that enforces reliability in cloud data center
infrastructure through an advanced recovery mechanism for Virtual Network demand.
      </p>
      <p>
        Placing virtual machines in highly reliable cloud applications is a challenging and crucial
concern. To address this, the K-means clustering algorithm is utilized. Furthermore, the adaptive
particle swarm optimization with the coyote optimization algorithm is employed to obtain the
optimal cluster for virtual machine placement and reduce the challenge [
        <xref ref-type="bibr" rid="ref1 ref7">18, 1</xref>
        ]. Zhang et al. [
        <xref ref-type="bibr" rid="ref8">19</xref>
        ]
establish a model of the initial placement of fault-tolerant virtual machines in star-topology
data centers of cloud systems, taking into account several factors such as the violation rate of
service-level agreements, the remaining rate of resources, the rate of power consumption, the
rate of failure, and the cost of fault tolerance. Fang et al. [
        <xref ref-type="bibr" rid="ref9">20</xref>
        ] developed a multi-factor real-time
monitoring fault tolerance (MRMFT) model based on a GPU cluster to facilitate large-scale data
processing.
      </p>
      <p>
        Simultaneously, the continuously increasing demand for cloud resources results in service
unavailability, which poses critical challenges such as cloud outages, violations of service-level
agreements, and excessive power consumption [
        <xref ref-type="bibr" rid="ref10">21</xref>
        ]. Abdulhamid et al. [
        <xref ref-type="bibr" rid="ref11">22</xref>
        ] suggested a dynamic
clustering league championship algorithm (DCLCA) scheduling technique that prioritizes fault
tolerance awareness to tackle cloud task execution. This approach considers the currently
available resources and minimizes the occurrence of untimely failure of autonomous tasks [
        <xref ref-type="bibr" rid="ref11">22</xref>
        ].
The growth of cloud usage has presented various challenges, including high energy consumption
in Cloud Data Centers, security risks to Virtual Machines (VMs) due to co-residency with other
risky VMs on the same Physical Machine, and Quality of Service (QoS) degradation caused by
resource sharing. To address these issues, researchers have utilized Dynamic VM Consolidation
to reduce energy consumption while minimizing QoS degradation. However, there are security
concerns during data transmission when migrating VMs in a cloud environment. To solve
this problem, Mangalagowri and Venkataraman [
        <xref ref-type="bibr" rid="ref12">23</xref>
        ] propose a Capability and Access Control
(CAC) service scheme based on Software Defined Networks (SDN). In cloud data centers, virtual
machine replication is useful for achieving fault tolerance, load balancing, and rapid response
to user requests [
        <xref ref-type="bibr" rid="ref13">24</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Research methods</title>
      <p>Summarizing the considered studies, it should be noted that the theory of reliability studies the
patterns of failures of technical objects (which, in particular, include information, computer
systems, and networks), methods, and models of reliability analysis and ensuring their stable
operation under failure conditions. Reliability is understood as the property of an object to
maintain the ability to perform the necessary functions over time under given modes and
conditions of use, maintenance, storage, and transportation. In other words, the reliability of an
object is its ability to do what is needed in time.</p>
      <p>Information, computing, and info-communication systems and networks have the following
features. First, the impact of processing, storage, and transmission
processes on the ability to perform the necessary functions must be taken into account. These processes create delays
that can prevent functions from being completed within the required period and, as a result, cause failures in the
implementation of the necessary functions.</p>
      <p>Secondly, the impact of software operation on the reliability of computer systems must be
taken into account; software failures have certain specifics compared with failures in traditional technical
systems. This specificity is due to errors in algorithms or programs that were not detected
during testing, or to the failure to account for rare
events that are potentially possible during the operation of information and computer systems
as software and hardware complexes.</p>
      <p>Thirdly, the reliability of an information system depends to some extent on its
information security. A violation of information protection can manifest itself in a deterioration
of working conditions, an increase in load, or an integrity violation, in particular, loss or distortion of
information, which can lead to failure to perform the necessary system functions, erroneous
performance, or delays beyond the permissible limits. Functioning under
a security breach can, in particular, manifest itself in the initialization of processes not
provided for during normal operation, which, in addition to failure to perform the necessary
functions and violation of the stationarity of operating modes, can lead to an increase in load,
overheating of processors and, ultimately, an increase in the failure rate.</p>
      <p>One of the main components of reliability is fault tolerance. Fault tolerance is the ability of a
system to keep functioning in case of failures. The potential for maintaining the fault tolerance
of the system depends on the types, number, combinations of failures, and location. Computing
systems are characterized by the requirements to ensure the operability (reliability) not only of
their structure as a set of hardware and software resources (including redundant ones) but also
of the computing process, in particular, if it is necessary to maintain its continuity in the face of
failures, faults, and external destructive effects of a random or malicious nature. A feature of an
information computer system is the need to consider it not only as a general technical object
with requirements for structural and parametric reliability but also as an object that implements
information and computing processes with the requirement of functional reliability.</p>
      <p>A redundant system has many operable states, among which one initial state can be
distinguished: it is characterized by the operability of all elements of the system and, accordingly,
by the best quality (efficiency) of functioning. As failures accumulate
in a fault-tolerant system, its efficiency and its potential to ensure
reliability usually degrade.</p>
      <p>The operational state, in which the current values of the parameters are at such a level
that the failure of one element can lead to the failure of the system, is called the pre-failure
state. In the sequence of states of a redundant system, between the initial state and the state
before failure, there are usually one or more intermediate states. The number of failures of
the elements that bring the system from the initial state to the pre-failure state characterizes
the redundancy of the system and its resistance to failure. In the general case, systems have a
complex combinatorial dependence of the number of failures sustained by the system during its
degradation on the relative position of the failed elements.</p>
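      <p>Fault tolerance indicators should reflect the dynamics of maintaining efficiency in the event
of one, two, or more element failures. Deterministic and probabilistic fault tolerance indicators
are used. The deterministic indicators of fault tolerance are: 1) <inline-formula><tex-math>d</tex-math></inline-formula>, the maximum number of element
failures under which the system’s operability is guaranteed; 2) <inline-formula><tex-math>m</tex-math></inline-formula>, the maximum number of
element failures at which it is still possible to maintain the system’s operability. The maximum
number of element failures at which the system’s operability is guaranteed, <inline-formula><tex-math>d</tex-math></inline-formula> (this indicator is
called <inline-formula><tex-math>d</tex-math></inline-formula>-reliability), corresponds to the minimum number of failures with the most unfortunate
combination of element failures:
<disp-formula><label>(1)</label><tex-math>d = \min_{i} d_i,</tex-math></disp-formula>
where <inline-formula><tex-math>d_i</tex-math></inline-formula> is the number of elements that failed during the transition from the initial (fully
operational) state to the pre-failure state along the <inline-formula><tex-math>i</tex-math></inline-formula>-th path. Similarly,
<disp-formula><label>(2)</label><tex-math>m = \max_{i} m_i,</tex-math></disp-formula>
where <inline-formula><tex-math>m_i</tex-math></inline-formula> is the number of failures of elements during the transition to the state preceding the
failure along the <inline-formula><tex-math>i</tex-math></inline-formula>-th path (each path can have several states preceding the failure).</p>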
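      <p>The two deterministic indicators can be illustrated with a minimal sketch. It assumes a monotone system described by an operability predicate over the set of working elements; the function name, the element labels, and the example cluster of two server and storage pairs are hypothetical and serve only as an illustration. The guaranteed indicator is found as the largest number of failures that the system survives for every combination of failed elements, and the possible indicator as the largest number that it survives for at least one combination.</p>
      <preformat>
```python
from itertools import combinations

def fault_tolerance_indicators(elements, is_operable):
    """Return (d, m) for a system given by an operability predicate.

    d: largest k such that every set of k failed elements leaves the system operable.
    m: largest k such that at least one set of k failed elements leaves it operable.
    Assumes a monotone system: extra failures never restore operability.
    """
    elements = list(elements)
    d, m = 0, 0
    guaranteed = True
    for k in range(1, len(elements) + 1):
        outcomes = [is_operable(set(elements) - set(failed))
                    for failed in combinations(elements, k)]
        if guaranteed and all(outcomes):
            d = k
        else:
            guaranteed = False
        if any(outcomes):
            m = k
    return d, m

# Hypothetical example: two server+storage pairs; the system works while
# at least one complete pair is operable.
ELEMENTS = ["S1", "D1", "S2", "D2"]

def pair_cluster_operable(working):
    return {"S1", "D1"}.issubset(working) or {"S2", "D2"}.issubset(working)

d, m = fault_tolerance_indicators(ELEMENTS, pair_cluster_operable)
print(d, m)  # prints "1 2"
```
</preformat>
      <p>In this example any single failure is tolerated, but two failures are tolerated only when they fall within the same pair, which is why the guaranteed and the possible numbers of failures differ.</p>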
      <p>A cluster is understood as a group of interconnected resources (servers, information storage
devices, etc.) that is perceived by the user (query source) as a single resource. Clusters
are created to achieve high availability, fault tolerance, and system performance based on the
consolidation of resources. They can be built from resources of the same type or of different types
(by parameters or functionality). In the first case, the cluster is homogeneous,
and in the second, heterogeneous. The joint work of cluster nodes is coordinated through
a high-speed dedicated line or through a local network over which messages are exchanged.
Clusters are divided into server systems without disk sharing and server systems
with disk sharing.</p>
      <p>An example of clusters when combining two servers is shown in figure 1.</p>
      <p>When clustering a group of servers, clusters can be organized with different levels of
redundancy. Options for combining servers and storage devices in clusters are shown in figure 2.</p>
      <p>In the topology with fully separate access (figure 2, a), each storage device (disk array) is connected
to only one cluster server. In this topology, the configuration remains fault tolerant
after a node failure, but functional queries may fail because the calculation results
held by the failed node are lost. Organizing the computational process without
losing the functional requests that were being executed at the time of failure is, in principle, possible
for this topology, but it significantly slows down the computational process,
because intermediate results must be saved periodically on other nodes via the local network.</p>
      <p>The N+1 topology (figure 2, b) means that each storage device (disk array) is connected to two
cluster nodes, with one redundant server connected to all storage devices. It is used to organize
high-availability clusters if one node can be allocated for redundancy. This topology reduces
the load on active nodes and ensures that the load of a failed node can be moved to the standby
node without loss of quality. It tolerates the failure of any of the primary nodes while
using a single redundant node. In the cluster pair topology (figure 2, c), nodes are grouped
in pairs, storage devices are attached to both nodes of the pair, and each node has access to
all storage devices (disk arrays) of the pair. Thus, fault tolerance is maintained within cluster
pairs. In a full-access topology, servers and storage devices are connected through switches
(figure 2, d), so the system can be expanded by adding servers and storage devices to
the cluster without changing existing connections. This topology provides fault tolerance for
all cluster resources, which is achieved by redistributing the tasks of failed nodes
among healthy nodes.</p>
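      <p>The probability of failure-free operation of the structures depicted in figure 2, with the same
number <inline-formula><tex-math>n</tex-math></inline-formula> of servers and storage devices, provided that at least one server and its associated
storage device must work in the system, is respectively found as
<disp-formula><label>(3)</label><tex-math>\begin{cases}
P_1(t) = 1 - (1 - p_1(t) p_2(t))^{n}, \\
P_2(t) = p_1(t)(1 - p_2(t)) + (1 - p_1(t))(1 - (1 - p_1(t) p_2(t))^{n-1}), \\
P_3(t) = (1 - (1 - p_1(t))^{2} (1 - p_2(t))^{2})^{n/2}, \\
P_4(t) = 1 - (1 - p_1(t))^{n} (1 - p_2(t))^{n} (1 - p_3(t))^{k},
\end{cases}</tex-math></disp-formula>
where <inline-formula><tex-math>p_1(t)</tex-math></inline-formula>, <inline-formula><tex-math>p_2(t)</tex-math></inline-formula>, and <inline-formula><tex-math>p_3(t)</tex-math></inline-formula> are the probabilities of failure-free operation of servers, storage devices,
and switches, and <inline-formula><tex-math>k</tex-math></inline-formula> is the number of switches.</p>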
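      <p>For the first topology, in which each storage device is attached to exactly one server, the system works as long as at least one server and storage pair is operable. Assuming independent failures, the probability for n such pairs is one minus the probability that every pair has failed. The following sketch checks this closed form against a direct enumeration of all element states; the function names and the numerical values of the element probabilities are hypothetical.</p>
      <preformat>
```python
from itertools import product
from math import isclose

def p_separate_closed(p1, p2, n):
    """Closed form: at least one of n independent server+storage pairs works."""
    return 1.0 - (1.0 - p1 * p2) ** n

def p_separate_enumerated(p1, p2, n):
    """Sum the probabilities of all element states in which some pair works."""
    total = 0.0
    for state in product([True, False], repeat=2 * n):
        servers, disks = state[:n], state[n:]
        prob = 1.0
        for up in servers:
            prob *= p1 if up else 1.0 - p1
        for up in disks:
            prob *= p2 if up else 1.0 - p2
        # Pair i is operable only if both server i and storage device i are up.
        if any(s and d for s, d in zip(servers, disks)):
            total += prob
    return total

p1, p2, n = 0.95, 0.9, 3  # hypothetical element probabilities and pair count
closed = p_separate_closed(p1, p2, n)
brute = p_separate_enumerated(p1, p2, n)
print(round(closed, 6), round(brute, 6))
```
</preformat>
      <p>The enumeration grows exponentially with the number of elements, so it is useful only as a check of a closed-form expression on small configurations.</p>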
      <p>Uninterrupted operation of the infrastructure is possible only if there is an exact copy of the
existing server running the same processes and services. If a replica is created only after a
hardware failure, creating it takes time, which leads to downtime and interruptions in
the provision of services.</p>
      <p>Fault tolerance is implemented in hardware or in software. The hardware approach is a
“bifurcation” of the host: all the components of the system are duplicated,
and the computations are performed on both copies simultaneously. Synchronization is ensured by a special
node. The software approach is used more often but has several limitations. For example, its
deployment requires the presence of a processor, communication between individual virtual
machines, etc.</p>
      <p>The software approach to deploying a cluster is considered in our study.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussions</title>
      <p>Consider a highly reliable cluster implemented with virtualization technology focused on
maintaining the continuity of the service (computing process). A failover cluster in the simplest
case consists of two physical servers (primary and backup) with high-speed network interfaces
(figure 3). Each server has one local hard disk drive (HDD) connected via SATA or SAS interface.
Both servers have a hypervisor, clustering software, and virtualization management installed
on the HDD. A distributed storage system with synchronous data replication from the source
server to the backup server is deployed on the local disks of the servers. The cluster is running
a virtual machine in failover mode.</p>
      <p>The system launches a shadow copy of the virtual machine on the backup server,
so that after a failure of the primary server the computing process continues on the
virtual machine of the backup server without interruption. Supporting the continuity of the
computing process during automatic recovery after a failure (reconfiguration) requires constant
synchronization of RAM and disk data. This can be achieved with high-speed network
adapters and layer-2 switches, for example, 10G Ethernet or InfiniBand, together with either a
distributed storage system on the servers that supports synchronous replication of disk data from
the primary to the backup server or a separate server that hosts an external storage system.</p>
      <p>We consider the case in which system resources lost as a result of failures are restored
immediately after the failure; this assumes that failures are detected instantaneously by
control devices and that personnel are ready to carry out repair work.</p>
      <p>For fault-tolerant cluster systems, we take the non-stationary availability factor <inline-formula><tex-math>K</tex-math></inline-formula> and the
non-stationary availability function <inline-formula><tex-math>K(t)</tex-math></inline-formula> as the reliability indicators. The non-stationary availability
factor is the probability that the system is ready to perform the necessary
functions (is operable) at a certain point in time. It characterizes the readiness of the object to perform the necessary
function at an arbitrary time <inline-formula><tex-math>t</tex-math></inline-formula> that is close enough to the moment of a fixed change in the state
of the system (before operation, or after preventive maintenance, testing, reconfiguration, or recovery). The
non-stationary availability factor is applied when the stationary mode, in which the state probabilities
no longer depend on time, has not yet been established. In general, these indicators depend on
the failure and recovery rates of the system elements, the time of continuous operation, and
the type and degree of redundancy.</p>
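      <p><disp-formula><label>(4)</label><tex-math>K(t) \cong \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} \cdot \exp[-t \cdot (\lambda + \mu)],</tex-math></disp-formula>
where <inline-formula><tex-math>K(t) = P_0(t)</tex-math></inline-formula>.
<disp-formula><label>(5)</label><tex-math>K = \lim_{t \to \infty} K(t) = \frac{\mu}{\lambda + \mu},</tex-math></disp-formula>
where <inline-formula><tex-math>n</tex-math></inline-formula> is the number of elements of the non-redundant system, <inline-formula><tex-math>\lambda_i</tex-math></inline-formula>, <inline-formula><tex-math>\mu_i</tex-math></inline-formula> are the corresponding
failure and restoration rates of the element of the <inline-formula><tex-math>i</tex-math></inline-formula>-th type, <inline-formula><tex-math>i = 1, 2, \ldots, n</tex-math></inline-formula>, and <inline-formula><tex-math>\lambda = \sum_{i=1}^{n} \lambda_i</tex-math></inline-formula> is the
system failure rate. The system restoration rate is
<disp-formula><label>(6)</label><tex-math>\mu = \frac{\sum_{i=1}^{n} \lambda_i}{\sum_{i=1}^{n} \lambda_i / \mu_i}.</tex-math></disp-formula>
The above dependencies indicate that the lower the ratio <inline-formula><tex-math>\lambda / \mu</tex-math></inline-formula>,
the higher the availability factor and the availability function.</p>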
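      <p>The availability relations can be evaluated numerically as in the following minimal sketch. It assumes a non-redundant system whose failure rate is the sum of the element failure rates and whose restoration rate is that sum divided by the sum of the ratios of element failure rates to element restoration rates; the function names and the element rates (per hour) are hypothetical.</p>
      <preformat>
```python
from math import exp, isclose

def system_rates(failure_rates, repair_rates):
    """Aggregate element rates into the system failure and restoration rates."""
    lam = sum(failure_rates)
    mu = lam / sum(l / m for l, m in zip(failure_rates, repair_rates))
    return lam, mu

def availability(t, lam, mu):
    """Non-stationary availability of a system that is operable at t = 0."""
    return mu / (lam + mu) + lam / (lam + mu) * exp(-(lam + mu) * t)

# Hypothetical element failure and restoration rates, 1/hour.
failure_rates = [1e-4, 2e-4]
repair_rates = [0.5, 0.25]

lam, mu = system_rates(failure_rates, repair_rates)
k_stationary = mu / (lam + mu)

for t in (0.0, 1.0, 10.0, 100.0):
    print(t, availability(t, lam, mu))
print("stationary:", k_stationary)
```
</preformat>
      <p>The function starts at one, because the system is assumed operable at the initial moment, and decays toward the stationary value at a rate set by the sum of the failure and restoration rates.</p>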
      <p>Dynamic models are used to calculate the fault tolerance characteristics of complex systems.
If the behavior of the system can be described by a Markov process, the mathematical model of the
reliability of such a system is a system of differential equations. For recoverable systems
with Poisson failure and restoration flows
(the failure intensity <inline-formula><tex-math>\lambda</tex-math></inline-formula> and the restoration intensity <inline-formula><tex-math>\mu</tex-math></inline-formula> are constants), the
mathematical model is a system of ordinary differential equations, which
can be solved analytically or numerically.</p>
      <p>Consider the case when the failure rate $\lambda(t)$ is a function of time. Figure 4 shows the Markov
graph of a restorable element, the mathematical model of which is a system of nonlinear
differential equations.</p>
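      <p>A minimal Python sketch of such a model is given below for a single restorable element with two
states (working and under repair). The linear ageing law used for the failure rate, its parameters,
and all function names are hypothetical and serve only to show how a time-dependent failure rate
enters the right-hand side; the equations are integrated with a classical fourth-order
Runge-Kutta step.</p>
      <preformat><![CDATA[
```python
def failure_rate(t, lam0=1.0e-3, growth=1.0e-3):
    """Hypothetical ageing law: the failure rate (1/h) grows linearly with time."""
    return lam0 * (1.0 + growth * t)

def derivatives(t, p, mu=0.33):
    """Right-hand side for a two-state element: p[0] working, p[1] under repair."""
    flow = -failure_rate(t) * p[0] + mu * p[1]
    return [flow, -flow]

def rk4_step(f, t, p, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, p)
    k2 = f(t + h / 2, [pi + h / 2 * ki for pi, ki in zip(p, k1)])
    k3 = f(t + h / 2, [pi + h / 2 * ki for pi, ki in zip(p, k2)])
    k4 = f(t + h, [pi + h * ki for pi, ki in zip(p, k3)])
    return [pi + h / 6 * (a + 2 * b + 2 * c + d)
            for pi, a, b, c, d in zip(p, k1, k2, k3, k4)]

def solve_element(t_end=100.0, steps=1000):
    """Return a list of (t, K(t)) pairs, where K(t) is the probability of the working state."""
    h = t_end / steps
    p = [1.0, 0.0]              # the element starts in the working state
    history = [(0.0, p[0])]
    for i in range(steps):
        p = rk4_step(derivatives, i * h, p, h)
        history.append(((i + 1) * h, p[0]))
    return history

history = solve_element()
for t, k in history[::250]:
    print(f"t = {t:6.1f} h   K(t) = {k:.6f}")
```
]]></preformat>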
      <p>If Markov models of fault tolerance are built for a system consisting of several renewable
elements, the state space of the model grows. The system of differential equations with respect
to $P_i(t)$ ($i = 1, 2, ..., n$) has the general form
$$\begin{cases}
P_1'(t) = -P_1(t) \sum_j a_{1j}(t) + \sum_j P_j(t)\, a_{j1}(t), \\
\dots \\
P_i'(t) = -P_i(t) \sum_j a_{ij}(t) + \sum_j P_j(t)\, a_{ji}(t), \\
\dots \\
P_n'(t) = -P_n(t) \sum_j a_{nj}(t) + \sum_j P_j(t)\, a_{jn}(t),
\end{cases} \qquad (7)$$
where $a_{ij}(t)$ denotes the intensity of the transition from state $i$ to state $j$. The first sum on
the right side of each equation contains the intensities of transitions from the current state $i$,
and the second sum contains the intensities of transitions to state $i$. Transitions corresponding to
failures have time-dependent coefficients; transitions corresponding to the restoration of working
capacity have constant coefficients.</p>
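      <p>The structure of this right-hand side (outgoing intensities subtracted, incoming intensities
added) can be sketched in Python as follows. The dictionary layout, the function name, and the
three-state example with its rates are hypothetical and are not taken from the model in this paper.</p>
      <preformat><![CDATA[
```python
def kolmogorov_rhs(transitions, p, t):
    """Derivatives dP_i/dt: outflow from state i is subtracted, inflow to state i is added."""
    dp = [0.0] * len(p)
    for (i, j), rate in transitions.items():
        a = rate(t) if callable(rate) else rate   # intensity of the transition i -> j at time t
        flow = a * p[i]                           # probability flow from state i to state j
        dp[i] -= flow
        dp[j] += flow
    return dp

# Hypothetical example: 0 = both units working, 1 = one unit failed, 2 = both units failed.
transitions = {
    (0, 1): lambda t: 2 * 1.0e-4 * (1 + 0.01 * t),  # time-dependent failure intensity
    (1, 2): lambda t: 1.0e-4 * (1 + 0.01 * t),      # time-dependent failure intensity
    (1, 0): 0.33,                                   # constant restoration intensity
    (2, 1): 0.33,                                   # constant restoration intensity
}

dp = kolmogorov_rhs(transitions, [0.9, 0.08, 0.02], t=10.0)
print(dp, sum(dp))
```
]]></preformat>
      <p>Because every flow is subtracted from one state and added to another, the derivatives sum to
zero, so the total probability is conserved.</p>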
      <p>In the general case, it is difficult to obtain an analytical solution to a system of nonlinear
differential equations; therefore, it is advisable to use numerical methods. For example, Mathcad
has a built-in function rkfixed, which is considered basic and implements the fourth-order
Runge-Kutta method with a fixed step. This function is designed to solve systems of first-order
differential equations
$$\begin{cases}
y_1' = f_1(x, y_1, y_2, ..., y_n), \\
y_2' = f_2(x, y_1, y_2, ..., y_n), \\
\dots \\
y_n' = f_n(x, y_1, y_2, ..., y_n).
\end{cases} \qquad (8)$$</p>
      <p>The function rkfixed(y, x1, x2, npoints, D) returns a matrix of npoints + 1 rows, in which
the first column contains the points at which the solution is evaluated, and the other columns
contain the solution and its first n − 1 derivatives.</p>
      <p>The function arguments are:
y is the vector of initial values (n elements);
x1 and x2 are the limits of the interval on which the solution is sought;
npoints is the number of points inside the interval (x1, x2) at which the solution is sought,
chosen so as to obtain the desired accuracy of numerical integration;
D is a vector function of n elements that returns the first derivatives of the desired functions.</p>
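      <p>The calling convention described above can be imitated outside Mathcad. The following Python
function is a sketch that borrows the name and argument order of rkfixed for readability; it is an
independent re-implementation of the classical fixed-step fourth-order Runge-Kutta scheme, not
Mathcad code, and the test equation y' = −y is chosen only because its exact solution is known.</p>
      <preformat><![CDATA[
```python
import math

def rkfixed(y, x1, x2, npoints, D):
    """Fixed-step RK4. Returns npoints + 1 rows of the form [x, y_1(x), ..., y_n(x)]."""
    h = (x2 - x1) / npoints
    x, y = x1, list(y)
    rows = [[x] + y]
    for step in range(1, npoints + 1):
        k1 = D(x, y)
        k2 = D(x + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = D(x + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = D(x + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        x = x1 + step * h          # recomputed from x1 to avoid accumulating round-off
        rows.append([x] + y)
    return rows

# Test problem: y' = -y, y(0) = 1 on the interval [0, 1]; the exact solution is exp(-x).
table = rkfixed([1.0], 0.0, 1.0, 10, lambda x, y: [-y[0]])
x_end, y_end = table[-1]
print(f"rows = {len(table)}, y({x_end:.1f}) = {y_end:.8f}, exact = {math.exp(-1.0):.8f}")
```
]]></preformat>
      <p>Column 0 of the returned table holds the argument values and the remaining columns hold the
unknown functions, which mirrors the layout of the matrix described above.</p>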
      <p>As an example, consider the solution of a system of differential equations for finding the
non-stationary availability factor of a duplicated system (figure 5). In the solution matrix,
column 0 holds the time values, and the subsequent columns hold the state probabilities as
functions of time. Figure 5 also shows a plot of the non-stationary availability factor
(availability function) versus time.</p>
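      <p>A comparable calculation can be sketched in Python. The three-state model below (both units
working, one unit failed, both units failed, with a single repair crew) is an assumption made for
illustration and is not necessarily the model behind figure 5; the rates are the server failure and
recovery rates quoted in this section, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
LAM, MU = 1.115e-5, 0.33    # server failure and recovery rates, 1/h

def rhs(t, p):
    """States: p[0] both units up, p[1] one unit down, p[2] both units down."""
    return [
        -2 * LAM * p[0] + MU * p[1],
        2 * LAM * p[0] - (LAM + MU) * p[1] + MU * p[2],
        LAM * p[1] - MU * p[2],
    ]

def rk4(f, p, t_end, steps):
    """Fixed-step RK4; each returned row is [t, P0(t), P1(t), P2(t)]."""
    h = t_end / steps
    rows = [[0.0] + p]
    for n in range(steps):
        t = n * h
        k1 = f(t, p)
        k2 = f(t + h / 2, [a + h / 2 * b for a, b in zip(p, k1)])
        k3 = f(t + h / 2, [a + h / 2 * b for a, b in zip(p, k2)])
        k4 = f(t + h, [a + h * b for a, b in zip(p, k3)])
        p = [a + h / 6 * (b + 2 * c + 2 * d + e)
             for a, b, c, d, e in zip(p, k1, k2, k3, k4)]
        rows.append([(n + 1) * h] + p)
    return rows

rows = rk4(rhs, [1.0, 0.0, 0.0], t_end=50.0, steps=500)

# The system is available while at least one unit works: K(t) = P0(t) + P1(t).
availability = [(r[0], r[1] + r[2]) for r in rows]
for t, k in availability[::100]:
    print(f"t = {t:5.1f} h   1 - K(t) = {1.0 - k:.3e}")
```
]]></preformat>
      <p>As in the solution matrix described above, column 0 of each row is the time and the following
columns are the state probabilities.</p>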
      <p>Let us build a Markov model of the reliability of a failover cluster with online recovery,
taking into account the implementation of mechanisms for moving a virtual machine. The state
and transition diagram of a failover cluster with online recovery when implementing virtual
machine movement is shown in figure 6. In the figure, the healthy states of the cluster (healthy
states without failed nodes) are indicated by vertices circled with a solid line; repairman – thick
solid line. The “VM” mark at a vertex of the graph indicates the server on which the virtual
machine with the virtual service is currently running. A vertex crossed out with two lines
denotes a failed node; a vertex crossed out with one line denotes a node that is currently not
functioning and, accordingly, cannot fail.</p>
      <p>The diagram shows the failure rates ($\lambda_0$, $\lambda_1$, $\lambda_2$) and recovery rates
($\mu_0$, $\mu_1$, $\mu_2$) of the server, disk, and switch, respectively. The intensity of updating
(synchronization of the distributed storage system), including placing an up-to-date replica of the
data on the restored disk, is $\mu_3$. The intensity of restoring a virtual machine after an automatic
restart, which includes starting the virtual machine on the standby server and loading the user
program on it, is $\mu_4$.</p>
      <p>To find the state probabilities from a given state and transition diagram, a system of
algebraic equations is compiled when estimating the stationary availability factor, or a system of
differential equations when estimating the non-stationary availability factor. We write the system of
differential equations according to the state and transition diagram (figure 6) as follows:
$$\begin{cases}
\dots \\
P_6'(t) = -\mu_0 P_6(t) + \lambda_0 P_1(t) + \lambda_1 P_0(t), \\
P_7'(t) = -\mu_1 P_7(t) + \lambda_1 P_2(t), \\
P_8'(t) = -\mu_0 P_8(t) + \lambda_0 P_2(t), \\
P_9'(t) = -\mu_0 P_9(t) + \lambda_1 P_3(t) + \lambda_1 P_{12}(t), \\
P_{10}'(t) = -\mu_0 P_{10}(t) + \lambda_0 P_3(t) + \lambda_0 P_{12}(t), \\
P_{11}'(t) = -\mu_4 P_{11}(t) + \lambda_0 P_4(t), \\
P_{12}'(t) = -\mu_4 P_{12}(t) + \lambda_1 P_4(t).
\end{cases} \qquad (9)$$</p>
      <p>The simplified Markov model of cluster reliability presented in figure 7 does not take into
account the reduction in cluster availability during the migration of virtual machines or the cost
of that migration, and therefore gives an upper estimate of the system reliability.</p>
      <p>The system of differential equations corresponding to the state and transition diagram, shown
in figure 7, has the form:
(10)</p>
      <p>The results of calculating the non-stationary availability factors of the cluster for the
models corresponding to the diagrams in figures 6 and 7 are shown in figure 8.</p>
      <p>In figure 8, curves 1 and 2 correspond to the non-stationary availability functions $K_1(t)$
and $K_2(t)$ evaluated from the diagrams in figure 6 and figure 7, respectively. Curve 3
in figure 8 corresponds to the difference $K_2(t) - K_1(t)$ (its value axis is on the right).
The calculation was performed with the following failure rates of the server, disk, and switch:
$\lambda_0 = 1.115 \times 10^{-5}$ 1/h, $\lambda_1 = 3.425 \times 10^{-6}$ 1/h, $\lambda_2 = 2.3 \times 10^{-6}$ 1/h,
and the corresponding recovery rates: $\mu_0 = 0.33$ 1/h, $\mu_1 = 0.17$ 1/h, $\mu_2 = 0.33$ 1/h.</p>
      <p>The intensity of synchronization of the distributed storage system is $\mu_3 = 1$ 1/h, and the
intensity of restoring a virtual machine is $\mu_4 = 2$ 1/h. The calculations were performed in the
Mathcad computer mathematics system. The graphs show that taking the migration of virtual machines
into account has a significant impact on the reliability estimate.</p>
      <p>Thus, a Markov model of the reliability of a failover cluster is proposed that takes into
account the costs of migrating virtual machines. A simplified model of a failover cluster, which
neglects the costs of restoration by migrating virtual machines, has also been built. Taking
virtualization mechanisms into account, in particular the migration of virtual machines, is shown
to have a significant impact on the reliability of a failover cluster (estimated by the
non-stationary availability factor).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>As a result of theoretical analysis, it has been established that a cluster is understood as a group
of interconnected resources, perceived by the user as a single resource. Clusters are created to
achieve high availability, fault tolerance, and system performance based on the consolidation of
resources.</p>
      <p>The article considers a software-based method of deploying a fault-tolerant computing
cluster consisting of two physical servers (main and backup), each with a local hard disk.
The servers are connected via a switch. A distributed storage system with synchronous
data replication from the main server to the standby server is deployed on the server disks,
and a virtual machine runs on the cluster. A model of a failover cluster has been built,
which neglects the costs of restoring the migration of virtual machines. The calculations were
performed in the Mathcad computer mathematics system. The calculations allow us to conclude
that accounting for the migration of virtual machines has a significant impact on reliability.
</p>
    </sec>
  </body>
  <back>
  </back>
</article>