<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of the Cluster Operating Time With the Migration of Virtual Machines</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saint-Petersburg State University of Aerospace Instrumentation</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A Markov reliability model of a fault-tolerant cluster is considered. The cluster uses virtualization technologies that ensure the continuity of the computational process when the servers' physical resources fail, under the assumption that recovery is impossible once the computational process has been interrupted. The probability of maintaining the system's operability, provided that the continuity of the computing process is ensured, was analyzed for different service organization options. The mean time to failure of such systems was found. The purpose of the work is to increase the functional reliability of computing systems of a cluster architecture while increasing the time to failure, taking into account the requirements for ensuring the continuity of the computing process. Fault tolerance is considered as the object of study. A virtual machine (VM) is running on the cluster. The system launches a shadow copy of the VM on the backup server, which allows execution to continue on the backup server after the primary server fails. The proposed models can be used to assess the level of system reliability and are important in choosing a system configuration for given conditions. Assessing the migration of virtual machines in the event of physical server failures makes it possible to calculate and evaluate the possible damage when using various models.</p>
      </abstract>
      <kwd-group>
        <kwd>Virtualization</kwd>
        <kwd>Mean time to failure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        For cluster computing systems, especially real-time ones, the key is to ensure
reliability and fault tolerance while maintaining the continuity of the computing
process. The achievement of high and stable performance indicators, reliability,
fault tolerance [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1-3</xref>
        ] and security [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] of computer systems is facilitated by technologies that consolidate resources
through clustering and virtualization [
        <xref ref-type="bibr" rid="ref5 ref6">5-6</xref>
        ],
accompanied by replication and migration of virtual machines between physical
servers. Migration and replication of virtual machines speed up the reconfiguration
process after failures of physical resources and help maintain
the continuity of the computing process required for managing cyber-physical
systems and real-time technological processes.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Cluster fault tolerance technology with continuous computing</title>
      <p>
One of the effective ways to achieve fault tolerance of computing systems and
processes is the migration of virtual resources between the physical nodes (servers)
of a cluster computing system. In a cluster that replicates
virtual machines (VMs) on different physical nodes, the VMs can migrate
between cluster nodes when physical resources fail, without stopping
computations on the servers [
        <xref ref-type="bibr" rid="ref17 ref18 ref19 ref7 ref8">7-8, 17-19</xref>
        ].
      </p>
      <p>
        Virtualization makes it possible to optimize the use of computing resources and increases
the scalability, fault tolerance and extensibility of the infrastructure through the
rapid redistribution of virtual resources [
        <xref ref-type="bibr" rid="ref10 ref9">9-10</xref>
        ].
      </p>
      <p>
        Recovery time during VM migration after failures depends on the structure
of the data storage. With storage shared by all physical nodes of the cluster, only
RAM, virtual processor registers, and the states of the VM's virtual devices are transferred
during migration [
        <xref ref-type="bibr" rid="ref5 ref6">5-6</xref>
        ]. If the data storage is local to each node of the cluster, the contents of the
hard disks must be transferred as well.
      </p>
      <p>
        Fault tolerance ensures the continuity of the computing process (service) in
the cluster after the failure of one of the physical servers by maintaining two
copies of the VM in RAM on different physical servers, so
that if one server fails, work continues on the other. While the
VM runs on the main server, the backup copy must maintain an
up-to-date copy of the RAM [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12-14</xref>
        ] of the active VMs. In this case, the virtual disk
images of the VM should be stored in dedicated or distributed data storage
with synchronous data replication. VMware Fault Tolerance and Kemari for Xen
and KVM [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] software products support fault tolerance technology.
      </p>
      <p>The purpose of the work is to increase the functional reliability of computing
systems of a cluster architecture while increasing the time to failure, taking into
account the requirements for ensuring the continuity of the computing process.</p>
      <p>By functional reliability, we mean the ability of a system to perform the
required functions, taking into account not only the operability of the resources
required for their implementation, but also the conditions necessary for
their implementation. The condition of operation adopted here is the continuity of the
computational process: interruptions of the redundant system
during recovery are inadmissible. Thus, in the
systems under consideration, recovery is possible only if it is combined with the
execution of the required functions by the non-failed nodes. The system enters
a non-recoverable state if it cannot be reconfigured so as to activate
the required number of operable resources. For such systems, the reliability
indicator is the time to failure, where failure includes violations of the
continuity of the computing process that occur when the permissible reconfiguration
time, including the cost of migrating virtual machines, is exceeded.</p>
    </sec>
    <sec id="sec-3">
      <title>Cluster organization and options for its recovery</title>
      <p>
The cluster computing system contains servers (Fig. 1). Each server
is connected directly to one local storage device.
To provide automatic reconfiguration that supports the
continuity of computing processes through dynamic migration, pairs of physical
servers, a primary (main) server and a backup server, are allocated in the cluster. The primary
server performs the required tasks that are critical to the continuity of the computing
process. The backup server is designed for dynamic reconfiguration that
preserves the continuity of the computing process if the primary server fails.
In addition to implementing dynamic
system reconfiguration, the backup server performs some background tasks that are not critical to
the continuity of the computational process or to query execution time.
If the backup server fails, the background tasks that it performs may be lost
or redistributed to the primary server, where they run in the background
with low priority.</p>
      <p>In its simplest implementation, a fault-tolerant cluster is equipped
with a pair of servers, one of which is designated as the main server and the other as
the backup.</p>
      <p>Fault tolerance technology involves launching a backup copy of the primary
server VM on the backup server and transferring the calculations to the backup
server in case of primary server or storage device failure.</p>
      <p>Consider two system options: one that provides (option A) and one that does not provide (option
B) the restoration of physical nodes in states in which the continuity of the
computing process is preserved during reconfiguration of the cluster, that is, states in which
a working server and its associated storage device can be selected to carry out
the computations.</p>
      <p>For both options, a transition to a failure
state in which the required functions cannot be performed even with
a minimal workable configuration is taken to mean that the computational process
is interrupted for longer than the maximum permissible time, which entails
a transition to a state of unrecoverable failure.</p>
      <p>Let us consider cluster systems in which fault-tolerant operation is provided by
combining physical servers pairwise into duplicated systems that support
virtual machine migration and data replication. For each pair of
servers in the cluster that interact to support dynamic reconfiguration
(a duplicated system), the state and transition diagrams for organization options
A and B of a duplicated cluster system with recovery disciplines are shown in</p>
    </sec>
    <sec id="sec-4">
      <title>Calculation of the probability of operability of duplicated systems</title>
      <p>The presented systems of differential equations make it possible to determine
the time dependence of the probabilities of all states.</p>
      <p>The probability that the system remains operable while maintaining the continuity of
the computing process is defined for option A as
<disp-formula><tex-math>P(t) = \sum_{i=0}^{4} P_i(t),</tex-math></disp-formula>
and for option B as
<disp-formula><tex-math>P(t) = \sum_{i=0}^{3} P_i(t).</tex-math></disp-formula></p>
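      <p>The summation over operable states can be illustrated with a minimal Python sketch. It does not reproduce the state diagrams of options A and B. Instead it assumes a hypothetical three-state duplicated system (state 0: both servers operable, state 1: one server operable, state 2: absorbing failure) with an assumed failure rate lam and recovery rate mu. The function names and the Runge-Kutta integration scheme are illustrative choices, not part of the original method.</p>
      <preformat>
```python
import math


def kolmogorov_rhs(p, q):
    """Right-hand side of dP/dt = P * Q for a row vector p and generator matrix q."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def state_probabilities(q, p0, t_end, steps=2000):
    """Integrate the Kolmogorov equations with the classical RK4 scheme.

    steps should be large enough that (t_end / steps) times the largest
    rate in q stays well below 1, otherwise the scheme becomes inaccurate.
    """
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = kolmogorov_rhs(p, q)
        k2 = kolmogorov_rhs([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = kolmogorov_rhs([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = kolmogorov_rhs([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def duplicated_generator(lam, mu):
    """Hypothetical model: 0 = both servers up, 1 = one up, 2 = absorbing failure."""
    return [
        [-2 * lam, 2 * lam, 0.0],
        [mu, -(lam + mu), lam],
        [0.0, 0.0, 0.0],
    ]


def operability(lam, mu, t_end, working_states=(0, 1)):
    """P(t) as the sum of the probabilities of the operable states at t_end."""
    p = state_probabilities(duplicated_generator(lam, mu), [1.0, 0.0, 0.0], t_end)
    return sum(p[i] for i in working_states)


def unrepaired_reference(lam, t):
    """Closed form of P(t) for the same hypothetical model with mu = 0."""
    return 2 * math.exp(-lam * t) - math.exp(-2 * lam * t)


if __name__ == "__main__":
    # Illustrative rates in 1/h, not the values used in the paper.
    print(operability(1e-3, 0.1, 1000.0))
    print(operability(1e-3, 0.0, 1000.0), unrepaired_reference(1e-3, 1000.0))
```
</preformat>
      <p>With mu = 0 the numerical result can be compared with the closed form returned by unrepaired_reference. A positive mu raises P(t), because state 1 can return to state 0 before the second failure occurs.</p>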
      <p>Fig. 1. Cluster model.</p>
      <p>The results of calculating the probability that duplicated computer systems
remain operable, provided that the computing process is continuous, are
presented in Fig. 3 for options A (curve 1) and B (curve 2) of the
maintenance organization.</p>
      <p>The calculations were performed with the failure rates λ<sub>0</sub> = 1.115 × 10<sup>−5</sup> 1/h,
λ<sub>1</sub> = 3.425 × 10<sup>−6</sup> 1/h, λ<sub>2</sub> = 2.3 × 10<sup>−6</sup> 1/h and the recovery rates
μ<sub>0</sub> = 0.33 1/h, μ<sub>1</sub> = 0.17 1/h, μ<sub>2</sub> = 0.33 1/h, μ<sub>3</sub> = 1 1/h.</p>
      <p>The presented dependences make it possible to evaluate how the
probability of maintaining the operability of a duplicated system is affected by
the inadmissibility of interrupting the computational process and by
restoration work carried out while the required functions continue
to be performed.</p>
    </sec>
    <sec id="sec-5">
      <title>Calculation of the probability of operability of duplicated systems</title>
      <p>
The mean time to failure and the probability of failure-free operation
are related by [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
<disp-formula><tex-math>T = \int_{0}^{\infty} P(t)\,dt.</tex-math></disp-formula>
      </p>
      <p>
        The average operating time to failure in accordance with the methodology of
[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is found as follows. The mean time to failure can be obtained by integrating
the system of differential equations for a model with an absorbing state under the
initial conditions P<sub>1</sub>(0) = 1, …, P<sub>k</sub>(0) = 0, …, P<sub>n</sub>(0) = 0 for a model with n states.
      </p>
      <p>
        For the systems under study, we integrate the left and right sides of the
systems of equations (1), (2) for the models under consideration. Given that
P<sub>i</sub>(∞) = 0 in the presence of an absorbing state, we have [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
      </p>
      <p>
        Here T<sub>i</sub> is the average time spent in working state i when starting
from an operable state. The mean time to failure is determined by summing T<sub>i</sub> over
all operable states [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
      </p>
      <p><disp-formula><tex-math>T = \sum_{i} T_i.</tex-math></disp-formula></p>
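      <p>The step from the integrated equations to the sum of the mean state times can be sketched in Python as follows. Integrating dP/dt = P Q from zero to infinity for the operable states gives the linear system sum over j of T_j q[j][i] = -P_i(0), which is then solved for T_i. The model is the same hypothetical duplicated system as in a textbook example (both servers operable, one server operable, absorbing failure), not the models of options A and B, and lam, mu and all function names are assumptions made for illustration.</p>
      <preformat>
```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mean_times_in_states(q_transient, p0):
    """Return T_i from  sum_j T_j * q[j][i] = -p0[i]  over the operable states."""
    n = len(p0)
    transposed = [[q_transient[j][i] for j in range(n)] for i in range(n)]
    return solve_linear(transposed, [-x for x in p0])


def mttf(lam, mu):
    """Mean time to failure of the hypothetical duplicated system."""
    # Rows and columns cover only the operable states 0 and 1;
    # the absorbing failure state is left out of the matrix.
    q_transient = [
        [-2 * lam, 2 * lam],
        [mu, -(lam + mu)],
    ]
    return sum(mean_times_in_states(q_transient, [1.0, 0.0]))


if __name__ == "__main__":
    # Illustrative rates in 1/h, not the values used in the paper.
    print(mttf(1e-3, 0.0))
    print(mttf(1e-3, 0.1))
```
</preformat>
      <p>For this hypothetical model the result can be checked by hand: T_0 = (lam + mu) / (2 lam^2) and T_1 = 1 / lam, so the sum equals (3 lam + mu) / (2 lam^2), which reduces to 3 / (2 lam) when no recovery is performed.</p>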
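      <p>For the system under consideration, the mean time to failure is
T<sub>1</sub> = 3.891 × 10<sup>5</sup> hours with service discipline A and T<sub>2</sub> = 3.277 × 10<sup>3</sup> hours with service discipline B.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>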
The significance of the impact that ensuring the continuity of the computational
process has on duplicated systems of a cluster architecture is demonstrated. The
results were obtained on the basis of Markov reliability models of a
fault-tolerant cluster with migration of virtual machines for the case in which
recovery is impossible after the computational process has been interrupted. The time to the
first failure of a duplicated system is determined both with and without recovery in those
node failure states that do not violate the continuity of the computational process
needed to perform the critical functional tasks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kopetz</surname>
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Real-Time Systems: Design Principles for Distributed Embedded Applications</article-title>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Sorin</surname>
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Fault Tolerant Computer Architecture</article-title>
          . Morgan &amp; Claypool, Madison,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dudin</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A multiserver MAP/PH/N system with controlled broadcasting by unreliable servers</article-title>
          .
          <source>Automatic Control and Computer Sciences V. 5</source>
          ,
          <fpage>32</fpage>
          -
          <lpage>44</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zhmylev</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martynchuk</surname>
            <given-names>I. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kireev</surname>
            <given-names>V. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aliev</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Analytical methods of nonstationary processes modeling</article-title>
          . CEUR Workshop Proceedings V.
          <volume>2344</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>V. A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>S. V.</given-names>
          </string-name>
          :
          <article-title>Multipath Redundant Transmission with Packet Segmentation</article-title>
          .
          <source>2019 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF)</source>
          (
          <year>2019</year>
          ) doi: 10.1109/WECONF.2019.8840643
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>V. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>S. V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>A. V.</given-names>
          </string-name>
          :
          <article-title>Model and Interaction Efficiency of Computer Nodes Based on Transfer Reservation at Multipath Routing</article-title>
          .
          <source>2019 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF)</source>
          (
          <year>2019</year>
          ) doi: 10.1109/WECONF.2019.8840647
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jin</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Live virtual machine migration with adaptive memory compression</article-title>
          .
          <source>Proc. IEEE International Conf. on Cluster Computing (CLUSTER '09)</source>
          , Art. 5289170 (
          <year>2009</year>
          ) doi: 10.1109/CLUSTR.2009.5289170
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sahni</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>A hybrid approach to live migration of virtual machines</article-title>
          .
          <source>Proc. IEEE Int. Conf. on Cloud Computing for Emerging Markets (CCEM 2012)</source>
          ,
          <fpage>12</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2012</year>
          ) doi: 10.1109/CCEM.2012.6354587
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Poymanova</surname>
            <given-names>E. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatarnikova</surname>
            <given-names>T. M.</given-names>
          </string-name>
          :
          <article-title>Models and Methods for Studying Network Traffic</article-title>
          .
          <source>2018 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF)</source>
          (
          <year>2018</year>
          ) doi: 10.1109/WECONF.2018.8604470
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kutuzov</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatarnikova</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>On the Acceleration of Simulation Modeling</article-title>
          .
          <source>2019 XXII International Conference on Soft Computing and Measurements (SCM)</source>
          (
          <year>2019</year>
          ) doi: 10.1109/SCM.2019.8903785
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>Knowledge sharing portal UNIX/Linux-systems, open source systems, networks, and other related things</article-title>
          , http://xgu.ru/wiki/Kemari Last accessed 15 Sep 2019
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Elizarov</surname>
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Dell Live Volume: virtualize disk space</article-title>
          , http://blog.korphome.ru/2016/06/28/dell-live-volume Last accessed 15 Sep 2019
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>V. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aleksankov</surname>
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derkach</surname>
            <given-names>A. N.</given-names>
          </string-name>
          :
          <article-title>Model of Cluster Reliability with Migration of Virtual Machines and Restoration on Certain Level of System Degradation</article-title>
          .
          <source>2018 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF-2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Astakhova</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shamin</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verzun</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolbanev M. A. Astakhova</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shamin</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verzun</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolbanev</surname>
            <given-names>M.</given-names>
          </string-name>
          :
          <source>Proceedings of the 10th Majorov International Conference on Software Engineering and Computer Systems. CEUR Workshop Proceedings MICSECS</source>
          <year>2018</year>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Victorova</surname>
            <given-names>V. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanjanc</surname>
            <given-names>A. C.</given-names>
          </string-name>
          :
          <article-title>About reliability indicators of the average operating time type</article-title>
          .
          <source>Reliability</source>
          <volume>4</volume>
          (
          <issue>51</issue>
          ),
          <fpage>27</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Victorova</surname>
            <given-names>V. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanjanc</surname>
            <given-names>A. C.</given-names>
          </string-name>
          :
          <article-title>Models and methods for calculating the reliability of technical systems. 2nd edn</article-title>
          .
          <source>URSS LLC Lenand</source>
          , Moscow (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Bogatyrev</surname>
            <given-names>V. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parshutina</surname>
            <given-names>S. A.</given-names>
          </string-name>
          :
          <article-title>Redundant Distribution of Requests Through the Network by Transferring Them Over Multiple Paths</article-title>
          .
          <source>Communications in Computer and Information Science</source>
          ,
          <volume>601</volume>
          ,
          <fpage>199</fpage>
          -
          <lpage>207</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Zakoldaev</surname>
            <given-names>D. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shukalov</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zharinov</surname>
            <given-names>I. O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zharinov</surname>
            <given-names>O. O.</given-names>
          </string-name>
          :
          <article-title>Workstations Industry 4.0 for instrument engineering products</article-title>
          .
          <source>IOP Conference Series: Materials Science and Engineering</source>
          ,
          <volume>1</volume>
          (
          <issue>665</issue>
          ), pp.
          <volume>012014</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Korobeinikov</surname>
            <given-names>A. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedosovsky</surname>
            <given-names>M. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zharinov</surname>
            <given-names>I. O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polyakov</surname>
            <given-names>V. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shukalov</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurjanov</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arustamov</surname>
            <given-names>S. A.</given-names>
          </string-name>
          :
          <article-title>Method for Conceptual Presentation of Subject Tasks in Knowledge Engineering for Computer-Aided Design Systems</article-title>
          .
          <source>Advances in Intelligent Systems and Computing</source>
          , V.
          <volume>680</volume>
          ,
          <fpage>50</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>