=Paper=
{{Paper
|id=Vol-3057/paper33.pdf
|storemode=property
|title=Reliability of Computer Systems During Physical and Informational Recovery of Duplicated Memory
|pdfUrl=https://ceur-ws.org/Vol-3057/paper33.pdf
|volume=Vol-3057
|authors=Vladimir A. Bogatyrev,Stanislav V. Bogatyrev,Anatoly V. Bogatyrev
}}
==Reliability of Computer Systems During Physical and Informational Recovery of Duplicated Memory==
<pdf width="1500px">https://ceur-ws.org/Vol-3057/paper33.pdf</pdf>
<pre>
Reliability of Computer Systems During Physical and
Informational Recovery of Duplicated Memory

Vladimir A. Bogatyrev 1,2, Stanislav V. Bogatyrev 1,3 and Anatoly V. Bogatyrev 3
1 ITMO University, Kronverksky Pr. 49, bldg. A, Saint-Petersburg, 197101, Russia
2 Saint-Petersburg State University of Aerospace Instrumentation, 67, Bolshaya Morskaia str., St. Petersburg, Russia
3 JSC NEO Saint Petersburg Competence Center, 1-Ya Sovetskaya, house 6 str., St. Petersburg, Russia


                Abstract
                This paper analyzes how different requirements for the continuity of the computing
                process and for the preservation of information after failures affect the reliability of a
                fault-tolerant computer system consisting of a computing node with two redundant storage
                devices connected to it. A distinctive feature of restoring storage devices is that physical
                recovery must be followed by informational recovery, that is, by reloading the information
                lost in the failure. Three system recovery disciplines are considered, each accounting for
                both physical and informational memory recovery. In the first option, the recovered
                information is not unique and can always be loaded into the physically restored memory from
                an external source. In the second option, the information accumulated during system
                operation can be loaded into physically restored memory only from replicas stored within
                the system; losing all copies of unique data may make the information impossible to restore
                and thus cause an unrecoverable failure of the computer system. The third option requires
                continuity of the computing process, which holds as long as the computing node is operable
                and at least one of the two storage devices holding up-to-date information is available to
                it. A Markov model of the computer system is proposed that reflects the stages of physical
                and informational recovery of duplicated memory for different levels of system criticality
                to data loss caused by memory failures.

                Keywords
                computer system, redundancy, recovery, continuity of the computing process, availability
                coefficient

1. Introduction

    Real-time computer systems, especially those that are part of cyber-physical systems, must meet
strict requirements for reliability, fault tolerance, and readiness [1-5], as well as for the timely
servicing of requests [6-10]. In some real-time systems, functional safety and reliability also require
that the computing process remain continuous and that requests be serviced on time despite failures
and intentional attacks [11-13]. Timeliness and continuity of the computing process are achieved by
making data storage and processing facilities redundant, and may also involve redundant servicing of
requests [14, 15]. Besides improving reliability, timeliness, and continuity, redundancy of the
computing process makes it possible to check results by comparing the outputs of the redundant
computations. Critical-purpose computer systems may also impose heightened requirements for the
preservation of information; in that case the storage devices are made redundant (in the simplest
case, duplicated) and data are replicated across them. A distinctive feature of restoring storage
devices is that physical recovery must be followed by informational recovery, that is, by reloading
the information that was lost in the failure and is needed for the system to perform its functional
tasks. After the memory of a failed device has been physically restored, its information can be
recovered from the replicas accumulated in the non-failed device while application tasks were
running. Losing all replicas of unique data can make the information impossible to restore and thus
cause an unrecoverable failure of the computer system. RAID arrays are widely used to increase the
reliability of information stored in computer systems, and Markov models that account for the
specifics of information recovery are known for them [16, 17].
    In the design of modern complex computer systems for critical applications, there is a trend
toward a model-oriented approach, in which modeling and optimization of the candidate design
solutions are used to ensure high reliability and efficiency of the systems being developed [18, 19].
In the model-oriented design of real-time systems, reliability models need to account for the various
requirements for the continuity of the computing process and the preservation of information after
failures [14, 15, 20, 21].
    The purpose of this article is to build reliability models that support the justification of design
decisions aimed at ensuring the continuity of the computing process and the preservation of
information accumulated during the operation of a system with redundant memory.


    Proceedings of VI International Scientific and Practical Conference Distance Learning Technologies (DLT–2021), September 20-22,
2021, Yalta, Crimea
EMAIL: vabogatyrev@corp.ifmo.ru (A. 1); stanislav@nspcc.ru (A. 2); anatoly@nspcc.ru (A. 3)
ORCID: 0000-0003-0213-0223 (A. 1); 0000-0003-0836-8515 (A. 2); 0000-0001-5447-7275 (A. 3)
             ©️ 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)

2. Construction of Reliability Models for Physical and Informational Memory
   Recovery
    The object of study is a computer system consisting of a computing node with two redundant
(duplicated) storage devices (memory blocks) connected to it.
    The proposed reliability model accounts for two-stage memory recovery: in the first stage the
memory is physically restored, and in the second stage its information is restored. The storage
devices are assumed not to be built on RAID array technology [16, 17].
    We consider three options for organizing the recovery of information and of the system as a whole,
which differ in their requirements for the continuity of the computing process and the preservation of
information after failures.
    In the first option, the information being restored is assumed not to be unique, so it can always
be loaded into the physically restored memory from some external source.
    In the second option, the information being restored is assumed to be unique, so it can be loaded
into physically restored memory only from replicas held in a non-failed storage device. If both
storage devices fail, the information cannot be reloaded into a physically restored storage device, and
the system may become unable to function, that is, it enters a state of non-recoverable failure.
    In the third option, the computing process must remain continuous, which is possible as long as
the computing node is operational and at least one of the two storage devices holding up-to-date
information is available to it.
    The Markov reliability models for the first, second, and third options are represented by the
state-transition diagrams in Fig. 1, Fig. 2, and Fig. 3, respectively.
    System states are encoded as (x1, x2, x3). Here x1 = 1 if the computing node is operational and
x1 = 0 if it has failed. The variables x2 and x3 give the states of the two memory blocks: "1" for an
operational block, "0" for a failed block, and "F" for a block that has been physically restored but
does not yet hold the required information. States are invariant to the numbering of the memory
blocks (for example, states «101» and «110» are considered identical). The state probabilities are
denoted P0, P1, ..., P8. The diagrams show the failure rate λ1 and recovery rate µ1 of the computing
node, the failure rate λ2 of a memory block, and the rates µ21 and µ22 of its physical and
informational recovery.
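
    The state encoding can be illustrated with a short sketch. It is not part of the original model
description: the function names are hypothetical, and the ordering used to normalize a label is an
arbitrary choice made here. The sketch normalizes a label so that the two memory blocks are
interchangeable and lists every label the encoding admits; the diagrams may use only a subset of them.

```python
from itertools import combinations_with_replacement

# Memory-block symbols from the text: operational, failed,
# physically restored but without the required information.
MEMORY_STATES = ("1", "0", "F")


def canonical(x1, x2, x3):
    """Return a label that ignores the numbering of the two memory blocks.

    The sort order ("1" before "0" before "F") is an assumption of this sketch.
    """
    rank = {symbol: i for i, symbol in enumerate(MEMORY_STATES)}
    a, b = sorted((x2, x3), key=rank.__getitem__)
    return f"{x1}{a}{b}"


def all_states():
    """Enumerate every distinct label (x1, x2, x3) up to memory-block renumbering."""
    return [
        f"{x1}{a}{b}"
        for x1 in ("1", "0")
        for a, b in combinations_with_replacement(MEMORY_STATES, 2)
    ]


# «101» and «110» collapse to the same label, as the text requires.
same = canonical("1", "0", "1") == canonical("1", "1", "0")
states = all_states()
```

    Here `same` is true, and `states` lists the labels that the encoding allows in principle.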


Figure 1: Markov model of a computer for physical and informational recovery of duplicated memory

    From the state-transition diagrams one can write a system of algebraic (stationary case) or
differential (non-stationary case) equations whose solution gives the probabilities of all system
states. This yields the probability of system operability as a function of operating time, as well as
the stationary and non-stationary availability coefficients of repairable systems.
    For example, for the state-transition diagram in Fig. 1, the system of differential equations
has the form:
    -(2\lambda_2 + \lambda_1) P_0(t) + \mu_1 P_7(t) + \mu_{22} P_3(t) = P_0'(t),
    -(\lambda_2 + \lambda_1 + \mu_{21}) P_1(t) + \mu_1 P_8(t) + \mu_{22} P_4(t) + 2\lambda_2 P_0(t) + \lambda_2 P_3(t) = P_1'(t),
    -\mu_{21} P_2(t) + \lambda_2 P_1(t) + \lambda_2 P_4(t) = P_2'(t),
    -(\mu_{22} + 2\lambda_2 + \lambda_1) P_3(t) + \mu_{21} P_1(t) + \mu_1 P_5(t) = P_3'(t),
    -(\mu_{22} + \lambda_2 + \lambda_1) P_4(t) + \lambda_2 P_3(t) + \mu_{21} P_2(t) + \mu_1 P_6(t) = P_4'(t),
    -\mu_1 P_5(t) + \lambda_1 P_3(t) = P_5'(t),
    -\mu_1 P_6(t) + \lambda_1 P_4(t) = P_6'(t),
    -\mu_1 P_7(t) + \lambda_1 P_0(t) = P_7'(t),
    -\mu_1 P_8(t) + \lambda_1 P_1(t) = P_8'(t).
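
    As a hedged illustration of the algebraic (stationary) case, the sketch below reads the transitions
off the equations for Fig. 1, builds the generator matrix, and solves the balance equations together
with the normalization condition. The rate values are those quoted in Section 3. The set of operable
states {0, 1, 3} is an assumption made by analogy with the coefficient k(t) defined for Fig. 3, and the
function names are hypothetical.

```python
# Rates in 1/h, as quoted in Section 3 of the text.
LAM1, LAM2 = 1e-4, 2e-4            # failure: computing node, one memory block
MU1, MU21, MU22 = 1.0, 1.0, 11.0   # recovery: node, memory (physical), memory (informational)

# (from_state, to_state, rate), read off the Fig. 1 equations term by term.
TRANSITIONS = [
    (0, 1, 2 * LAM2), (0, 7, LAM1),
    (1, 2, LAM2), (1, 3, MU21), (1, 8, LAM1),
    (2, 4, MU21),
    (3, 0, MU22), (3, 1, LAM2), (3, 4, LAM2), (3, 5, LAM1),
    (4, 1, MU22), (4, 2, LAM2), (4, 6, LAM1),
    (5, 3, MU1), (6, 4, MU1), (7, 0, MU1), (8, 1, MU1),
]


def generator(n, transitions):
    """Build the n x n generator matrix Q; each row sums to zero."""
    q = [[0.0] * n for _ in range(n)]
    for i, j, rate in transitions:
        q[i][j] += rate
        q[i][i] -= rate
    return q


def stationary(q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination with partial pivoting."""
    n = len(q)
    # Row i is the balance equation of state i; the last one is replaced by normalization.
    a = [[q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    a[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


Q = generator(9, TRANSITIONS)
pi = stationary(Q)
# Assumption: the system is operable in states 0, 1 and 3.
availability = pi[0] + pi[1] + pi[3]
```

    Every row of `Q` summing to zero is a quick check that the transitions were transcribed
consistently; `availability` is the stationary availability coefficient under the stated assumption.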


Figure 2: Markov model of a computer for physical and informational recovery of duplicated memory


Figure 3: Markov model of a computer with the criticality of the continuity of the computational
process

   For the state-transition diagram in Fig. 3, the system of differential equations has the form:
    -(2\lambda_2 + \lambda_1) P_0(t) + \mu_{22} P_3(t) = P_0'(t),
    -(\lambda_1 + \lambda_2 + \mu_{21}) P_1(t) + 2\lambda_2 P_0(t) + \lambda_2 P_3(t) = P_1'(t),
    \lambda_1 P_0(t) + (\lambda_1 + \lambda_2) P_1(t) + (\lambda_1 + \lambda_2) P_3(t) = P_2'(t),
    -(\mu_{22} + 2\lambda_2 + \lambda_1) P_3(t) + \mu_{21} P_1(t) = P_3'(t).
   The non-stationary readiness (availability) coefficient, that is, the probability that the system
is ready to function without the computing process being interrupted by failures or loss of
information, is defined as
    k(t) = P_0(t) + P_1(t) + P_3(t).
   Note that the stationary readiness coefficient is zero for the options of Fig. 2 and Fig. 3, because
each of them contains an absorbing state (non-recoverable failure or interrupted computation), and is
nonzero for the option of Fig. 1.
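
   The behavior of k(t) for the Fig. 3 option can be sketched numerically. The example below is an
illustration under stated assumptions, not the authors' calculation: it integrates the four equations
with a classical fourth-order Runge-Kutta scheme, uses the rate values quoted in Section 3, and assumes
that the system starts in state 0. The rate of leaving state 1 is taken to be λ1 + λ2 + µ21, which is
what the inflow term (λ1 + λ2) P1(t) in the equation for P2 requires for the probabilities to sum to
one. The function names and the step size are hypothetical choices.

```python
# Rates in 1/h, as quoted in Section 3 of the text.
LAM1, LAM2 = 1e-4, 2e-4   # failure: computing node, one memory block
MU21, MU22 = 1.0, 11.0    # memory recovery: physical, informational


def derivative(p):
    """Right-hand side of the Fig. 3 equations for p = [P0, P1, P2, P3]."""
    p0, p1, _p2, p3 = p
    return [
        -(2 * LAM2 + LAM1) * p0 + MU22 * p3,
        # Assumption: outflow from state 1 includes LAM1 so that the flows balance.
        -(LAM1 + LAM2 + MU21) * p1 + 2 * LAM2 * p0 + LAM2 * p3,
        LAM1 * p0 + (LAM1 + LAM2) * p1 + (LAM1 + LAM2) * p3,
        -(MU22 + 2 * LAM2 + LAM1) * p3 + MU21 * p1,
    ]


def rk4_step(p, h):
    """Advance the state probabilities by one Runge-Kutta step of length h (hours)."""
    k1 = derivative(p)
    k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)])
    k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)])
    k4 = derivative([x + h * k for x, k in zip(p, k3)])
    return [
        x + h / 6.0 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(p, k1, k2, k3, k4)
    ]


def readiness(t_end, h=0.05):
    """Return (k(t_end), probabilities), starting from state 0 at t = 0."""
    p = [1.0, 0.0, 0.0, 0.0]
    for _ in range(round(t_end / h)):
        p = rk4_step(p, h)
    return p[0] + p[1] + p[3], p


k_100, _ = readiness(100.0)
k_1000, probs = readiness(1000.0)
```

   In this sketch k(t) decreases with time, because state 2 has no outgoing transitions and gradually
collects all the probability mass; this is consistent with the zero stationary value noted for Fig. 3.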


3. Example of Calculating the Reliability of a Computer System
    Figure 4 shows an example calculation of the non-stationary availability coefficient of computer
systems with duplicated storage devices. Curve 1 is the non-stationary availability coefficient k of a
system that is not critical to data loss, and curve 2 is the non-stationary availability coefficient of
a system that is critical to the loss of unique data. Curve 3 is the difference d between the
operability probabilities of the option critical to the loss of information and the option critical to
the continuity of the computing process. The calculation was performed for λ1 = 10^-4 1/h,
λ2 = 2·10^-4 1/h, µ1 = 1 1/h, µ21 = 1 1/h, and µ22 = 11 1/h.
    The resulting dependencies confirm that the factors under consideration significantly affect the
reliability of the computer systems studied.


Figure 4: Calculation of the non-stationary availability coefficient of computer systems with duplicated
storage devices


4. Conclusions

   For real-time computer systems of cluster architecture, efficiency criteria are proposed, and it is
shown that the overall efficiency of servicing heterogeneous request traffic can be increased by
replicating waiting-critical requests and by allocating groups of nodes to service particular types of
requests within acceptable waiting delays.
   An analytical model is proposed, and the efficiency of redundant service options is determined for
the case where cluster nodes can be allocated to the requests that are most critical to waiting in
queues.
   It is shown that redundant servicing of latency-critical requests has a region of efficiency when
cluster nodes are divided into groups that service requests of different latency criticality.


References
[1] H. Aysan, Fault-tolerance strategies and probabilistic guarantees for real-time systems. Mälardalen
    University, Västerås, Sweden. 2012. 190 p.
[2] M. L. Shooman, Reliability of computer systems and networks. John Wiley & Sons Inc., 2002.
[3] G. N. Cherkesov, Reliability of Hardware and Software Systems, (in Russian). Saint-Petersburg:
    Piter, 2005.
[4] M. Polovko, S. V. Gurov, Basis of Reliability Theory, (in Russian). Saint-Petersburg: BHV-
    Petersburg, 2006.
[5] I. Koren, Fault-tolerant systems. Morgan Kaufmann Publishers, San Francisco, 2009. 378 p.
[6] M. Bennis, M. Debbah, H.V. Poor, Ultrareliable and Low-Latency Wireless Communication: Tail,
    Risk, and Scale. Proc. IEEE 2018, 106, 1834–1853. DOI: 10.1109/JPROC.2018.2867029
[7] I S. Kim, Y. Choi, Constraint-aware VM placement in heterogeneous computing clusters. Cluster
    Comput. 23, 71–85 (2020). https://doi.org/10.1007/s10586-019-02966-6.
[8] J. Sachs, G. Wikström, T. Dudda, R. Baldemair, K. Kittichokechai, 5G Radio Network Design for
    Ultra-Reliable Low-Latency Communication. IEEE Netw. 2018, 32, 24–31.
    DOI: 10.1109/MNET.2018.1700232.
[9] H. Ji, S. Park, J. Yeo, Y. Kim, J. Lee, B. Shim, Ultra-Reliable and Low-Latency Communications in
    5G Downlink: Physical Layer Aspects. IEEE Wirel. Commun. 2018, 25, 124–130.
    DOI: 10.1109/MWC.2018.1700294.
[10] S. Samarasinghe, Neural Networks for Applied Sciences and Engineering: From Fundamentals
    to Complex Pattern Recognition. Boston: Auerbach Publications, 2016. 570 p.
[11] M. Siddiqi, H. Yu, J. Joung, 5G Ultra-Reliable Low-Latency Communication Implementation
    Challenges and Operational Issues with IoT Devices. Electronics 2019, 8, 981.
    DOI: 10.3390/electronics8090981
[12] D.A. Zakoldaev, A.G. Korobeynikov, A.V. Shukalov, I.O. Zharinov, O.O. Zharinov, Industry
    4.0 vs Industry 3.0: the role of personnel in production. IOP Conference Series: Materials Science
    and Engineering, 2020, Vol. 734, No. 1, 012048. DOI: 10.1088/1757-899X/734/1/012048
[13] V. Malik, C.R. Barde, Live migration of virtual machines in cloud environment using
    prediction of CPU usage. International Journal of Computer Applications. 2015. Vol. 117, No. 23,
    pp. 1–5. DOI: 10.5120/20691-3604
[14] V.A. Bogatyrev, A.V. Bogatyrev, S.V. Bogatyrev, Redundant Servicing of a Flow of
    Heterogeneous Requests Critical to the Total Waiting Time During the Multi-path Passage of a
    Sequence of Info-Communication Nodes. Lecture Notes in Computer Science (including subseries
    Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2020. Vol. 12563,
    pp. 100-112. DOI: 10.1007/978-3-030-66471-8_9
[15] V.A. Bogatyrev, S.V. Bogatyrev, A.V. Bogatyrev, Redundant multi-path service of a flow
    heterogeneous in delay criticality with defined node passage paths. Journal of Physics: Conference
    Series, 2021, Vol. 1864, No. 1, 012094 (13th Multiconference on Control Problems (MCCP 2020),
    6-8 October 2020, Saint Petersburg, Russia). DOI: 10.1088/1742-6596/1864/1/012094.
[16] M. Rausand, A. Hoyland, System reliability theory. John Wiley & Sons Inc., 2004.
[17] K. M. Greenan, J. S. Plank, J. J. Wylie, Mean time to meaningless: MTTDL, Markov models,
    and storage system reliability, HotStorage (2010).
[18] B. Sovetov, T. Tatarnikova, V. Cehanovsky, A detection system for threats of the presence of
    the hazardous substance in the environment. Proceedings of 2019 22nd International Conference on
    Soft Computing and Measurements, SCM 2019 (2019) 121-124. DOI: 10.1109/SCM.2019.8903771.
[19] T. Astakhova, N. Verzun, M. Kolbanev, A. Shamin, A model for estimating energy
    consumption seen when nodes of ubiquitous sensor networks communicate information to each
    other. In Proceedings of the 10th Majorov International Conference on Software Engineering and
    Computer Systems, Saint Petersburg, Russia, December 20-21 (2018).
[20] V.A. Bogatyrev, S.V. Bogatyrev, A.N. Derkach, Timeliness of the Reserved Maintenance by
    Duplicated Computers of Heterogeneous Delay-Critical Stream. CEUR Workshop Proceedings.
    2019. Vol. 2522, pp. 26-36.
[21] V.A. Bogatyrev, A.N. Derkach, Evaluation of a Cyber-Physical Computing System with
    Migration of Virtual Machines during Continuous Computing. Computers, 2020, Vol. 9, No. 2,
    42. DOI: 10.3390/computers9020042.



</pre>