=Paper=
{{Paper
|id=Vol-3057/paper33.pdf
|storemode=property
|title=Reliability of Computer Systems During Physical and Informational Recovery of Duplicated Memory
|pdfUrl=https://ceur-ws.org/Vol-3057/paper33.pdf
|volume=Vol-3057
|authors=Vladimir A. Bogatyrev,Stanislav V. Bogatyrev,Anatoly V. Bogatyrev
}}
==Reliability of Computer Systems During Physical and Informational Recovery of Duplicated Memory==
Vladimir A. Bogatyrev (1,2), Stanislav V. Bogatyrev (1,3) and Anatoly V. Bogatyrev (3)

1 ITMO University, Kronverksky Pr. 49, bldg. A, Saint-Petersburg, 197101, Russia

2 Saint-Petersburg State University of Aerospace Instrumentation, 67, Bolshaya Morskaia str., St. Petersburg, Russia

3 JSC NEO Saint Petersburg Competence Center, 1-Ya Sovetskaya, house 6 str., St. Petersburg, Russia

Abstract

This paper analyzes how different requirements for the continuity of the computing process and the preservation of information after failures affect the reliability of a fault-tolerant computer system consisting of a computing node with two redundant storage devices connected to it. A distinctive feature of restoring storage devices is that physical recovery must be followed by informational recovery, that is, reloading the information lost after a failure. Three system recovery disciplines are considered, each taking into account both physical and informational memory recovery. In the first option, the recovered information is not unique and can always be loaded into the physically restored memory from an external source. In the second option, the information accumulated during system operation can be loaded into physically restored memory only from replicas stored in the system. The loss of all copies of the unique data may make it impossible to restore the information and, as a result, lead to an unrecoverable failure of the computer system. The third option requires continuity of the computing process, which is possible only while the computing node is operational and at least one of the two storage devices holding up-to-date information is available to it.
A Markov model of the computer system is proposed that reflects the stages of physical and informational recovery of duplicated memory for different levels of system criticality to data loss caused by memory failures.

Keywords: computer system, redundancy, recovery, continuity of the computing process, availability coefficient

Proceedings of VI International Scientific and Practical Conference Distance Learning Technologies (DLT–2021), September 20-22, 2021, Yalta, Crimea. EMAIL: vabogatyrev@corp.ifmo.ru (A. 1); stanislav@nspcc.ru (A. 2); anatoly@nspcc.ru (A. 3). ORCID: 0000-0003-0213-0223 (A. 1); 0000-0003-0836-8515 (A. 2); 0000-0001-5447-7275 (A. 3). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

===1. Introduction===

Computer systems operating in real time, especially as part of cyber-physical systems, must meet high requirements for reliability, fault tolerance, readiness [1-5], and timeliness of servicing requests [6-10]. In some real-time systems, functional safety and reliability require that the continuity of the computing process and the timeliness of servicing requests be maintained in the event of failures and intentional impacts [11-13]. Timeliness and continuity of the computing process are achieved by making data storage and processing facilities redundant, and may also involve redundant servicing of requests [14,15]. In addition to ensuring reliability, timeliness, and continuity, redundancy of the computing process makes it possible to implement control by comparing the results of calculations.

Computer systems for critical applications may also have increased requirements for the preservation of information. In this case, redundancy of storage devices (in the simplest case, duplication) with data replication in them is applied.
A distinctive feature of restoring storage devices is that physical recovery must be followed by informational recovery, that is, reloading the information that was lost after a failure and is required to perform the functional tasks assigned to the system. After the memory of a failed device has been physically restored, its information can be recovered from replicas accumulated in a non-failed device during the execution of application tasks. The loss of all replicas of unique data can make it impossible to restore the information and, as a result, lead to an unrecoverable failure of the computer system.

RAID arrays are widely used to increase the reliability of information stored in computer systems, and Markov models that account for the features of information recovery are known for them [16, 17].

In the design of modern complex computer systems for critical applications, there is a trend toward a model-oriented approach, in which modeling and optimization of candidate design solutions are used to ensure high reliability and efficiency of the systems being developed [18,19]. In the model-oriented design of real-time systems, reliability models must take into account the various requirements for continuity of the computing process and preservation of information after failures [14, 15, 20, 21].

The purpose of this article is to build reliability models that support the justification of design decisions aimed at ensuring the continuity of the computing process and the preservation of information accumulated during operation of a system with redundant memory.

===2. Construction of Reliability Models for Physical and Informational Memory Recovery===

The object of research is a computer system consisting of a computing node with two redundant (duplicated) storage devices (memory blocks) connected to it. The proposed reliability model takes into account two-stage memory recovery.
At the first stage, the memory is physically restored; at the second stage, its information is restored. The storage devices are assumed not to be built on RAID array technology [16, 17].

Three options for organizing the recovery of information and of the system as a whole are considered, reflecting different requirements for continuity of the computing process and preservation of information after failures.

For the first option, we assume that the information being restored is not unique and can always be loaded into the physically restored memory from some source.

For the second option, we assume that the information being restored is unique and can be loaded into physically restored memory only from replicas stored in a non-failed storage device. If both storage devices fail, the information cannot be loaded into a physically restored storage device, which may leave the system unable to function, that is, cause its failure (a transition to a state of unrecoverable failure).

For the third option, we assume that continuity of the computing process must be ensured, which is possible only if the computing node is operational and at least one of the two storage devices holding up-to-date information is available to it.

The Markov reliability models for the first, second, and third variants of the computer system organization are represented by the state and transition diagrams in Fig. 1, Fig. 2, and Fig. 3, respectively. The system states are encoded in the form (x1, x2, x3). Here x1 = 1 if the computing node is operational and x1 = 0 if it has failed. The variables x2 and x3 describe the states of the memory blocks: "1" means the block is operational, "0" means it has failed, and "F" means it has been physically restored but the required information has not yet been loaded into it.
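The state encoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `MEMORY_SYMBOLS`, `canonical`, and `all_states` are hypothetical, and the two memory blocks are treated as interchangeable, so a code is normalized by listing the memory symbols in a fixed order.

```python
from itertools import product

# Memory block symbols from the text: "1" operational, "F" physically restored
# but not yet loaded with information, "0" failed. The ordering is an assumption
# used only to pick one representative code per state.
MEMORY_SYMBOLS = ("1", "F", "0")


def canonical(x1, x2, x3):
    """Return the state code (x1, x2, x3) with the memory symbols in fixed order.

    Because the two memory blocks are treated as interchangeable, the codes
    101 and 110 map to the same canonical code.
    """
    rank = {symbol: index for index, symbol in enumerate(MEMORY_SYMBOLS)}
    first, second = sorted((x2, x3), key=rank.__getitem__)
    return f"{x1}{first}{second}"


def all_states():
    """Enumerate every distinct candidate code for one node and two memory blocks."""
    return sorted(
        {canonical(x1, x2, x3)
         for x1, x2, x3 in product("10", MEMORY_SYMBOLS, MEMORY_SYMBOLS)}
    )


if __name__ == "__main__":
    print(canonical("1", "0", "1"))  # same code as canonical("1", "1", "0")
    print(all_states())
```

The enumeration lists every combination that the encoding permits; a particular diagram may use only a subset of these codes, depending on the recovery discipline being modeled.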
States are invariant to the numbering of memory blocks (for example, the states "101" and "110" are considered identical). The probabilities of the system states are denoted P0, P1, ..., P8. The diagrams show the failure and recovery rates of the computing node, λ1 and µ1, the failure rate of a memory block, λ2, and the rates of physical and informational memory recovery, µ21 and µ22.

Figure 1: Markov model of a computer for physical and informational recovery of duplicated memory

The state and transition diagrams make it possible to write a system of algebraic or differential equations whose solution gives the probabilities of all system states. From these, one can determine how the probability of system operability depends on operating time, as well as the stationary and non-stationary availability coefficients of recoverable systems.

For example, for the state and transition diagram in Fig. 1, the system of differential equations has the form:

<math>-(2\lambda_2 + \lambda_1) P_0(t) + \mu_1 P_7(t) + \mu_{22} P_3(t) = P_0'(t),</math>

<math>-(\lambda_2 + \lambda_1 + \mu_{21}) P_1(t) + \mu_1 P_8(t) + \mu_{22} P_4(t) + 2\lambda_2 P_0(t) + \lambda_2 P_3(t) = P_1'(t),</math>

<math>-\mu_{21} P_2(t) + \lambda_2 P_1(t) + \lambda_2 P_4(t) = P_2'(t),</math>

<math>-(\mu_{22} + 2\lambda_2 + \lambda_1) P_3(t) + \mu_{21} P_1(t) + \mu_1 P_5(t) = P_3'(t),</math>

<math>-(\mu_{22} + \lambda_2 + \lambda_1) P_4(t) + \lambda_2 P_3(t) + \mu_{21} P_2(t) + \mu_1 P_6(t) = P_4'(t),</math>

<math>-\mu_1 P_5(t) + \lambda_1 P_3(t) = P_5'(t),</math>

<math>-\mu_1 P_6(t) + \lambda_1 P_4(t) = P_6'(t),</math>

<math>-\mu_1 P_7(t) + \lambda_1 P_0(t) = P_7'(t),</math>

<math>-\mu_1 P_8(t) + \lambda_1 P_1(t) = P_8'(t).</math>

Figure 2: Markov model of a computer for physical and informational recovery of duplicated memory

Figure 3: Markov model of a computer with the criticality of the continuity of the computational process

For the diagram of states and transitions according to Fig.
3, the system of differential equations has the form:

<math>-(2\lambda_2 + \lambda_1) P_0(t) + \mu_{22} P_3(t) = P_0'(t),</math>

<math>-(\lambda_2 + \lambda_1 + \mu_{21}) P_1(t) + 2\lambda_2 P_0(t) + \lambda_2 P_3(t) = P_1'(t),</math>

<math>\lambda_1 P_0(t) + (\lambda_1 + \lambda_2) P_1(t) + (\lambda_1 + \lambda_2) P_3(t) = P_2'(t),</math>

<math>-(\mu_{22} + 2\lambda_2 + \lambda_1) P_3(t) + \mu_{21} P_1(t) = P_3'(t).</math>

The non-stationary readiness coefficient (the probability that the system is ready to function without interruption of the computing process due to failures or loss of information) is defined as

<math>k(t) = P_0(t) + P_1(t) + P_3(t).</math>

Note that the stationary readiness coefficient for the variants corresponding to Fig. 3 and Fig. 2 is zero, because both models contain an absorbing state of unrecoverable failure, whereas for the variant according to Fig. 1 it is nonzero.

===3. Example of Calculating the Reliability of a Computer System===

An example of calculating the non-stationary availability coefficient of computer systems with duplicated storage devices is shown in Figure 4. Curve 1 corresponds to the non-stationary availability coefficient k of a system that is not critical to data loss, and curve 2 corresponds to the non-stationary availability coefficient of a system that is critical to the loss of unique data. Curve 3 shows the difference d in the probability of operability between the option critical to the loss of information and the option that requires continuity of the computational process. The calculation was performed for λ1 = 10<sup>-4</sup> 1/h, λ2 = 2·10<sup>-4</sup> 1/h, µ1 = 1 1/h, µ21 = 1 1/h, µ22 = 11 1/h. The resulting dependencies confirm that the factors under consideration significantly affect the reliability of the computer systems under study.

Figure 4: Calculation of the non-stationary availability coefficient of computer systems with duplicated storage devices

===4. Conclusions===

For real-time computer systems with a cluster architecture, efficiency criteria are proposed, and it is shown that the overall efficiency of servicing heterogeneous request traffic can be increased by replicating waiting-critical requests and by allocating groups of nodes to serve particular request types within acceptable waiting delays. An analytical model is proposed, and the efficiency of redundant service options is determined, including the possible allocation of cluster nodes to serve the requests whose waiting time in queues is most critical. It is shown that there is a region in which redundant servicing of latency-critical requests is efficient when cluster nodes are divided into groups that serve requests of different latency criticality.

===References===

[1] I. Aysan, Fault-tolerance strategies and probabilistic guarantees for real-time systems. Mälardalen University, Västerås, Sweden, 2012. 190 p.

[2] M. L. Shooman, Reliability of computer systems and networks. John Wiley & Sons Inc., 2002.

[3] G. N. Cherkesov, Reliability of Hardware and Software Systems (in Russian). Saint-Petersburg: Piter, 2005.

[4] M. Polovko, S. V. Gurov, Basis of Reliability Theory (in Russian). Saint-Petersburg: BHV-Petersburg, 2006.

[5] H. Koren, Fault-tolerant systems. Morgan Kaufmann publications, San Francisco, 2009. 378 p.

[6] M. Bennis, M. Debbah, H.V. Poor, Ultrareliable and Low-Latency Wireless Communication: Tail, Risk, and Scale. Proc. IEEE 2018, 106, 1834–1853. DOI: 10.1109/JPROC.2018.2867029

[7] I S. Kim, Y. Choi, Constraint-aware VM placement in heterogeneous computing clusters. Cluster Comput. 23, 71–85 (2020). https://doi.org/10.1007/s10586-019-02966-6.

[8] J. Sachs, G. Wikström, T. Dudda, R. Baldemair, K. Kittichokechai, 5G Radio Network Design for Ultra-Reliable Low-Latency Communication. IEEE Netw. 2018, 32, 24–31. DOI: 10.1109/MNET.2018.1700232.

[9] H. Ji, S. Park, J. Yeo, Y. Kim, J. Lee, B.
Shim, Ultra-Reliable and Low-Latency Communications in 5G Downlink: Physical Layer Aspects. IEEE Wirel. Commun. 2018, 25, 124–130. DOI: 10.1109/MWC.2018.1700294.

[10] S. Samarasinghe, Neural Networks for Applied Sciences and Engineering: from Fundamentals to Complex Pattern Recognition. Boston: Auerbach publications, 2016. 570 p.

[11] M. Siddiqi, H. Yu, J. Joung, 5G Ultra-Reliable Low-Latency Communication Implementation Challenges and Operational Issues with IoT Devices. Electronics 2019, 8, 981. DOI: 10.3390/electronics8090981

[12] D.A. Zakoldaev, A.G. Korobeynikov, A.V. Shukalov, I.O. Zharinov, O.O. Zharinov, Industry 4.0 vs Industry 3.0: the role of personnel in production. IOP Conference Series: Materials Science and Engineering, 2020, Vol. 734, No. 1, 012048. DOI 10.1088/1757-899X/734/1/012048

[13] V. Malik, C.R. Barde, Live migration of virtual machines in cloud environment using prediction of CPU usage. International Journal of Computer Applications. 2015. Vol. 117, No. 23, pp. 1–5. DOI: 10.5120/20691-3604

[14] V.A. Bogatyrev, A.V. Bogatyrev, S.V. Bogatyrev, Redundant Servicing of a Flow of Heterogeneous Requests Critical to the Total Waiting Time During the Multi-path Passage of a Sequence of Info-Communication Nodes. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2020. Vol. 12563. pp. 100-112. DOI 10.1007/978-3-030-66471-8_9

[15] V.A. Bogatyrev, S.V. Bogatyrev, A.V. Bogatyrev, Redundant multi-path service of a flow heterogeneous in delay criticality with defined node passage paths. Journal of Physics: Conference Series. 2021. Vol. 1864, No. 1, 012094 (13th Multiconference on Control Problems (MCCP 2020), 6-8 October 2020, Saint Petersburg, Russia). DOI 10.1088/1742-6596/1864/1/012094.

[16] M. Rausand, A. Hoyland, System reliability theory. John Wiley & Sons Inc., 2004.

[17] M.K. Greenan, J. S. Plank, J. J. Wylie, Mean time to meaningless: MTTDL, Markov models, and storage system reliability. HotStorage (2010).

[18] B. Sovetov, T. Tatarnikova, V. Cehanovsky, A detection system for threats of the presence of the hazardous substance in the environment. Proceedings of 2019 22nd International Conference on Soft Computing and Measurements, SCM 2019 (2019) 121-124. DOI: 10.1109/SCM.2019.8903771.

[19] T. Astakhova, N. Verzun, M. Kolbanev, A. Shamin, A model for estimating energy consumption seen when nodes of ubiquitous sensor networks communicate information to each other. In Proceedings of the 10th Majorov International Conference on Software Engineering and Computer Systems, Saint Petersburg, Russia, December 20-21 (2018).

[20] V.A. Bogatyrev, S.V. Bogatyrev, A.N. Derkach, Timeliness of the Reserved Maintenance by Duplicated Computers of Heterogeneous Delay-Critical Stream. CEUR Workshop Proceedings. 2019. Vol. 2522. pp. 26-36.

[21] V.A. Bogatyrev, A.N. Derkach, Evaluation of a Cyber-Physical Computing System with Migration of Virtual Machines during Continuous Computing. Computers. 2020. Vol. 9, No. 2, 42. DOI 10.3390/computers9020042.
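
As a supplementary illustration of Section 2, the non-stationary readiness coefficient k(t) of the third variant (Fig. 3) can be estimated numerically. The sketch below is not the authors' calculation procedure. It assumes that state 0 is the fully operational state, state 1 has one failed memory block, state 3 has one physically restored block awaiting information, and state 2 is the absorbing failure state. It also assumes that the total outflow of each state equals the sum of its outgoing rates, so that probability is conserved. The identifiers `LAM1`, `LAM2`, `MU21`, `MU22`, `TRANSITIONS`, and `readiness`, the fixed-step fourth-order Runge-Kutta integrator, and the step size are all choices made here for illustration. The rate values are those quoted in Section 3.

```python
import math

# Rates in 1/h, taken from the example in Section 3 (names are hypothetical).
LAM1, LAM2 = 1e-4, 2e-4   # failure rates: computing node, one memory block
MU21, MU22 = 1.0, 11.0    # memory recovery rates: physical, informational

# Assumed transitions (source state, target state, rate); state 2 is absorbing.
TRANSITIONS = [
    (0, 1, 2 * LAM2),     # one of the two working memory blocks fails
    (0, 2, LAM1),         # node fails: the computing process is interrupted
    (1, 3, MU21),         # failed block is physically restored
    (1, 2, LAM1 + LAM2),  # node fails or the last working block fails
    (3, 0, MU22),         # information is reloaded into the restored block
    (3, 1, LAM2),         # the block awaiting information fails again
    (3, 2, LAM1 + LAM2),  # node fails or the only up-to-date block fails
]


def derivative(p):
    """Right-hand side of the Kolmogorov equations built from TRANSITIONS."""
    dp = [0.0] * len(p)
    for src, dst, rate in TRANSITIONS:
        flow = rate * p[src]
        dp[src] -= flow
        dp[dst] += flow
    return dp


def rk4_step(p, h):
    """Advance the state probabilities by one classical Runge-Kutta step."""
    k1 = derivative(p)
    k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)])
    k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)])
    k4 = derivative([x + h * k for x, k in zip(p, k3)])
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def readiness(t_end, h=0.01):
    """Return (k(t_end), probabilities), starting from the fully working state."""
    p = [1.0, 0.0, 0.0, 0.0]
    for _ in range(round(t_end / h)):
        p = rk4_step(p, h)
    return p[0] + p[1] + p[3], p


if __name__ == "__main__":
    for hours in (10.0, 100.0, 1000.0):
        k, _ = readiness(hours)
        print(f"t = {hours:7.1f} h, k(t) = {k:.6f}")
```

Under these assumptions every non-absorbing state leaves toward the failure state at a rate of at least λ1, so k(t) cannot exceed exp(-λ1 t). The step of 0.01 h is small relative to the fastest rate µ22, which keeps the explicit integrator stable. For stiffer parameter sets an implicit or library solver would be a safer choice.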