=Paper= {{Paper |id=Vol-1940/paper08 |storemode=property |title=Stochastic Modelling of Large-Scale Distributed Computer Systems Functioning with Group Restorations |pdfUrl=https://ceur-ws.org/Vol-1940/paper08.pdf |volume=Vol-1940 |authors=Valery Pavsky,Kirill Pavsky }} ==Stochastic Modelling of Large-Scale Distributed Computer Systems Functioning with Group Restorations== https://ceur-ws.org/Vol-1940/paper08.pdf
   Stochastic Modelling of Large-Scale Distributed
Computer Systems Functioning with Group Restorations

                             Valery Pavsky 1, and Kirill Pavsky 2
    1
        Kemerovo Institute of Food Science and Technology (University), Kemerovo, Russia
                                     pavva46@mail.ru

    2
        Rzhanov Institute of Semiconductor Physics Siberian Branch of Russian Academy of
                   Sciences, pr. Lavrentieva, 13, Novosibirsk, 630090, Russia
                                      pkv@isp.nsc.ru



         Abstract.
         The model of functioning of distributed computer systems with group restora-
         tions of failed machines is considered. The model is formalized in the form of a
         system of differential equations, in which are unknown probabilities of the
         states. The paper proposes solutions for calculating the mathematical expecta-
         tion of the number of efficient machines and variances, which are the basis for
         creating indices of potential robustness.
         The investigation of the functioning of the CS under the assumption of the va-
         lidity of the exponential law of failure of computers makes it possible, due to a
         well-developed theory, to obtain profound results, in contrast to the use of other
         distribution laws. And the obtained analytical solutions can be used for rapid
         analysis of the functioning of the CS.

         Keywords: distributed computer systems, mathematical model, group restora-
         tion, number of working machines, analytical solutions.


1        Introduction

The problem of the reliability of high-performance scalable computing devices, in-
cluding supercomputers and computer systems [1], increases with the number of ele-
mentary machines (EM) [2], the number of which in such systems ranges from sever-
al tens to hundreds of thousands [3]. Practice shows that in scalable computing sys-
tems (CS) the time between different types of failures can be measured by hours [4,5].
The preservation of the efficiency of the CS in the conditions of failures [6-8], the
analysis of functioning with regard to robustness, is an urgent task.
This work is devoted to the development of means for analyzing the efficiency of the
operation of larger-scale distributed CS [9, 10]. The queuing theory apparatus is used
as an analysis tool.
The paper proposes formulas for calculating the mathematical expectation and va-
riance of the number of working machines, which are the basis for creating indicators
of potential robustness in group recovery [2]. The indices of potential robustness of
                                                                                        61


the CS take into account the fact that in the solution of problems all the working EMs
are used, the number of which is actually instant. This assumes that parallel programs
of complex tasks, when implemented on survivable CS, are capable of using the total
performance of all working EM systems. Assuming mathematical idealization, CS
can be regarded as a stochastic object.


2      Model of the CS functioning

A stochastic model of the distributed CS operation is proposed, in which the EMs are
not absolutely reliable [2]. Each of them fails with  intensity. Out of order EM gets
into the restoration system and is waiting for resoration. At random moments of time,
the recovery is exercised by groups in r EM with  intensity (Fig.1). We believe that
at the initial time the system contains n EMs. As performance indices (evaluation of
potential viability), we use numerical characteristics – the mathematical expectation
of the number of effective EMs and its variance [2].
                       Computer System                             Restoration System



                                                                          
                                                 r           r

                                           μ           μ           μ
                                                            



                                                       k
                                .




Fig. 1. Model of the functioning of a scalable CS in the group recovery of failed EM.

In constructing the model of the functioning of the CS, we use the methods of
queuing theory, in which models of this class are formulated according to the tried-
and-tested method – a system of differential equations is compiled, and the probabili-
ty distribution is considered as unknown functions. The analytical solution of such
systems is far from always available, even stationary [2]. Assuming further mathemat-
ical idealization of the model, we believe that the number of EMs in the CS is poten-
tially infinite, which is permissible because of its scalability. From the formalized
system of differential equations of the model, we find an analytical solution, directly
for the moments, bypassing the probability distribution.
62


3      Mathematical Model

The queuing system (QS) with an infinite number of channels receives a Poisson
stream of packets with  intensity [11, 12]. Each packet consists of r requirements.
At any fixed time t  [ 0 ,  ) , the QS is in one of a number of incompatible states C k -
where k is the number of requirements in the QS, including undeserved requirements.
The service time of each requirement is subject to an exponential distribution with a
parameter  . If the system is in the C k state, then one of the k requirements leaves
the system with the k intensity. It is required to find the mathematical expectation
of the M ( t ) state number, in which the system is located when servicing the re-
quirements and the corresponding variance D( t ) , provided that M ( 0 )  n , D( 0 )  0 .
Figure 2 shows a graph-scheme of QS states that allows us to better understand the
relationship between the formulation of the model and its formalization by a system
of differential equations.
                          μ        μ                              μ                   μ                 μ
     c0         c1            c2             cr         cr1               cj      c j r      c j r 1   
           λ         2λ                   rλ         (r 1)λ                jλ        ( j  r)λ ( j  r 1)λ


Fig. 2. Graph-scheme of QS states, for the scalable CS operation model.

 Pk ( t ) denotes the probability that at the instant t the QS is in the state C k , k = 0,1,…
The system of differential equations has the form:
            P' ( t )   P ( t )  P ( t ),
            0               0         1
            '
            Pk ( t )  (   k )Pk ( t )  ( k  1 )Pk 1( t ), 0  k  r ,(*)           (1)
            '
            Pk ( t )  (   k )Pk ( t )  Pk  r ( t )  ( k  1 )Pk 1( t ), k  r .
To find the mathematical expectation and variance, we apply the method of generat-
ing functions. We transform the system of equations (1) so that in it the middle equa-
tion (*), with sliding parameter k, would be obtained from the last equation, at 0