=Paper= {{Paper |id=Vol-1614/paper_111 |storemode=property |title=Availability Model of Critical NPP I&C Systems Considering Software Reliability Indices |pdfUrl=https://ceur-ws.org/Vol-1614/paper_111.pdf |volume=Vol-1614 |authors=Bogdan Volochiy,Vitaliy Yakovyna,Oleksandr Mulyak |dblpUrl=https://dblp.org/rec/conf/icteri/VolochiyYM16 }} ==Availability Model of Critical NPP I&C Systems Considering Software Reliability Indices== https://ceur-ws.org/Vol-1614/paper_111.pdf
Availability Model of Critical NPP I&C Systems
   Considering Software Reliability Indices

        Bogdan Volochiy, Vitaliy Yakovyna, Oleksandr Mulyak

 National University Lviv Polytechnic, 12 Bandera St., 79013, Lviv, Ukraine

      bvolochiy@ukr.net, yakovyna@polynet.lviv.ua,
               mulyak.oleksandr@gmail.com



Abstract. Providing a high level of availability for Instrumentation and Control
(I&C) systems in Nuclear Power Plants (NPP) is highly important. The availability
of critical NPP I&C systems depends on the reliability behavior of both hardware
and software. High availability of I&C systems is ensured by the following
measures: structural redundancy with a choice of I&C system configuration (two
comparable sub-systems in the I&C system, majority voting "2oo3", "2oo4", etc.);
maintenance of the I&C system, which implies the repair (replacement) of
non-operational modules; N-version programming; software updates; and automatic
software restart after temporary interrupts caused by hardware faults. This paper
proposes a solution for the following case: the configuration of the
fault-tolerant I&C system with known hardware reliability indexes (failure rate
and temporary failure rate) is chosen, and the hardware maintenance strategy
(mean time to repair, number of repairs) and the methods to forecast the number
of software failures and the failure rate are specified. To solve this problem,
an availability model of the fault-tolerant I&C system was developed in the form
of a discrete-continuous stochastic system. We have estimated the influence of
the I&C system on the operational software parameters. Two configurations of I&C
systems are presented in this paper: two comparable sub-systems in the I&C
system, and an I&C system with majority voting "2oo3".


Keywords. Instrumentation and Control (I&C) System, Discrete-Continuous
Stochastic System, Reliability Behavior, Structural-Automated Model,
Markovian Chains, Software Reliability


Key Terms. Mathematical Modeling, Method, Software Systems




ICTERI 2016, Kyiv, Ukraine, June 21-24, 2016
Copyright © 2016 by the paper authors

1      Introduction

1.1    Motivation
Nowadays fault-tolerant computer-based systems (FTCSs) are developed as part of
weaponry, space, aviation, energy and other critical systems. One of the main
tasks is to meet the requirements for reliability, availability and functional
safety. Thus two types of possible risks arise: those related to the assessment
of risk, and those related to ensuring safety and security.
   Reliability (dependability) related design (RRD) [1-6] is a main part of the
development of complex fault-tolerant systems based on computers, software (SW)
and hardware (HW) components. The goal of RRD is to develop an FTCS structure
that tolerates HW physical failures and SW design faults and assures the required
values of reliability, availability and other dependability attributes. To make
software fault-tolerant, two or more versions of the software (developed by
different developers, using other languages and technologies, etc.) are used [7].
Therefore the use of structural redundancy for an FTCS with multiple software
versions is mandatory. When software is commissioned, some bugs (design faults)
remain in its code [8], and these lead to shutdowns of the FTCS. After the bugs
are detected, a software update is carried out. These factors influence the
availability of the FTCS and should be taken into account in the availability
indexes. During operation of the FTCS it is also possible that the HW will fail,
leading to failure of the software. To recover software operability, an automatic
restart procedure, which is time consuming, is performed. The efficiency of the
fault-tolerant hardware of the FTCS is ensured by maintenance and repair.
   Insufficiently adequate availability models of the FTCS lead either to
additional costs (when the indexes are underestimated) or to the risk of total
failure (when their values are overestimated), namely accidents, material damage
and even loss of life. Reliability and safety are assured by selecting and
developing fault-tolerant structures during RRD of the FTCS, and by identifying
and implementing maintenance strategies. Wrong decisions at this stage lead to
similar risks.

1.2    Related Works Analysis
Research papers that focus on RRD consider models of the FTCS. Most models are
developed primarily to identify the impact of one of the above-listed factors on
the reliability indexes; the remaining factors are overlooked. Papers [4, 5]
describe a reliability model of the FTCS that treats HW and SW failures
separately. Paper [6] offers a reliability model of a fault-tolerant system in
which HW and SW failures are differentiated and the software failure rate after
corrections to the program code is accounted for. Paper [8] describes a
reliability model of the FTCS that accounts for software updates. In paper [10]
the author outlines the relevance of estimating the reliability indexes of the
FTCS considering SW failures and recommends a method for their determination.
Such reliability models of the FTCS analyze its states under SW failure. That
work suggests that MTTF_system = MTTF_software. Thus, it is possible to conclude
that the author considers the HW of the FTCS to be absolutely reliable. This
assumption reduces the credibility of the result, especially when the reliability
of the HW is commensurable with the reliability of the SW. Paper [11] presents an
assessment of the reliability parameters of the FTCS by modeling its behavior
with Markovian chains that account for multiple software updates. Nevertheless,
no quantitative assessment of the reliability measures of the presented FTCS was
provided.
   In paper [12], the authors propose a model of the FTCS using Macro-Markovian
chains, in which the software failure rate, the duration of software
verification, and the failure rate and repair rate of the HW are accounted for.
The presented method of Macro-Markovian chain modeling [12, 13] is based on
logical analysis and cannot be used for elaborate configurations of the FTCS
because of their complexity and the high probability of mistakes. There is also a
discussion of the definition of requirements for operational verification of the
software of a space system, together with a research model of the object for
availability evaluation and scenario preference. It is noted that over the last
ten years, of the 27% of space device failures that were fatal or restricted the
use of the device, 6% were associated with HW failure and 21% with SW failure.
   Software updates are necessary because at the point of commissioning the SW
may contain a number of undetected faults, which can lead to critical failures of
the FTCS. The presence of HW faults relates to the complexity of the system and
to the failure to conduct comprehensive testing, as such testing is time
consuming and needs substantial financial support. To predict the number of SW
faults at the time of commissioning, various models can be used, for example the
Jelinski-Moranda model [14].
   The goal of the paper is to suggest a technique for developing a Markovian
chain for a critical NPP I&C system with different redundancy types (first of
all, structural and version redundancy) using the proposed formal procedure and
tool. The main idea is to decrease the risk of errors during development of a
Markovian chain (MC) for systems with a very large number of states (tens and
hundreds). We propose a special notation that supports step-by-step development
of the chain and the design of the final MC using software tools. The aim of this
research is to calculate the availability function of a critical NPP I&C system
with version-structural redundancy and double software updates.
   To achieve this goal we propose a newly designed reliability model of a
critical NPP I&C system. As an example, a specific critical NPP I&C system is
studied (Fig. 1). The following factors are accounted for in this model: overall
redundancy of the critical NPP I&C system and joint cold redundancy of the
modules of the main and diverse systems of the critical NPP I&C system; the
existence of three software versions; double SW update; physical faults.
   The structure of the paper is the following. The researched critical NPP I&C
system is described in the second section. An approach to developing a
mathematical model based on a Markovian chain and a detailed procedure for the
critical NPP I&C system are suggested in the third and fourth sections
respectively. Simulation results for the researched Markov model are analyzed in
section 4. The last section concludes the paper and presents some directions for
future research and development.

2      Researched Typical NPP Instrumentation and Control
       Systems Based on Digital Platform
Here we provide the structure (Fig. 1) of the researched typical NPP
Instrumentation and Control (I&C) system based on the digital platform [15]. This
platform consists of main and diverse systems, which are based on
Field-Programmable Gate Array (FPGA) chips [16]. The main and diverse systems are
each based on the FPGA safety controller (FSC) with three parallel channels and
“2-out-of-3” voting logic.




                     Fig. 1. Configuration of critical NPP I&C systems

   This architecture consists of two systems (main and diverse), each of which
consists of three channels connected in parallel with a majority voting
arrangement for the output signals, such that the output state is not changed if
only one channel gives a result that disagrees with the other two channels.
   The signals from the main and diverse systems are combined by an OR element.
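
   The voting arrangement described above can be illustrated with a minimal
sketch. It assumes that each channel output is a single boolean signal; the
function names vote_2oo3 and system_output are hypothetical and are not taken
from the platform documentation.

```python
def vote_2oo3(a, b, c):
    """Majority value of three boolean channel outputs (2-out-of-3)."""
    return (a and b) or (a and c) or (b and c)


def system_output(main_channels, diverse_channels):
    """OR-combination of the voted outputs of the main and diverse systems."""
    return vote_2oo3(*main_channels) or vote_2oo3(*diverse_channels)


# One disagreeing channel does not change the voted output of a system.
print(vote_2oo3(True, True, False))
# The diverse system raises the output even if the main system does not.
print(system_output((True, False, False), (False, True, True)))
```

In this sketch a single deviating channel is outvoted inside each system, and the
OR element passes on a signal raised by either system.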


3      Methods to Forecast the Number of Software Failures
       and the Software Failure Rate
   The papers [18, 19] describe methods of predicting the number of undetected SW
design faults. These methods are based on the SW reliability model with an index
of complexity [20, 21]. The SW reliability model [20] describes the behavior of
SW failures in the form of a non-homogeneous Poisson process. The cumulative
number of SW failures up to time t is calculated by formula (1):

        m(t) = \alpha \left( -\beta^{s} t^{s} e^{-\beta t} + s\,G(\beta t, s) \right),        (1)

where G(z, p) = \int_0^z t^{p-1} e^{-t}\,dt is the incomplete gamma function, α
is the coefficient that describes the total number of SW failures, β is the
factor that represents the rate of detection of SW failures, and s is the index
of SW complexity.
   Work [21] investigates and specifies the intervals of values of the SW
complexity index s. This has allowed a formal selection rule to be elaborated for
SW reliability models with different complexity indexes. The total number of SW
failures (and, consequently, the total number of SW design faults Ndef, on
condition that one SW failure is caused by one SW design fault) is determined by
the value of the function of the cumulative number of SW failures (1) as t → ∞:

        N_{def} = m(\infty) = \alpha\, s\, G(s),        (2)

where G(s) is the Gamma function.
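
   Formulas (1) and (2) can be evaluated numerically as in the following sketch.
It uses only the Python standard library and computes the incomplete gamma
function by its standard power series; the function names and the parameter
values in the example call are illustrative assumptions, not estimates from
[18-21].

```python
import math


def lower_incomplete_gamma(p, z, max_terms=500):
    """G(z, p) = integral from 0 to z of t**(p-1) * exp(-t) dt (power series)."""
    if z <= 0.0:
        return 0.0
    term = 1.0 / p
    total = term
    for k in range(1, max_terms):
        term *= z / (p + k)
        total += term
        if term < 1e-16 * total:
            break
    return z ** p * math.exp(-z) * total


def cumulative_failures(t, alpha, beta, s):
    """Formula (1): expected number of SW failures up to time t."""
    z = beta * t
    return alpha * (-(z ** s) * math.exp(-z) + s * lower_incomplete_gamma(s, z))


def total_faults(alpha, s):
    """Formula (2): N_def = alpha * s * Gamma(s)."""
    return alpha * s * math.gamma(s)


# Hypothetical parameter values, chosen only to exercise the functions.
alpha, beta, s = 10.0, 0.01, 2.5
print(cumulative_failures(300.0, alpha, beta, s), total_faults(alpha, s))
```

For long testing times the value returned by cumulative_failures approaches the
value returned by total_faults, which is the limit stated in formula (2).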
  To estimate the number of undetected SW design faults, the following steps [22]
should be performed:
─ carry out SW testing and represent the result as the number of SW failures in
  defined intervals. The range of the statistical sample is divided into
  l ≤ 5lg(n) equal intervals (where n is the total number of SW failures obtained
  during testing);
─ determine the point estimates of the SW reliability model parameters α, β and s
  using the method of maximum likelihood [20];
─ carry out the Kolmogorov-Smirnov goodness-of-fit test to check that the
  reliability model describes the experimental data;
─ use the point estimates of the SW reliability model parameters to determine,
  according to (2), the total number of SW design faults Ndef. The forecast for
  the number of undetected SW design faults is obtained by subtracting the number
  of detected SW design faults from the total number of SW design faults Ndef.
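
  The bookkeeping parts of the steps above (the interval rule and the final
subtraction) can be sketched as follows. The maximum likelihood estimation and
the Kolmogorov-Smirnov test are not reproduced here; the function names and the
sample values are hypothetical.

```python
import math


def max_intervals(n_failures):
    """Largest integer l that satisfies l <= 5 * lg(n)."""
    return math.floor(5 * math.log10(n_failures))


def bin_failures(failure_times, t_end, l):
    """Number of SW failures in each of l equal intervals covering (0, t_end]."""
    width = t_end / l
    counts = [0] * l
    for t in failure_times:
        counts[min(int(t / width), l - 1)] += 1
    return counts


def undetected_faults(n_def, n_detected):
    """Forecast of undetected faults: total N_def minus the detected faults."""
    return n_def - n_detected


# Made-up failure times (in runs) for illustration only.
times = [12, 40, 55, 130, 210, 340, 560, 800]
print(bin_failures(times, 1000, max_intervals(len(times))))
print(undetected_faults(40, 35))
```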
  Using regression analysis [18, 19], it is possible to:
─ increase the accuracy of the forecast of the total number of SW design faults using
  formula (2); or
─ decrease the time required to forecast the number of SW design faults.
   The number of SW faults depends on the duration of SW testing, which provides
information about SW failure behavior. The variable Ndef from formula (2) was
estimated using a nonlinear regression with the explanatory variable Ti, the time
of SW testing. The following equation (3) was used as the regression equation:

        N_{def}(T_i) = A \left\{ 1 - \exp\left[ -k \left( T_i - T_c \right)^{d} \right] \right\},        (3)

where A, k, d, Tc are the parameters of the regression equation.
  It is then possible to determine the adjusted forecast of the total number of
SW failures N*def from equation (3) on condition that the time of SW testing is
unlimited (Ti → ∞). Based on equation (3), the total number of SW failures is
then equal to the value of the regression parameter A.
   To estimate the adjusted forecast for the total number of undetected SW
failures N*def, the following steps should be performed:
─ during the SW testing procedure, calculate the point estimates of the SW
  reliability model parameters α, β and s [20] using the method of maximum
  likelihood on the interval (0; Ti), where Ti is the current moment of SW
  testing, and calculate Ndef(Ti) according to equation (2);
─ estimate the parameters of regression equation (3) using the least squares
  method for the set of Ndef(Ti), and take N*def = A;
─ determine the forecast number of undetected SW faults by subtracting the number
  of detected and fixed SW faults from N*def;
─ if SW testing needs to continue, go back to step 1 and add the new value to the
  set Ndef(Ti).
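
   The least squares step can be sketched in a simplified form. The sketch
assumes that d and Tc are held fixed (here d = 1, Tc = 0) so that only A and k
are estimated; this simplification, the grid search over k, the function names
and the synthetic data are all assumptions made for illustration and do not
reproduce the fitting procedure of [18, 19].

```python
import math


def n_def_model(T, A, k, d=1.0, Tc=0.0):
    """Regression equation (3): A * (1 - exp(-k * (T - Tc)**d))."""
    return A * (1.0 - math.exp(-k * (T - Tc) ** d))


def fit_A_k(T, N, d=1.0, Tc=0.0):
    """Least squares estimate of A and k with d and Tc held fixed.

    For a given k the model is linear in A, so the best A has a closed form;
    k is then taken from a logarithmic grid by minimum squared error.
    """
    best = None
    for e in range(-300, 1):                  # k from 1e-6 to 1
        k = 10 ** (e / 50.0)
        g = [n_def_model(t, 1.0, k, d, Tc) for t in T]
        gg = sum(x * x for x in g)
        if gg == 0.0:
            continue
        A = sum(n * x for n, x in zip(N, g)) / gg
        sse = sum((n - A * x) ** 2 for n, x in zip(N, g))
        if best is None or sse < best[0]:
            best = (sse, A, k)
    return best[1], best[2]


# Synthetic data generated from the model itself (not measured values).
T_runs = [100 * i for i in range(1, 13)]
N_obs = [n_def_model(t, 40.0, 0.004) for t in T_runs]
A_hat, k_hat = fit_A_k(T_runs, N_obs)
detected = 35                                 # hypothetical number of fixed faults
print(round(A_hat, 1), round(A_hat - detected, 1))
```

The fitted A_hat plays the role of N*def, and the last line subtracts the
detected and fixed faults from it, as in the third step of the list.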
   An example of dependence Ndef(Ti) [19] which was obtained during the SW testing
procedure is presented in figure 2.
[Figure 2: number of defects Ndef (axis range 20 to 40) versus testing duration
Ti in runs (axis range 0 to 1200)]

      Fig. 2. Dependence of the forecast number of SW faults Ndef (points),
    calculated according to equation (2), on the SW testing duration, and the
                      corresponding regression equation (line)

   In this case, using the method of forecasting the number of SW failures
together with equation (3) increases the accuracy of forecasting by 2-3%. This
method also decreases the time required to forecast the number of SW failures
[19].
   An advantage of the SW reliability model [20] is that it makes it possible to
estimate the SW failure rate based on SW testing results at the appropriate stage
of the life cycle. The SW failure rate depends on the time of SW testing (this
dependence is caused by correction of the SW faults at the appropriate stage of
the life cycle). The relationship takes the form (4):

        \lambda(t) = \frac{dm(t)}{dt} = \alpha\, \beta^{s+1} t^{s} e^{-\beta t}.        (4)
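
   As a consistency check (a short sketch added for illustration, using only the
definitions given with formulas (1), (2) and (4)), integrating the failure rate
with the substitution u = βτ and then integrating by parts recovers the
cumulative number of failures and its limit:

```latex
\begin{align*}
m(t) &= \int_0^{t} \lambda(\tau)\,d\tau
      = \alpha \int_0^{t} \beta^{s+1} \tau^{s} e^{-\beta\tau}\,d\tau
      = \alpha \int_0^{\beta t} u^{s} e^{-u}\,du
      = \alpha\, G(\beta t,\, s+1), \\
G(z,\, s+1) &= -z^{s} e^{-z} + s\,G(z,\, s)
  \quad\Longrightarrow\quad
  m(t) = \alpha\left( -\beta^{s} t^{s} e^{-\beta t} + s\,G(\beta t,\, s) \right), \\
\lim_{t \to \infty} m(t) &= \alpha\, s\, G(s) = N_{def}.
\end{align*}
```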
   Using equation (4), the point estimates of the model parameters and the
duration of SW testing, it is possible to calculate the SW failure rate λSW,
which is constant in time. The value of λSW is needed to estimate the
availability of the NPP I&C system by Markovian analysis.
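
   A minimal sketch of this calculation is given below. It assumes that the
constant rate is obtained by evaluating formula (4) at the moment when SW testing
ends; the function name and all numerical values are hypothetical.

```python
import math


def failure_rate(t, alpha, beta, s):
    """Formula (4): lambda(t) = alpha * beta**(s + 1) * t**s * exp(-beta * t)."""
    return alpha * beta ** (s + 1) * t ** s * math.exp(-beta * t)


# Hypothetical point estimates and testing duration (not values from the paper).
alpha, beta, s = 40.0, 0.005, 1.0
T_test = 1200.0

# Rate at the end of testing, used afterwards as a constant in the Markov model.
lambda_sw = failure_rate(T_test, alpha, beta, s)
print(lambda_sw)
```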
   The credibility of the estimate of the number of undetected SW faults [23, 24]
is supported by forecasting the number of SW failures (and, as a result, SW
faults) using artificial neural networks (NN) with radial basis functions (RBF).
The RBF NN is a nonparametric model of SW reliability behavior that does not
require a priori knowledge or assumptions about the behavior of SW failures. In
this research, input data about SW failures were presented in the form of a
cumulative time series. The cumulative time series is used to train the RBF
neural network and to forecast the number of SW failures for subsequent points of
the time series.
   The best results of forecasting SW failures are obtained using an RBF NN with
an inverse multiquadric function (10 neurons in the input layer and 30 neurons in
the hidden layer) [24]. In this configuration, the mean square error of
approximation is 1.0%, and the coefficient of determination between the forecast
and control series is 0.9965. Using a Gaussian function (15 neurons in the input
layer and 10 neurons in the hidden layer) reduces the training time of the neural
network by a factor of 3-6, although the accuracy of forecasting decreases by
1.7% [23, 24].
   As a result of the analysis of different SW systems, a configuration of the
RBF neural network was determined that can be used for time series forecasting of
a homogeneous failure process represented by a cumulative time series.
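
   The two basis functions named above and the preparation of training samples
from a cumulative time series can be sketched as follows. The sketch assumes the
common textbook forms of the Gaussian and inverse multiquadric functions with a
shape parameter eps, and it assumes that the input neurons receive a sliding
window of previous values of the series; neither detail is stated in the text,
and the training of the network itself is not shown.

```python
import math


def gaussian(r, eps=1.0):
    """Gaussian basis function exp(-(eps * r)**2)."""
    return math.exp(-(eps * r) ** 2)


def inverse_multiquadric(r, eps=1.0):
    """Inverse multiquadric basis function 1 / sqrt(1 + (eps * r)**2)."""
    return 1.0 / math.sqrt(1.0 + (eps * r) ** 2)


def sliding_windows(series, width):
    """Pairs (window of `width` previous values, next value) for training."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]


# Made-up cumulative failure counts for illustration only.
cumulative = [3, 5, 8, 9, 12, 14, 15, 17, 18, 20, 21, 23]
samples = sliding_windows(cumulative, 10)
print(len(samples), samples[0][1], inverse_multiquadric(1.0), gaussian(1.0))
```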
   Figure 3 presents an example of forecasting the total number of SW failures of
the Chromium web browser using the RBF neural network with the parameters listed
above.

[Figure 3: software faults (axis range 100 to 160) versus time (axis range 100 to
150); series: Experiment, Prediction]

     Fig. 3. An example of forecasting the total number of SW failures of the
              Chromium web browser using the RBF neural network

   This part of the paper outlines the estimation of the number of undetected SW
faults using two methods, based on regression analysis and on neural networks.
Using both makes the estimate of the number of undetected SW faults more reliable
and ensures that the requirements of the standard [25] are satisfied. The result
is considered acceptable when the number of SW faults calculated by the two
methods is equal to or less than the requirement of the standard.


4      Markov’s Model for Critical NPP I&C Systems
       with Software Updates
The method of automated development of the Markovian chain of the researched
critical NPP I&C systems is described in the works [9, 26]. It involves a
formalized representation of the object of study as a “structural-automated
model”. To develop this availability model of the critical NPP I&C systems one
needs to perform the following tasks: develop a verbal description of the
research object (Fig. 1); define the basic events; define the components of the
state vector, which describe the state of the system at a random time; define the
parameters of the object of research that should be in the model; and build the
tree of rules for modifying the components of the state vector.

4.1    The Procedures to Describe Behavior of the Critical NPP I&C Systems
The critical NPP I&C systems behavior is described by the following procedures:
─ Procedure 1. Detection of a failure in the critical NPP I&C systems (hardware
  failure, software failure). A failure can occur in the Main system (MS) and in
  the Diverse system (DS).
─ Procedure 2. Detection of a failure in the MS or in the DS of the critical NPP
  I&C systems.
─ Procedure 3. Connection of a module from cold standby to the faulty system.
─ Procedure 4. Loading the software on the module connected from cold standby to
  the faulty system.
─ Procedure 5. Software updating.
─ Procedure 6. Repair (replacement) of the HW of the faulty system.

4.2    A Set of the Events for the Critical NPP I&C Systems
According to the described procedures, which determine the behavior of the
critical NPP I&C systems, a list of events is composed. Events are presented in
pairs corresponding to the start and the end of the time interval needed to
perform each procedure. From this list of events, the basic events for the
“structural-automated model” are selected [9].
   As a result of the analysis, seven basic events were determined: Event 1 -
“Hardware failure of the MS module”; Event 2 - “Software failure of the MS
module”; Event 3 - “Hardware failure of the DS module”; Event 4 - “Software
failure of the DS module”; Event 5 - “Completing of the module switching from
cold standby to non-operational systems”; Event 6 - “Completing of the software
updates procedure”; Event 7 - “Completing of the procedure of the hardware
repair”.

4.3     Components of Vector States for the Critical NPP I&C Systems
The components of the state vector describe the state of the system at a random
time. To describe the state of the system, eleven components are used:
─ V1 – the current number of modules in the MS (the initial value of V1 is n);
─ V2 – the current number of modules in the DS (the initial value of V2 is k);
─ V3 – the current number of modules in cold standby (the initial value of V3 is
  mc);
─ V4 – the software version operated by the MS (V4=0 – first version, V4=1 –
  second version, V4=2 – third version);
─ V5 – the software version operated by the DS (V5=0 – first version, V5=1 –
  second version, V5=2 – third version);
─ V6 – the SW faults in the MS;
─ V7 – the SW faults in the DS;
─ V8 – the SW failure in the MS;
─ V9 – the SW failure in the DS;
─ V10 – the number of modules that are non-operational due to HW failure.

4.4     The Parameters of the Critical NPP I&C Systems Markov’s Model
To develop the Markov model of the critical NPP I&C systems, the relevant
parameters of its composition and separate components should be set, in
particular: n – number of modules that are part of the MS; k – number of modules
that are part of the DS; mc – number of modules in cold standby; λhw – the
failure rate of the hardware that is in the MS or DS and in hot standby; λsw11,
λsw12 – the failure rates of the first and second software versions; Tup1, Tup2 –
mean times of the first and second software updates; Tswitch – mean time of
connecting a module from standby; Trep – mean time of hardware repair.

4.5     Structural-Automated Model of the Critical NPP I&C System for the
        Automated Development of the Markovian Chain with Software Updates
According to the technology of modeling discrete-continuous stochastic systems
[9], the structural-automated model of the critical NPP I&C systems for automated
development of the Markovian chains is built from the defined events, the
components of the state vector and the parameters that describe the critical NPP
I&C systems. The model is presented in Table 1, which lists, for each event, the
conditions, the intensity formula and the rule for modifying the state vector:

Table 1. Structural-Automated Model of the critical NPP I&C systems for the
automated development of the Markovian chains

Terms and conditions               | Formula used for the    | Rule of modification of the
                                   | intensity of the events | state vector components
-----------------------------------+-------------------------+----------------------------
Event 1. Hardware failure of the MS module
(V1>=(n-1)) AND (V6=0)             | V1·λhw                  | V1:=V1-1; V8:=V8+1
Event 2. Software failure of the MS module
(V1>=(n-1)) AND (V4=0) AND (V6=0)  | V1·λsw11                | V1:=V1-1; V4:=0; V6:=1
(V1>=(n-1)) AND (V4=1) AND (V6=0)  | V1·λsw12                | V1:=V1-1; V4:=1; V6:=1
Event 3. Hardware failure of the DS module
(V2>=(k-1)) AND (V7=0)             | V2·λhw                  | V2:=V2-1; V8:=V8+1
Event 4. Software failure of the DS module
(V2>=(k-1)) AND (V5=0) AND (V7=0)  | V2·λsw11                | V2:=V2-1; V5:=0; V7:=1
(V2>=(k-1)) AND (V5=1) AND (V7=0)  | V2·λsw12                | V2:=V2-1; V5:=1; V7:=1
Event 5. Completing of the module switching procedure from cold standby to
non-operational systems
(V1<(n-1)) AND (V3>0) AND (V8>0)   | 1/Tswitch               | V1:=V1+1; V3:=V3-1
(V2<(n-1)) AND (V3>0) AND (V8>0)   | 1/Tswitch               | V2:=V2+1; V3:=V3-1
               Event 6. Completing of the software updates procedure
    (V10)
    (V1=n) AND (V20)
    (V10)

  The number of software updates can also be changed. To do this, the components
V4 and V5, which are responsible for the number of updates, must be changed in
Event 6. For example, if there are three software updates, the entry for this
event will be as follows:

        (V1