=Paper=
{{Paper
|id=Vol-1614/paper_111
|storemode=property
|title=Availability Model of Critical NPP I&C Systems Considering Software Reliability Indices
|pdfUrl=https://ceur-ws.org/Vol-1614/paper_111.pdf
|volume=Vol-1614
|authors=Bogdan Volochiy,Vitaliy Yakovyna,Oleksandr Mulyak
|dblpUrl=https://dblp.org/rec/conf/icteri/VolochiyYM16
}}
==Availability Model of Critical NPP I&C Systems Considering Software Reliability Indices==
Bogdan Volochiy, Vitaliy Yakovyna, Oleksandr Mulyak

National University Lviv Polytechnic, 12 Bandera St., 79013, Lviv, Ukraine

bvolochiy@ukr.net, yakovyna@polynet.lviv.ua, mulyak.oleksandr@gmail.com

Abstract. Providing a high level of availability for Instrumentation and Control (I&C) systems in Nuclear Power Plants (NPP) is highly important. The availability of critical NPP I&C systems depends on the reliability behavior of both hardware and software. High availability of I&C systems is ensured by the following measures: structural redundancy with a choice of I&C system configuration (two comparable sub-systems in the I&C system, majority voting "2oo3", "2oo4", etc.); maintenance of the I&C system, which implies the repair (replacement) of non-operational modules; N-version programming; software updates; and automatic software restart after temporary interrupts caused by hardware faults. This paper proposes a solution for the following case: the configuration of the fault-tolerant I&C system with known hardware reliability indexes (failure rate and temporary failure rate) is chosen, and the hardware maintenance strategy (mean time to repair, number of repairs) and the methods to forecast the number of software failures and the failure rate are specified. To solve this problem, an availability model of the fault-tolerant I&C system was developed in the form of a discrete-continuous stochastic system. We have estimated the influence of the I&C system on the operational software parameters. Two configurations of I&C systems are presented in this paper: two comparable sub-systems in the I&C system, and an I&C system with majority voting "2oo3".

Keywords. Instrumentation and Control (I&C) System, Discrete-Continuous Stochastic System, Reliability Behavior, Structural-Automated Model, Markovian Chains, Software Reliability

Key Terms.
Mathematical Modeling, Method, Software Systems

ICTERI 2016, Kyiv, Ukraine, June 21-24, 2016. Copyright © 2016 by the paper authors

===1 Introduction===
====1.1 Motivation====
Nowadays fault-tolerant computer-based systems (FTCSs) are developed as part of weaponry, space, aviation, energy and other critical systems. One of the main tasks is to meet the requirements for reliability, availability and functional safety. Thus, two types of possible risks arise: those related to risk assessment, and those related to ensuring safety and security.

Reliability (dependability) related design (RRD) [1-6] is a main part of the development of complex fault-tolerant systems based on computers, software (SW) and hardware (HW) components. The goal of RRD is to develop an FTCS structure that tolerates HW physical failures and SW design faults and assures the required values of reliability, availability and other dependability attributes. To make software fault-tolerant, two or more versions of the software (developed by different developers, using different languages and technologies, etc.) are used [7]. Therefore the use of structural redundancy for an FTCS with multiple software versions is mandatory. When software is commissioned, some bugs (design faults) remain in its code [8], which leads to shutdowns of the FTCS. After the bugs are detected, a software update is carried out. These factors influence the availability of the FTCS and should be taken into account in the availability indexes. During operation of the FTCS it is also possible that the HW fails, leading to failure of the software. To recover software operability, an automatic restart procedure, which is time consuming, is performed. The efficiency of the fault-tolerant hardware of the FTCS is ensured by maintenance and repair.
An insufficiently adequate availability model of an FTCS leads either to additional costs (when the indexes are underestimated) or to the risk of total failure (when their values are overestimated), namely accidents, material damage and even loss of life. Reliability and safety are assured by selecting and developing fault-tolerant structures during RRD of the FTCS, and by identifying and implementing maintenance strategies. Wrong decisions at this stage lead to similar risks.

====1.2 Related Works Analysis====
Research papers that focus on RRD consider models of the FTCS. Most models are developed primarily to identify the impact of one of the above-listed factors on reliability indexes; the remaining factors are overlooked. Papers [4, 5] describe a reliability model of an FTCS that treats HW and SW failures separately. Paper [6] offers a reliability model of a fault-tolerant system in which HW and SW failures are differentiated and the software failure rate after corrections to the program code is accounted for. Paper [8] describes a reliability model of the FTCS that accounts for software updates. In paper [10] the author outlines the relevance of estimating the reliability indexes of an FTCS considering SW failure and recommends a method for their determination. Such reliability models analyze the behavior of the FTCS under SW failure only. That research assumes $MTTF_{system} = MTTF_{software}$. Thus, it is possible to conclude that the author considers the HW of the FTCS to be absolutely reliable. This assumption reduces the credibility of the result, especially when the reliability of the HW is commensurable with the reliability of the SW. Paper [11] presents an assessment of the reliability parameters of an FTCS by modeling its behavior with Markovian chains that account for multiple software updates.
Nevertheless, no quantitative assessment of the reliability measures of the presented FTCS was provided. In paper [12], the authors propose a model of an FTCS using Macro-Markovian chains that accounts for the software failure rate, the duration of software verification, and the failure rate and repair rate of HW. The presented method of Macro-Markovian chain modeling [12, 13] is based on logical analysis and cannot be used for complex configurations of FTCS because of their complexity and the high probability of mistakes. There is also a discussion of the definition of requirements for operational verification of the software of a space system, together with the research model of the object for availability evaluation and scenario preference. It is noted that over the last ten years, of the 27% of space device failures that were fatal or restricted the use of the device, 6% were associated with HW failure and 21% with SW failure.

Software updates are necessary because at the point of commissioning the SW may contain a number of undetected faults, which can lead to critical failures of the FTCS. The presence of HW faults relates to the complexity of the system and to the failure to conduct comprehensive testing, as such testing is time consuming and needs substantial financial support. Various models can be used to predict the number of SW faults at the time of commissioning, for example the Jelinski-Moranda model [14].

A goal of the paper is to suggest a technique for developing a Markovian chain for a critical NPP I&C system with different redundancy types (first of all, structural and version redundancy) using the proposed formal procedure and tool. The main idea is to decrease the risk of errors during development of the Markovian chain (MC) for systems with a very large number of states (tens and hundreds). We propose a special notation that supports step-by-step development of the chain and design of the final MC using software tools.
The aim of this research is to calculate the availability function of a critical NPP I&C system with version-structural redundancy and double software updates. To achieve this goal we propose a newly designed reliability model of the critical NPP I&C system. As an example, a specific critical NPP I&C system is studied (Fig. 1). The following factors are accounted for in this model: overall redundancy of the critical NPP I&C system and joint cold redundancy of the modules of the main and diverse systems; the existence of three software versions; double SW update; physical faults.

The paper is structured as follows. The researched critical NPP I&C system is described in the second section. An approach to developing a mathematical model based on a Markovian chain and a detailed procedure for the critical NPP I&C system are suggested in the third and fourth sections, respectively. Simulation results for the researched Markov model are analyzed in section 4. The last section concludes the paper and presents some directions for future research and development.

===2 Researched Typical NPP Instrumentation and Control Systems Based on Digital Platform===
Here we provide the structure (Fig. 1) of the researched typical NPP Instrumentation and Control (I&C) system based on a digital platform [15]. This platform consists of main and diverse systems, which are based on Field Programmable Gate Array (FPGA) chips [16]. The Main and Diverse systems are each based on an FPGA safety controller (FSC) with three parallel channels and "2-out-of-3" voting logic.

Fig. 1. Configuration of critical NPP I&C systems

This architecture consists of two systems (main and diverse), each of which consists of three channels connected in parallel with a majority voting arrangement for the output signals, such that the output state is not changed if only one channel gives a result that disagrees with the other two channels.
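The majority voting arrangement can be illustrated with a minimal sketch. The function name `vote_2oo3` and the use of Boolean channel outputs are assumptions made for illustration only; the text does not say how the FSC implements voting.

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return the value that at least two of the three channels agree on."""
    # Majority of three Booleans: true as soon as any pair of channels is true.
    return (a and b) or (a and c) or (b and c)


if __name__ == "__main__":
    # A single dissenting channel does not change the voted output.
    print(vote_2oo3(True, True, False))
    print(vote_2oo3(False, True, False))
```

With this logic one faulty channel is outvoted by the other two, which is the property the "2-out-of-3" configuration relies on.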
The signals from the Main and Diverse systems are compared by an OR element.

===3 Methods to Forecast the Number of Software Failures and the Software Failure Rate===
Papers [18, 19] describe a method of predicting the number of undetected SW design faults. This method is based on the SW reliability model with an index of complexity [20, 21]. The SW reliability model [20] describes the behavior of SW failures in the form of a non-homogeneous Poisson process. The cumulative number of SW failures up to time t is calculated by formula (1):

$$m(t) = -\alpha \beta^{s} t^{s} e^{-\beta t} + \alpha s \Gamma_{\beta t}(s), \qquad (1)$$

where $\Gamma_{z}(p) = \int_{0}^{z} t^{p-1} e^{-t}\,dt$ is the incomplete gamma function, α is the coefficient that describes the total number of SW failures, β is the factor that represents the rate of detection of SW failures, and s is the index of SW complexity.

Work [21] investigates and specifies the intervals of values of the SW complexity index s. This has allowed a formal selection rule to be elaborated for SW reliability models with different complexity indexes. The total number of SW failures (and, consequently, the total number of SW design faults $N_{def}$, on condition that one SW failure is caused by one SW design fault) is determined by the value of the cumulative number of SW failures (1) at t → ∞:

$$N_{def} = m(\infty) = \alpha s \Gamma(s), \qquad (2)$$

where Γ(s) is the Gamma function.

To estimate the number of undetected SW design faults, the following steps [22] should be performed:
─ carry out SW testing and represent the result as the number of SW failures in defined intervals.
The input range of the statistical sample is divided into l ≤ 5 lg(n) equal intervals (where n is the total number of SW failures obtained during testing);
─ determine the point estimates of the SW reliability model parameters α, β and s by the method of maximum likelihood [20];
─ carry out the Kolmogorov-Smirnov test to check how well the reliability model describes the experimental data;
─ using the point estimates of the SW reliability model parameters, determine the total number of SW design faults $N_{def}$ according to (2). The forecast of the number of undetected SW design faults is obtained by subtracting the number of detected SW design faults from the total number $N_{def}$.

Using regression analysis [18, 19], it is possible to:
─ increase the accuracy of the forecast of the total number of SW design faults using formula (2); or
─ decrease the time required to forecast the number of SW design faults.

The number of SW faults depends on the duration of SW testing, which provides information about SW failure behavior. The variable $N_{def}$ from formula (2) was estimated using a nonlinear regression with the explanatory variable $T_i$, the time of SW testing. The following equation (3) was used as the regression equation:

$$N_{def}(T_i) = A\left(1 - \exp\left(-\left(k\,(T_i - T_c)\right)^{d}\right)\right), \qquad (3)$$

where A, k, d, $T_c$ are the parameters of the regression equation.

It is then possible to determine the adjusted forecast of the total number of SW failures $N^{*}_{def}$ from equation (3) on condition that the time of SW testing is unlimited ($T_i \to \infty$). Based on equation (3), the total number of SW failures is then equal to the value of the regression parameter A.
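Formulas (1) and (2), the interval bound l ≤ 5 lg(n) and the subtraction step for undetected faults can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the function names, the midpoint-rule integration and the parameter values in the usage note are illustrative and not taken from the paper, and the integration is only reliable for s ≥ 1.

```python
import math


def lower_incomplete_gamma(p: float, z: float, steps: int = 20000) -> float:
    """Approximate Gamma_z(p) = integral_0^z t^(p-1) e^(-t) dt by the midpoint rule.

    Assumption: p >= 1, so the integrand has no singularity at t = 0.
    """
    h = z / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t ** (p - 1) * math.exp(-t)
    return total * h


def cumulative_failures(t: float, alpha: float, beta: float, s: float) -> float:
    """Formula (1): expected cumulative number of SW failures up to time t."""
    x = beta * t
    return -alpha * x ** s * math.exp(-x) + alpha * s * lower_incomplete_gamma(s, x)


def total_faults(alpha: float, s: float) -> float:
    """Formula (2): N_def = alpha * s * Gamma(s)."""
    return alpha * s * math.gamma(s)


def undetected_faults(alpha: float, s: float, detected: int) -> float:
    """Forecast of undetected faults: total N_def minus faults already detected."""
    return total_faults(alpha, s) - detected


def max_intervals(n: int) -> int:
    """Largest whole number of equal intervals allowed by l <= 5 lg(n)."""
    return math.floor(5 * math.log10(n))
```

For the purely illustrative values α = 10, β = 0.01 and s = 2, `total_faults(10, 2)` gives 20, and `cumulative_failures(t, 10, 0.01, 2)` approaches that value as t grows, which is the relationship between formulas (1) and (2).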
To estimate the adjusted forecast of the total number of undetected SW failures $N^{*}_{def}$, the following steps should be performed:
─ during the SW testing procedure, calculate the point estimates of the SW reliability model parameters α, β and s [20] by the method of maximum likelihood on the interval (0; $T_i$), where $T_i$ is the current moment of SW testing, and calculate $N_{def}(T_i)$ according to equation (2);
─ estimate the parameters of regression equation (3) by the least squares method for the set of $N_{def}(T_i)$ and take $N^{*}_{def} = A$;
─ determine the forecast number of undetected SW faults by subtracting the number of detected and fixed SW faults from $N^{*}_{def}$;
─ if SW testing needs to be continued, go back to step 1 and add the new value to the set $N_{def}(T_i)$.

An example of the dependence $N_{def}(T_i)$ [19] obtained during the SW testing procedure is presented in Figure 2.

Fig. 2. Dependence of the forecast number of SW faults $N_{def}$ (points), calculated according to equation (2), on the SW testing duration, and the corresponding regression equation (line). Axes: number of defects $N_{def}$ versus testing duration $T_i$ (runs).

In this case, using the method of forecasting the number of SW failures together with equation (3) increases the accuracy of forecasting by 2-3%. This method also decreases the time required to forecast the number of SW failures [19].

An advantage of the SW reliability model [20] is that it makes it possible to estimate the SW failure rate from SW testing results at the appropriate stage of the life cycle. The SW failure rate depends on the time of SW testing (this dependence is caused by the correction of SW faults at the appropriate stage of the life cycle).
The relationship takes the form (4):

$$\lambda(t) = \frac{dm(t)}{dt} = \alpha \beta^{s+1} t^{s} e^{-\beta t}. \qquad (4)$$

Using equation (4), the point estimates of the model parameters and the duration of SW testing, it is possible to calculate the SW failure rate $\lambda_{SW}$, which is constant in time. The value of $\lambda_{SW}$ is needed to estimate the availability of the NPP I&C system on the basis of Markovian analysis.

The credibility of the estimate of undetected SW faults [23, 24] is supported by forecasting the number of SW failures (and, as a result, SW faults) with artificial neural networks (NN) with radial basis functions (RBF). The RBF NN is a nonparametric model of SW reliability behavior that does not require a priori knowledge or assumptions about the behavior of SW failures. In this research, input data about SW failures were presented in the form of a cumulative time series. The cumulative time series is used to train the RBF neural network and to forecast the number of SW failures over subsequent time steps.

The best forecasting results are obtained with an RBF NN using an inverse multiquadric function (10 neurons in the input layer and 30 neurons in the hidden layer) [24]. In this configuration, the mean square error of approximation is 1.0%. The coefficient of determination between the forecast and control series is 0.9965. With a Gaussian function (15 neurons in the input layer and 10 neurons in the hidden layer), the training time of the neural network is reduced by a factor of 3-6, although the forecasting accuracy decreases by 1.7% [23, 24]. From the analysis of different SW systems, an RBF neural network configuration was determined that can be used to forecast time series of a homogeneous failure process represented by a cumulative time series. Figure 3 presents an example of forecasting the total number of SW failures of the Chromium web browser using the RBF neural network with the parameters listed above.
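The two basis functions named above, and the way an RBF network turns a window of the cumulative failure series into a forecast, can be sketched as follows. This is an illustrative sketch only: the shape parameter `eps`, the reading of "input neurons" as a sliding window over the series, and all function names are assumptions, and training of the centers and weights is not shown.

```python
import math


def gaussian(r: float, eps: float = 1.0) -> float:
    """Gaussian basis function exp(-(eps * r)^2)."""
    return math.exp(-((eps * r) ** 2))


def inverse_multiquadric(r: float, eps: float = 1.0) -> float:
    """Inverse multiquadric basis function 1 / sqrt(1 + (eps * r)^2)."""
    return 1.0 / math.sqrt(1.0 + (eps * r) ** 2)


def rbf_predict(window, centers, weights, bias, basis):
    """Forward pass: distances to hidden-layer centers, basis function, linear output."""
    output = bias
    for center, weight in zip(centers, weights):
        output += weight * basis(math.dist(window, center))
    return output


def sliding_windows(series, width):
    """Split a cumulative series into (input window, next value) training pairs."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]
```

Under these assumptions, a network with 10 input neurons would correspond to `width=10`, and the number of hidden neurons to the number of entries in `centers`.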
Fig. 3. An example of forecasting the total number of SW failures of the Chromium web browser using the RBF neural network. Series: experiment and prediction; axes: software faults versus time.

This part of the paper outlines the estimation of the number of undetected SW faults by two methods, based on regression analysis and on neural networks. Using both methods makes the estimate of the number of undetected SW faults reliable and ensures that the requirements of the standard [25] are satisfied. The result is considered acceptable when the number of SW faults calculated by the two methods is equal to or less than the standard's requirement.

===4 Markov's Model for Critical NPP I&C Systems with Software Updates===
The method of automated development of the Markovian chain for the researched critical NPP I&C systems is described in works [9, 26]. It involves a formalized representation of the object of study as a "structural-automated model". To develop this availability model of the critical NPP I&C systems, one needs to perform the following tasks: develop a verbal description of the research object (Fig. 1); define the basic events; define the components of the state vector, which describe the state of the system at an arbitrary moment of time; define the parameters of the research object that should be included in the model; and form the tree of rules for modifying the components of the state vector.

====4.1 The Procedures to Describe Behavior of the Critical NPP I&C Systems====
The behavior of the critical NPP I&C systems is described by the following procedures:
─ Procedure 1. Detection of a failure in the critical NPP I&C systems (hardware failure, software failure). A failure can occur in the Main system (MS) or the Diverse system (DS).
─ Procedure 2. Detection of a failure in the MS or in the DS of the critical NPP I&C systems.
─ Procedure 3. Connection of a module from cold standby to the faulty system.
─ Procedure 4. Loading the software onto the module connected from cold standby to the faulty system.
─ Procedure 5. Software updating.
─ Procedure 6. Repair (replacement) of the HW of the faulty systems.

====4.2 A Set of the Events for the Critical NPP I&C Systems====
According to the described procedures, which determine the behavior of the critical NPP I&C systems, a list of events is composed. Events are presented in pairs corresponding to the start and the end of the time interval in which each procedure is performed. From this list of events, the basic events for the "structural-automated model" are selected [9]. As a result of the analysis, seven basic events were determined:
─ Event 1 - "Hardware failure of the MS module";
─ Event 2 - "Software failure of the MS module";
─ Event 3 - "Hardware failure of the DS module";
─ Event 4 - "Software failure of the DS module";
─ Event 5 - "Completing of the module switching from cold standby to non-operational systems";
─ Event 6 - "Completing of the software updates procedure";
─ Event 7 - "Completing of the procedure of the hardware repair".

====4.3 Components of Vector States for the Critical NPP I&C Systems====
The components of the state vector describe the state of the system at an arbitrary moment of time. To describe the state of the system, eleven components are used:
─ V1 displays the current number of modules in the MS (the initial value of V1 is equal to n);
─ V2 displays the current number of modules in the DS (the initial value of V2 is equal to k);
─ V3 displays the current number of modules in cold standby (the initial value of V3 is equal to mc);
─ V4 displays which software version is run by the MS (V4=0: first version, V4=1: second version, V4=2: third version);
─ V5 displays which software version is run by the DS (V5=0: first version, V5=1: second version, V5=2: third version);
─ V6 displays the SW faults in the MS;
─ V7 displays the SW faults in the DS;
─ V8 displays the SW failure in the MS;
─ V9 displays the SW failure in the DS;
─ V10 displays the number of non-operational modules due to HW failure.
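The state vector components listed above can be held in an immutable record, which is convenient because each distinct vector is one state of the Markovian chain. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, and the zero initial values for V4 to V10 are assumed, since the text gives initial values only for V1, V2 and V3.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StateVector:
    v1: int       # current number of modules in the MS
    v2: int       # current number of modules in the DS
    v3: int       # current number of modules in cold standby
    v4: int = 0   # SW version run by the MS (0, 1 or 2)
    v5: int = 0   # SW version run by the DS (0, 1 or 2)
    v6: int = 0   # SW faults in the MS
    v7: int = 0   # SW faults in the DS
    v8: int = 0   # SW failure in the MS
    v9: int = 0   # SW failure in the DS
    v10: int = 0  # number of non-operational modules due to HW failure


def initial_state(n: int, k: int, mc: int) -> StateVector:
    """Initial state: V1 = n, V2 = k, V3 = mc; other components assumed zero."""
    return StateVector(v1=n, v2=k, v3=mc)


if __name__ == "__main__":
    start = initial_state(n=3, k=3, mc=1)
    # A modification rule yields a new state rather than mutating the old one.
    after_switch = replace(start, v3=start.v3 - 1)
    print(start, after_switch)
```

Because the record is frozen it is hashable, so states can be used directly as dictionary keys when the chain's transitions are enumerated.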
====4.4 The Parameters of the Critical NPP I&C Systems Markov's Model====
When developing the Markov model of the critical NPP I&C systems, its composition and individual components are described by the following parameters:
─ n: the number of modules that are part of the MS;
─ k: the number of modules that are part of the DS;
─ mc: the number of modules in cold standby;
─ λhw: the hardware failure rate of modules in the MS or DS and in hot standby;
─ λsw11, λsw12: the failure rates of the first and second software versions;
─ Tup1, Tup2: the mean times of the first and second software updates;
─ Tswitch: the mean time of connecting a module from standby;
─ Trep: the mean time of hardware repair.

====4.5 Structural-Automated Model of the Critical NPP I&C System for the Automated Development the Markovian Chain with Software Updates====
According to the technology of modeling discrete-continuous stochastic systems [9], the structural-automated model of the critical NPP I&C systems for automated development of the Markovian chain is built from the defined events, the components of the state vector and the parameters that describe the critical NPP I&C systems. The model is presented in Table 1.

{| class="wikitable"
|+ Table 1. Structural-Automated Model of the critical NPP I&C systems for the automated development of the Markovian chains
! Terms and conditions !! Formula used for the intensity of the events !! Rule of modification of the components of the state vector
|-
| colspan="3" | Event 1. Hardware failure of the MS module
|-
| (V1>=(n-1)) AND (V6=0) || V1·λhw || V1:=V1-1; V8:=V8+1
|-
| colspan="3" | Event 2. Software failure of the MS module
|-
| (V1>=(n-1)) AND (V4=0) AND (V6=0) || V1·λsw11 || V1:=V1-1; V4:=0; V6:=1
|-
| (V1>=(n-1)) AND (V4=1) AND (V6=0) || V1·λsw12 || V1:=V1-1; V4:=1; V6:=1
|-
| colspan="3" | Event 3. Hardware failure of the DS module
|-
| (V2>=(k-1)) AND (V7=0) || V2·λhw || V2:=V2-1; V8:=V8+1
|-
| colspan="3" | Event 4. Software failure of the DS module
|-
| (V2>=(k-1)) AND (V5=0) AND (V7=0) || V2·λsw11 || V2:=V2-1; V5:=0; V7:=1
|-
| (V2>=(k-1)) AND (V5=1) AND (V7=0) || V2·λsw12 || V2:=V2-1; V5:=1; V7:=1
|-
| colspan="3" | Event 5. Completing of the module switching procedure from cold standby to non-operational systems
|-
| (V1<(n-1)) AND (V3>0) AND (V8>0) || 1/Tswitch || V1:=V1+1; V3:=V3-1
|-
| (V2<(n-1)) AND (V3>0) AND (V8>0) || 1/Tswitch || V2:=V2+1; V3:=V3-1
|-
| colspan="3" | Event 6. Completing of the software updates procedure
|-
| colspan="3" | (V10) (V1=n) AND (V2 0) (V1 0)
|}

The number of software updates can also be changed. To do so, it is necessary to change the components V4 and V5 in Event 6, which are responsible for the number of updates. For example, if there are three software updates, the entry component of the event will be as follows: (V1