<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Software Reliability Measurement Experiences Conducted in Alcatel Portugal</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Rui Lourenço</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alcatel Portugal</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <fpage>169</fpage>
      <lpage>174</lpage>
      <abstract>
<p>Software Reliability measurement is essential for examining the degree of quality or reliability of a developed software system. This paper describes the experiments conducted at Alcatel Portugal concerning the use of Software Reliability models. The results are general and can be used to monitor software reliability growth in order to attain a certain quality within schedule. A method based on the analysis of the trend exhibited by the collected data is used to improve the predictions. The results show that: (1) It is difficult for the models to reproduce the observed failure data when changes in trend do not follow the models' assumptions; (2) The Laplace trend test is a major tool for guiding the partitioning of failure data according to the assumptions of reliability growth models; (3) Prediction yields good results over a time period of a few months, showing that reliability modeling is a major tool for test/maintenance planning and follow-up.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Software reliability models are used to monitor, evaluate
and predict the quality of software systems. Quantitative
measures provided by these models are a key aid to the
decision-making process in our organization.</p>
<p>Since software development projects consume a lot of
resources, our goal in using software reliability models is
to optimize the use of these resources, in order to achieve
the best quality with lower costs and optimized schedules.</p>
<p>They enable us to estimate software reliability measures
such as:
. The number of failures that will be found during
some time period in the future
. How much time will be required to detect a certain
number of failures
. The mean time interval between failures, and the
resources (testing + correction) needed to achieve a
given quality level
. Comparative analyses: "how does my product
compare with others?"
With respect to the software life cycle, the phases
requiring careful reliability evaluation are:
. Test: To quantify the efficiency of a test set and
detect the saturation instant, i.e., the instant when the
probability of test failure detection becomes very
low.
. Qualification: To demonstrate quantitatively that the
software has reached a specified level of quality.
. Maintenance: To quantify the efficiency of
maintenance actions. At the start of operational
life, the software might be less reliable as the
operational environment changes. Maintenance
actions restore the reliability to a specified level.</p>
    </sec>
    <sec id="sec-2">
<title>Data requirements needed to implement these models</title>
<p>To implement these software reliability models, a process
needs to be set up in order to collect, verify and validate
the error data used as input.
Since we use a defect management tool to submit,
manage and track defects detected during the software
development life cycle phases mentioned above, it is
relatively easy for us to retrieve error data from
the software defects database.</p>
<p>This way, it is possible to collect historical and actual
error data from projects, in the form of time intervals
between failures and/or number of failures per time unit,
as software reliability models usually require.</p>
<p>Normally, the following data needs to be available before
we start using the models:
. The fault counts per time unit (where repeated
failures are not counted)
. The elapsed time between consecutive failures
. The length of each time unit used
. The effort spent on test per time unit</p>
    </sec>
    <sec id="sec-3">
<title>Modeling approach</title>
<p>The basic approach here is to model past failure data to
predict future behavior. This approach employs either the
observed number of failures discovered per time period,
or the observed times between failures of the software.
The models used therefore fall into two basic classes,
depending upon the type of data the model uses:</p>
      <sec id="sec-3-1">
        <title>1. Failuresper time period</title>
      </sec>
      <sec id="sec-3-2">
        <title>2. Times between failures</title>
<p>These classes are, however, not mutually disjoint. There
are models that can handle either data type. Moreover,
many of the models for one data type can still be applied
even if the user has data of the other type, by applying data
transformation procedures.</p>
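<p>A minimal sketch of one such transformation (one direction only): turning a sequence of inter-failure times into failure counts per time unit of a chosen length:</p>

```python
def interfailure_to_counts(gaps, unit_length):
    """Convert times between consecutive failures into failures per time unit.

    gaps: elapsed time between failure i-1 and failure i (same time scale
    as unit_length). Returns one count per time unit up to the last failure.
    """
    cumulative, failure_times = 0.0, []
    for g in gaps:
        cumulative += g
        failure_times.append(cumulative)
    counts = [0] * (int(failure_times[-1] // unit_length) + 1)
    for t in failure_times:
        counts[int(t // unit_length)] += 1
    return counts

print(interfailure_to_counts([1, 1, 3, 0.5], 2))  # → [1, 1, 2]
```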
<p>For example, one of the models we use with more success
is the S-shaped (SS) reliability growth model. For this
model, the software error detection process can be
described as an S-shaped growth curve reflecting the
initial learning curve at the beginning, as the test team
members become familiar with the software, followed by
growth and then a leveling off as the residual faults
become more difficult to uncover.</p>
<p>Like the Goel-Okumoto (GO) and the Rayleigh models,
which we also use very often, it can be classified as a
Poisson-type model (the number of failures per unit of
time is an independent Poisson random variable). Their
performance depends basically on 2 parameters:
. One that estimates the total number of software
failures to be eventually detected.
. Another that measures the efficiency with which
software failures are detected.</p>
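<p>A sketch of this two-parameter structure, using the mean value functions these model names usually denote (an assumption on our part: `a` is the eventual failure total and `b` the detection efficiency):</p>

```python
import math

def m_go(t, a, b):
    """Goel-Okumoto NHPP mean value function: expected cumulative failures by time t."""
    return a * (1.0 - math.exp(-b * t))

def m_ss(t, a, b):
    """Delayed S-shaped mean value function: the extra (1 + b*t) factor
    produces the initial learning-curve lag before detections ramp up."""
    return a * (1.0 - (1.0 + b * t) * math.exp(-b * t))

# With the same parameters, the S-shaped curve lags behind GO early on:
a, b = 100.0, 0.1
print(round(m_go(10, a, b), 1), round(m_ss(10, a, b), 1))
```

<p>Both curves start at zero and saturate at a, which is why the parameter pair can be read directly as "how many failures in total" and "how fast we find them".</p>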
<p>In order to estimate the models' parameters, we use the
tool CASRE (Computer-Aided Software Reliability Estimation).
This is a PC-based tool that was developed in 1993 by the
Jet Propulsion Laboratory for the U.S. Air Force.</p>
      </sec>
    </sec>
    <sec id="sec-4">
<title>Model assumptions</title>
      <p>
The modeling approach described here is primarily
applicable from the testing phase onward. The software
must have matured to the point that extensive changes are
not being routinely made. The models cannot perform
credibly if the software is changing so fast that data
gathered on one day is not comparable with data
gathered on another day. Different approaches and
models need to be considered if that is the case.
Another important issue of the modeling procedure is
that we need to know the inflection points, i.e., the points
in time when the software failures stop growing and start
to decrease. Reliability growth models cannot follow
these trend variations, thus our approach consists of
partitioning the data into stages prior to applying
the models. Inflection points are the boundaries between
these stages. A simple way to identify inflection points is
by performing trend tests, such as the Laplace trend test
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
<p>The use of trend tests is particularly important for models
such as the S-shaped, for which predictions can only be
accurate as long as the observed data meet the model
assumption of reliability decay prior to reliability growth.
The model (S-shaped) cannot predict future reliability
decay, so when this phenomenon occurs, a new
analysis is needed and the model must be applied from the
time period presenting reliability decay.</p>
<p>However, this is not the only way of looking at the
problem. Assuming that the error detection rate in
software testing is proportional to the current error
content, and that the proportionality depends on the current test
effort at an arbitrary testing time, a plausible software
reliability growth model based on a Non-Homogeneous
Poisson Process has also been used.</p>
<p>How to obtain predictions of future reliability:
In a predictive situation, statements have to be made
regarding the future reliability of software, and we can
only make use of the information available at that time. A
trend test carried out on the available data helps choose
the reliability growth model(s) to be applied and the
subset of data to which this (or these) model(s) will be
applied.</p>
<p>As mentioned before, the models are applied as long as
the environmental conditions remain essentially
unchanged (no changes in the testing strategy, no specification
changes, no new system installation...).</p>
<p>In fact, even in these situations, a reliability decrease may be
noticed. Initially, one can consider that it is due to a local
random fluctuation and that reliability will increase
sometime in the near future. In this case predictions are
still made without partitioning the data. If reliability keeps
decreasing, one has to find out why, and new predictions
may be made by partitioning the data into subsets according
to the new trend displayed by the data.</p>
<p>If significant changes in the development or operational
conditions take place, great care is needed, since reliability
trend changes may result, leading to erroneous
predictions. New trend tests have to be carried out.
If there is insufficient evidence that a different phase in
the program's reliability evolution has been reached,
application of reliability growth models can be continued.
If there is an obvious reliability decrease, application of
reliability growth models has to be stopped until a new
reliability growth period is reached again. Then, the
observed failure data has to be partitioned according to
the new trend.</p>
      <sec id="sec-4-1">
<title>Number of models to be applied</title>
<p>With respect to the number of models to be applied,
previous studies indicated that there are no "universally
best" models. This suggests that we should try several models
and examine the quality of the predictions obtained from
each of them, and that even in doing so, we cannot
guarantee obtaining good predictions.</p>
<p>During the development phases of a running project, it is
not always possible to apply several models, because of
lack of time, experience, and analytical and practical
tools. Usually people only apply one, two or three models
to their data. Analysis of the collected data and of the
environmental conditions helps us understand the
evolution of software reliability, and data partitioning into
subsets helps us improve the quality of the predictions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
<title>Model calibration and application</title>
<p>The models may be calibrated either after each new
observed data point (step-by-step) or periodically after
observation of a given number of failures, say y
(y-step-ahead). Step-by-step prediction seems more interesting.
However, one needs to have a good data collection
process set up to implement this procedure, since data
might not always be available immediately. In operational
life, longer inter-failure times allow step-by-step
predictions.</p>
<p>Since we have a database with error data from running
projects in our organization (the defects are collected
from the test phase onwards), we have a formal procedure
to regularly retrieve, analyse and verify this data.
We then use a periodical approach to make predictions,
which can be summarized as follows:
. Every week, we retrieve error data from the projects
whose software reliability we are interested in evaluating.
. We analyze and validate this data and look for
possible trends, in order to select the best data set that
could be used for making predictions.
. If the models' assumptions are met, we apply the models,
validate them and analyze the results they provide.
. Then we collect feedback from people involved in
the projects and, if necessary, take actions that help in
improving the products' reliability.</p>
      <sec id="sec-5-1">
        <title>Laplace trend test</title>
        <p>
          The Laplace trend test [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is used to determine the
software reliability trend using failure data relating to the
software:
. Time interval between failures, or
. Number of failures per time unit.
        </p>
<p>This test calculates an indicator u(n), expressed according
to the data (time interval between failures or number of
failures per time unit). A negative u(n) suggests an overall
increase in reliability between data item 1 and data item n.
A positive u(n) suggests an overall decrease in reliability
between data items 1 and n. Furthermore, if we notice a local
increase (decrease) in u(n), then we have a period of local
reliability decrease (growth).</p>
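<p>For failure counts per time unit, the indicator takes the form sketched below (our reading of the grouped-data Laplace factor described in [3]; an analogous expression exists for inter-failure times):</p>

```python
import math

def laplace_factor(counts):
    """Laplace trend factor u(k) for failure counts per time unit.

    counts[i] is the number of failures observed in time unit i+1.
    u < 0 suggests overall reliability growth over units 1..k,
    u > 0 suggests an overall reliability decrease.
    """
    k, total = len(counts), sum(counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two time units and one failure")
    # Mean index of the observed failures, compared with the mid-point
    # expected under a trend-free (homogeneous Poisson) process.
    weighted = sum(i * n for i, n in enumerate(counts))  # (i-1)*n(i), 0-based
    return (weighted / total - (k - 1) / 2.0) / math.sqrt((k * k - 1) / (12.0 * total))

print(laplace_factor([10, 8, 6, 4, 2]))  # negative: reliability growth
print(laplace_factor([2, 4, 6, 8, 10]))  # positive: reliability decrease
```

<p>Plotting u(n) as n grows is what produces the trend graphic used later to pick the data partitions: a local maximum of the curve marks the boundary between a reliability-decrease period and a reliability-growth period.</p>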
<p>The Laplace trend test is straightforward and much faster
to use than the models. The reliability study can be stopped at
this stage if it is believed that the information obtained
has, indeed, answered the proposed questions. Of course,
the obtained information is restricted to:
. Increase in reliability,
. Decrease in reliability,
. Stable reliability.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Case study</title>
      <p>We are going to apply the previously described
methodology to the software validation phase of one of
the software projects currently in the maintenance phase
in our company.</p>
<p>The project in question is a large telecom network
management system, with more than 350 000 source lines
of code. The volume and complexity of this software
system make it difficult, if not impossible, to eliminate all
software defects prior to its operational phase. Our aim
was to evaluate quantitatively some operational quality
factors, particularly software reliability, before the
software package started its operational life.</p>
<p>The software validation phase for this project is a 4-step
testing process: 1) integration test, 2) system test, 3)
qualification test and 4) acceptance test. The first 3 steps
correspond to the test phases usually defined for the
software life cycle. Acceptance test consists of testing the
entire software system in a real environment, which
approaches the normal operating environment. It uses a
system configuration (hardware and software) that has
reached a sufficient level of quality after completing the
first 3 test phases described above.
After the validation phase started, software errors
detected by the test team were submitted to the defects
database.</p>
<p>Failure reports that had at least one of the following
characteristics were rejected:
. Failures not due to software, but to data, documentation
or hardware
. Reports related to an improvement request
. Results in accordance with specifications
. Reported failures already accounted for
In order to collect the test effort spent per time unit on a
given project, we used data existing in another database
specially created to collect manpower figures.</p>
<p>Our goal was to evaluate:
. The number of failures that still remained in the
software after half of the planned test time was
completed
. The time-point when 95% of all software failures
existing in the software (as forecast by the models)
were found
. The mean time between failures achieved by the end
of system, qualification and acceptance tests
. The amount of test effort (testing + correction) still
needed to achieve the target of 95% of software
defects found.</p>
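<p>For the second goal, the Goel-Okumoto form gives a closed-form answer once the model is fitted. A sketch, assuming a fitted detection-efficiency parameter b in the same time units as the data:</p>

```python
import math

def go_time_to_fraction(b, fraction=0.95):
    """Time at which the Goel-Okumoto model a*(1 - exp(-b*t)) predicts
    that a given fraction of all eventual failures has been found.
    Solving 1 - exp(-b*t) = fraction gives t = -ln(1 - fraction) / b,
    independent of the failure total a.
    """
    return -math.log(1.0 - fraction) / b

# E.g. with a hypothetical b = 0.1 per week:
print(round(go_time_to_fraction(0.1), 1), "weeks to reach 95% of forecast failures")
```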
<p>When we first decided to apply these models, we were
halfway through the system test phase. At that time
we were interested in determining the number of defects
remaining in our application so we could reevaluate our
test strategy. The first approach consisted in considering
the entire set of software failures collected up to that time
to model the software reliability.</p>
<p>To meet this goal, we selected a set of models that used
failures per time period as input. The S-shaped (SS)
and the Brooks/Motley (BM) models were chosen,
independently of the severity (critical, major and minor)
of the observed failures.</p>
<p>Figure 1 shows that the models had difficulty in
modelling the entire failure process.</p>
<p>Despite the fitting and adjustment problems observed, we
can notice two different behaviours in the models'
predictions. The SS model presents a more
"optimistic" vision of the failure process than the BM.
These differences are often observed, and to identify
which model was trustworthy for predictions, some expert
judgement was needed, since the validation statistics, namely
the residue and the Chi-squared statistics, were not
enough to help us decide.</p>
<p>The following table summarizes the models' results.
Notice that the total number of failures predicted by the
BM model was extraordinarily big. This didn't mean that
we didn't consider this model's prediction. Instead of using
its asymptotic measures, we only considered the
predictions for the next 20 time units, which proved
more accurate.</p>
<p>After these results were analysed, the project team
agreed that the system test was at serious risk of being
delayed, so they had to rethink their test strategy.
Later on in the project, right after the qualification test
phase had started, the questions were whether the
software would be ready for shipping by the end of this
phase and, in case it wasn't, how much effort (testing +
corrections) was still required to achieve the target of
95% of all forecast defects found. It was an important
decision to be made and the conclusions could have
serious implications for the project schedule.</p>
<p>In order to improve the accuracy of the new predictions,
we decided to restrict the data set to be used by applying
the Laplace trend test.</p>
<p>As can be seen in figure 2, the Laplace trend test
graphic allowed us to observe periods of local reliability
growth and decrease in the data.
Considering the models' assumptions, the periods selected
for a new reliability evaluation were P2 for the SS model,
since we can notice that there is a decrease in reliability
followed by a reliability growth period, and P1 for the GO
and BM models, since there is only reliability growth
observed.</p>
<p>By running the models again we noticed that the deviation
was significantly reduced, thus improving the reliability
evaluation (see figure 3). Figure 4 and table 3 below
summarize the results obtained by using this model.
The validation statistics told us that the observed
residues were now lower, which gave us more confidence
in the models' results.</p>
<p>The following table summarizes the new results observed.
Based on these results, plus the expert judgement
provided by the project team, we considered the S-shaped
model values for reference (optimistic view).</p>
<p>However, there was still a question that needed an answer:
how much test effort still had to be spent in order for the
software to be 95% error free? To answer that
question, a different model with a different approach was
needed. Since test effort is clearly correlated with the
defects found during the test phases, we decided to use
test effort inputs in the S-shaped model instead of
calendar time.
To include the test effort data in the model, we had to
restrict the data range to the period from which we had
reliable effort data figures. By doing so, it was possible
for us to evaluate, with the same model, the remaining
failures in the software and the test effort needed to find a
given amount of defects in the software system.
We decided to apply this new model (S-shaped modified,
SSM) to the data set suggested by the Laplace trend test
(see figure 2), with a few adjustments in order for the test
effort to reflect the failure process more accurately.
As can be seen, the model fitting is quite accurate and
reasonably adapted to the failure data observed. These
results were a major help to the project team, who were
able to make more accurate decisions based on the results
provided by this model.</p>
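<p>A sketch of such an effort-driven variant (hypothetical parameter values; cumulative test effort in person-weeks simply replaces calendar time as the model's argument):</p>

```python
import math

def m_ss_effort(cum_effort, a, b):
    """S-shaped mean value function driven by cumulative test effort
    rather than calendar time: a is the eventual failure total, b the
    detection efficiency per unit of effort."""
    return a * (1.0 - (1.0 + b * cum_effort) * math.exp(-b * cum_effort))

def residual_failures(cum_effort, a, b):
    """Failures the model expects to remain after spending cum_effort."""
    return a - m_ss_effort(cum_effort, a, b)

# Hypothetical fitted parameters: 400 eventual failures, b = 0.02 per person-week.
a, b = 400.0, 0.02
effort_so_far = 150.0
print(round(residual_failures(effort_so_far, a, b), 1))
```

<p>With the same fitted curve, inverting it numerically answers the effort question directly: the extra effort needed is the value of cum_effort at which the predicted fraction found reaches 95%, minus the effort already spent.</p>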
<p>As mentioned before, expert judgement provided
by people from the projects plays an essential role in the
process of deciding which model results to select. Unless
we are pretty sure about the stability of our product, i.e.,
we know that we shouldn't expect too many defects in the
near future and the test environment is not supposed to
change much, we cannot rely significantly on these
results.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
<p>Software reliability models are an important aid to
test/maintenance planning and reliability evaluation.
However, it is well known that no particular model is
better suited for predicting software behaviour for all
software systems in all circumstances. Our work helps
the existing models give better predictions,
since they are applied to data displaying trends in
accordance with their assumptions.
With respect to the application of the proposed method to
the failure data of our network management project, 2
models, namely the S-shaped and Brooks/Motley, have
been analysed according to their predictive capabilities.
The results obtained show that:</p>
<p>. The trend test helps partition the observed failure data
according to the assumptions of reliability growth
models; it also indicates the segment of data from
which the occurrence of future failures can be
predicted more accurately;
. The prediction approach proposed for the validation
phases yields good results over a time period of a few
months, showing that reliability modeling constitutes
a major aid tool for test/maintenance planning and
follow-up.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Derriennic H., and Gall G.,
          <article-title>"Use of Failure-Intensity Models in the Software Validation Phase for Telecommunications"</article-title>,
          <source>IEEE Trans. on Reliability</source>, Vol.
          <volume>44</volume>, No.
          <issue>4</issue>, December
          <year>1995</year>, pp.
          <fpage>658</fpage>-<lpage>665</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Goel</surname> <given-names>A.L.</given-names></string-name>, and Okumoto K.,
          <article-title>"Time-dependent error detection rate model for software and other performance measures"</article-title>,
          <source>IEEE Trans. on Reliability</source>, Vol. R-<volume>28</volume>, No.
          <issue>3</issue>, August
          <year>1979</year>, pp.
          <fpage>206</fpage>-<lpage>212</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Kanoun</surname> <given-names>K.</given-names></string-name>, Martini M.R.B., and de Souza J.M.,
          <article-title>"A Method for Software Reliability Analysis and Prediction: Application to the TROPICO-R Switching System"</article-title>,
          <source>IEEE Trans. on Software Engineering</source>, Vol.
          <volume>17</volume>, No.
          <issue>4</issue>, April
          <year>1991</year>, pp.
          <fpage>334</fpage>-<lpage>344</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Lyu</surname> <given-names>M.R.</given-names></string-name>,
          <article-title>"Handbook of Software Reliability Engineering"</article-title>, published by IEEE Computer Society Press and McGraw-Hill Book Company.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Yamada</surname> <given-names>S.</given-names></string-name>, Hishitani J., and Osaki S.,
          <article-title>"Software Reliability Growth with a Weibull Test-Effort: A Model &amp; Application"</article-title>,
          <source>IEEE Trans. on Reliability</source>, Vol.
          <volume>42</volume>, No. 1, March
          <year>1993</year>, pp.
          <fpage>100</fpage>-<lpage>106</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>