
Estimating the Software Size of Open-Source PHP-Based Systems Using Non-Linear Regression Analysis

Sergiy Prykhodko

Natalia Prykhodko

Lidiia Makarova

Department of Software of Automated Systems, Admiral Makarov National University of Shipbuilding, Ukraine, Mykolaiv, Heroes of

2018


The equation and the confidence and prediction intervals of multivariate non-linear regression for estimating the software size of open-source PHP-based systems are constructed on the basis of the Johnson multivariate normalizing transformation. The constructed equation is compared with the linear regression equation and with the non-linear regression equation based on the Johnson univariate transformation.

I. INTRODUCTION

Software size is one of the most important internal metrics of software. The information obtained from estimating the software size is useful for predicting the software development effort with models such as COCOMO II. The papers [1, 2] proposed linear regression equations for estimating the software size for several programming languages, such as VBA, PHP, Java and C++. The proposed equations are constructed by multiple linear regression analysis on the basis of metrics that can be measured from a class diagram. However, there are four basic assumptions that justify the use of linear regression models, one of which is normality of the error distribution, and this assumption is valid only in particular cases. This leads to the need to use non-linear regression equations, including for estimating the software size of open-source PHP-based systems.

A normalizing transformation is often a good way to build the equations, confidence and prediction intervals of multiple non-linear regressions [3-5]. According to [4], transformations are used for essentially four purposes, two of which are: first, to obtain approximate normality for the distribution of the error term (residuals); second, to transform the response and/or the predictor in such a way that the strength of the linear relationship between the new (normalized) variables is better than the linear relationship between the dependent and independent random variables. Well-known techniques for building the equations, confidence and prediction intervals of multivariate non-linear regressions are based on univariate normalizing transformations, which do not take into account the correlation between random variables when multivariate non-Gaussian data are normalized. This leads to the need to use multivariate normalizing transformations.

In this paper, we build the equation, confidence and prediction intervals of multivariate non-linear regression for estimating the software size of open-source PHP-based systems on the basis of the Johnson multivariate normalizing transformation (the Johnson normalizing translation) with the help of appropriate techniques proposed in [5].

II. THE TECHNIQUES

The techniques to build the equations, confidence and prediction intervals of non-linear regressions are based on the multiple non-linear regression analysis using the multivariate normalizing transformations. A multivariate normalizing transformation of the non-Gaussian random vector $P = \{Y, X_1, X_2, \ldots, X_k\}^T$ to the Gaussian random vector $T = \{Z_Y, Z_1, Z_2, \ldots, Z_k\}^T$ is given by

$$T = \psi(P) \tag{1}$$

and the inverse transformation for (1) by

$$P = \psi^{-1}(T). \tag{2}$$

The linear regression equation for normalized data according to (1) will have the form [4]

$$\hat{Z}_Y = \bar{Z}_Y + Z_X^{+}\hat{b}, \tag{3}$$

where $\hat{Z}_Y$ is the prediction result of the linear regression equation for values of the components of the vector $z_X = \{Z_1, Z_2, \ldots, Z_k\}$; $Z_X^{+}$ is the matrix of centered regressors that contains the values $Z_{1i} - \bar{Z}_1$, $Z_{2i} - \bar{Z}_2$, $\ldots$, $Z_{ki} - \bar{Z}_k$; $\hat{b}$ is the estimator for the vector of linear regression equation parameters, $b = \{b_1, b_2, \ldots, b_k\}^T$.

The non-linear regression equation will have the form

$$\hat{Y} = \psi_1^{-1}\left[\bar{Z}_Y + Z_X^{+}\hat{b}\right], \tag{4}$$

where $\hat{Y}$ is the prediction result of the non-linear regression equation.
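As a minimal sketch of how Eqs. (3) and (4) fit together, the following Python example normalizes toy data, fits the centered linear regression on the normalized values and maps the prediction back to the original scale. It assumes a single regressor (k = 1) and uses a component-wise logarithm as a stand-in for ψ. The function name `fit_normalized_regression`, the toy data and the choice of logarithm are illustrative assumptions; this is not the Johnson transformation used in this paper.

```python
import math


def fit_normalized_regression(xs, ys, psi=math.log, psi_inv=math.exp):
    """Fit the centered regression (3) on normalized data for k = 1 and
    return the slope together with a predictor on the original scale (4)."""
    z1 = [psi(x) for x in xs]  # normalized regressor
    zy = [psi(y) for y in ys]  # normalized response
    n = len(z1)
    z1_bar = sum(z1) / n
    zy_bar = sum(zy) / n
    # Least-squares slope for centered data.
    sxy = sum((a - z1_bar) * (b - zy_bar) for a, b in zip(z1, zy))
    sxx = sum((a - z1_bar) ** 2 for a in z1)
    b_hat = sxy / sxx

    def predict(x):
        z_hat = zy_bar + (psi(x) - z1_bar) * b_hat  # Eq. (3)
        return psi_inv(z_hat)                       # Eq. (4)

    return b_hat, predict


# Toy data (hypothetical): an exact power law, so the log-log fit is exact.
xs = [1.0, 2.0, 3.0, 5.0, 8.0]
ys = [2.0 * x ** 1.5 for x in xs]
b_hat, predict = fit_normalized_regression(xs, ys)
print(b_hat, predict(4.0))
```

Because the regression is linear only on the normalized scale, the resulting predictor is non-linear in the original variables, which is the point of Eq. (4).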

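The technique to build a non-linear regression equation is based on transformations (1) and (2) and Eq. (3). The technique to build a confidence interval of non-linear regression additionally uses a confidence interval of linear regression for normalized data

$$\hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ \frac{1}{N} + (z_X^{+})^T \left[ (Z_X^{+})^T Z_X^{+} \right]^{-1} (z_X^{+}) \right\}^{1/2}, \tag{5}$$

where $t_{\alpha/2,\nu}$ is a quantile of Student's t-distribution with $\nu$ degrees of freedom and $\alpha/2$ significance level; $(z_X^{+})^T$ is one of the rows of $Z_X^{+}$; $S_{Z_Y}^2 = \frac{1}{\nu} \sum_{i=1}^{N} (Z_{Y_i} - \hat{Z}_{Y_i})^2$, $\nu = N - k - 1$; $(Z_X^{+})^T Z_X^{+}$ is the $k \times k$ matrix

$$(Z_X^{+})^T Z_X^{+} = \begin{pmatrix} S_{Z_1 Z_1} & S_{Z_1 Z_2} & \cdots & S_{Z_1 Z_k} \\ S_{Z_1 Z_2} & S_{Z_2 Z_2} & \cdots & S_{Z_2 Z_k} \\ \vdots & \vdots & \ddots & \vdots \\ S_{Z_1 Z_k} & S_{Z_2 Z_k} & \cdots & S_{Z_k Z_k} \end{pmatrix},$$

where $S_{Z_q Z_r} = \sum_{i=1}^{N} [Z_{qi} - \bar{Z}_q][Z_{ri} - \bar{Z}_r]$, $q, r = 1, 2, \ldots, k$.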
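The half-width in (5) can be computed directly from the centered regressor matrix. The sketch below is a plain-Python illustration under stated assumptions: the helper names `solve` and `interval_half_width` are hypothetical, the t-quantile is supplied by the caller (for example from a table), and the optional `prediction` flag adds the extra 1 under the square root that distinguishes the prediction interval for normalized data introduced next from the confidence interval.

```python
import math


def solve(a, b):
    """Solve the linear system a @ u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * u[c] for c in range(r + 1, n))
        u[r] = (m[r][n] - s) / m[r][r]
    return u


def interval_half_width(zc, z_new, s_zy, t_quantile, prediction=False):
    """Half-width of interval (5); with prediction=True, 1 is added under the root.

    zc         -- N rows of centered regressors (the matrix Z_X^+)
    z_new      -- centered regressor vector z_X^+ at which to evaluate
    s_zy       -- residual standard deviation S_{Z_Y}
    t_quantile -- t_{alpha/2, nu}, supplied by the caller (assumption)
    """
    n, k = len(zc), len(z_new)
    # (Z_X^+)^T Z_X^+ : the k x k matrix of sums S_{Z_q Z_r}.
    ztz = [[sum(row[q] * row[r] for row in zc) for r in range(k)] for q in range(k)]
    # Quadratic form z^T (Z^T Z)^{-1} z, via a linear solve instead of an explicit inverse.
    quad = sum(zi * ui for zi, ui in zip(z_new, solve(ztz, z_new)))
    inside = 1.0 / n + quad + (1.0 if prediction else 0.0)
    return t_quantile * s_zy * math.sqrt(inside)


# Hypothetical centered data with N = 4 rows and k = 2 regressors.
zc = [[-1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [0.0, 0.0]]
ci = interval_half_width(zc, [1.0, 0.0], s_zy=1.0, t_quantile=2.0)
pi = interval_half_width(zc, [1.0, 0.0], s_zy=1.0, t_quantile=2.0, prediction=True)
print(ci, pi)
```

The bounds on the normalized scale are then the fitted value plus and minus this half-width; mapping both bounds through the inverse transformation gives the interval for the non-linear regression.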

The confidence interval for non-linear regression is built on the basis of the interval (5) and inverse transformation (2)

$$\psi_1^{-1}\left[ \hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ \frac{1}{N} + (z_X^{+})^T \left[ (Z_X^{+})^T Z_X^{+} \right]^{-1} (z_X^{+}) \right\}^{1/2} \right]. \tag{6}$$

The technique to build a prediction interval is based on multivariate transformation (1), the inverse transformation (2), linear regression equation for normalized data (3) and a prediction interval for normalized data

$$\hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + (z_X^{+})^T \left[ (Z_X^{+})^T Z_X^{+} \right]^{-1} (z_X^{+}) \right\}^{1/2}. \tag{7}$$

The prediction interval for non-linear regression is built on the basis of the interval (7) and inverse transformation (2)

$$\psi_1^{-1}\left[ \hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + (z_X^{+})^T \left[ (Z_X^{+})^T Z_X^{+} \right]^{-1} (z_X^{+}) \right\}^{1/2} \right]. \tag{8}$$

III. THE JOHNSON NORMALIZING TRANSLATION

For normalizing the multivariate non-Gaussian data, we use the Johnson translation system. The Johnson normalizing translation is given by

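$$Z = \gamma + \eta\, h\left[\lambda^{-1}(X - \varphi)\right] \sim N_m(0_m, \Sigma), \tag{9}$$

where $\Sigma$ is the covariance matrix; $m = k + 1$; $\gamma$, $\eta$, $\varphi$ and $\lambda$ are parameters of translation (9); $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_m)^T$; $\eta = \mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_m)$; $\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_m)^T$; $\lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; $h[(y_1, \ldots, y_m)] = \{h_1(y_1), \ldots, h_m(y_m)\}^T$; $h_i(\cdot)$ is one of the translation functions

$$h = \begin{cases} \ln(y), & \text{for } S_L \text{ (log normal) family;} \\ \ln[y/(1 - y)], & \text{for } S_B \text{ (bounded) family;} \\ \mathrm{Arsh}(y), & \text{for } S_U \text{ (unbounded) family;} \\ y, & \text{for } S_N \text{ (normal) family.} \end{cases} \tag{10}$$

Here $y = (x - \varphi)/\lambda$; $\mathrm{Arsh}(y) = \ln\left[y + \sqrt{y^2 + 1}\right]$.

IV. THE EQUATION, CONFIDENCE AND PREDICTION INTERVALS OF NON-LINEAR REGRESSION TO ESTIMATE THE SOFTWARE SIZE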

The equation, confidence and prediction intervals of non-linear regression to estimate the software size of open-source PHP-based systems are constructed on the basis of the Johnson multivariate normalizing transformation for the four-dimensional non-Gaussian data set: actual software size in thousand lines of code (KLOC) $Y$, the total number of classes $X_1$, the total number of relationships $X_2$ and the average number of attributes per class $X_3$ in the conceptual data model from 32 information systems developed using the PHP programming language with HTML and SQL. Table I contains the data from [1] on four metrics of software for 32 open-source PHP-based systems.
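For reference, the translation functions (10) that underlie the Johnson transformation of this data set can be sketched in a few lines of Python. The names `TRANSLATIONS` and `johnson_component` and the parameter values in the usage line are hypothetical. The sketch evaluates one component of (9) at a time, so it illustrates only the functional form; the multivariate aspect (joint estimation of the parameters together with the covariance matrix Σ) is not shown.

```python
import math

# Translation functions h(y) from (10), keyed by Johnson family name.
TRANSLATIONS = {
    "SL": lambda y: math.log(y),              # log normal family, y > 0
    "SB": lambda y: math.log(y / (1.0 - y)),  # bounded family, 0 < y < 1
    "SU": lambda y: math.asinh(y),            # unbounded family, Arsh(y)
    "SN": lambda y: y,                        # normal family
}


def johnson_component(x, family, gamma, eta, phi, lam):
    """One component of translation (9): z = gamma + eta * h((x - phi) / lam)."""
    y = (x - phi) / lam
    return gamma + eta * TRANSLATIONS[family](y)


# Hypothetical parameters: the midpoint of the S_B support maps to gamma.
print(johnson_component(5.0, "SB", gamma=0.0, eta=1.0, phi=0.0, lam=10.0))
```

Here `math.asinh` is the standard-library inverse hyperbolic sine, which equals the expression given for Arsh(y) in Section III.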

To detect outliers in the data from Table I, we use the technique based on multivariate normalizing transformations and the squared Mahalanobis distance [6]. There are no outliers in the data from Table I at the 0.005 significance level for the Johnson multivariate transformation (9) for the $S_B$ family. The same result was obtained in [6] for the transformation (9) for the $S_U$ family. In [1] it was also assumed that the data contain no outliers.
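The squared Mahalanobis distance itself is straightforward to compute once the data have been normalized. The following sketch is an illustration only: the function names are hypothetical, the sample covariance matrix with divisor N - 1 is an assumption, the data are made up, and the decision rule of [6] (the critical value against which the distances are compared) is not reproduced here.

```python
def solve(a, b):
    """Solve the linear system a @ u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * u[c] for c in range(r + 1, n))
        u[r] = (m[r][n] - s) / m[r][r]
    return u


def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of each row from the sample mean,
    using the sample covariance matrix with divisor N - 1 (an assumption)."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    dev = [[r[j] - mean[j] for j in range(m)] for r in rows]
    cov = [[sum(d[p] * d[q] for d in dev) / (n - 1) for q in range(m)]
           for p in range(m)]
    # d^T cov^{-1} d for every deviation vector d.
    return [sum(di * ui for di, ui in zip(d, solve(cov, d))) for d in dev]


# Hypothetical normalized data: six observations of two variables.
rows = [[0.1, 0.3], [-0.4, 0.2], [1.2, 0.9], [-0.8, -1.1], [0.2, -0.5], [3.5, -2.0]]
d2 = squared_mahalanobis(rows)
print(d2.index(max(d2)))  # index of the observation farthest from the mean
```

An observation would be flagged when its distance exceeds the critical value chosen for the given significance level.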

Parameters of the multivariate transformation (9) for the $S_B$ family were estimated by the maximum likelihood method. Estimators for parameters of the transformation (9) are: $\hat{\gamma}_Y = 9.63091$, $\hat{\gamma}_1 = 15.5355$, $\hat{\gamma}_2 = 25.4294$, $\hat{\gamma}_3 = 0.72801$, $\hat{\eta}_Y = 1.05243$, $\hat{\eta}_1 = 1.58306$, $\hat{\eta}_2 = 2.54714$, $\hat{\eta}_3 = 0.54312$, [...]

[...] transformation is less than 0.25. Although all values of PRED(0.25) in Table III are less than 0.75, the values are greater for Eq. (12). All values of the multiple coefficient of determination $R^2$ in Table III are greater than 0.75, but the value of $R^2$ is greater for Eq. (12) on the basis of the multivariate transformation.

 0.1574 0.1345 0.0554 1.0000 

After normalizing the non-Gaussian data by the multivariate transformation (9) for the $S_B$ family, the linear regression equation (3) is built for normalized data

$$\hat{Z}_Y = \hat{b}_0 + \hat{b}_1 Z_1 + \hat{b}_2 Z_2 + \hat{b}_3 Z_3. \tag{11}$$

Estimators for parameters of the Eq. (11) are such: [...]

[...] where $\hat{Z}_Y$ is the prediction result by the Eq. (11);

$$Z_j = \gamma_j + \eta_j \ln \frac{X_j - \varphi_j}{\varphi_j + \lambda_j - X_j}, \quad \varphi_j < X_j < \varphi_j + \lambda_j, \quad j = 1, 2, 3.$$
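Solving the $S_B$ expression above for the original variable gives the inverse translation that Eq. (4) calls for. The worked derivation below is a sketch under the assumption that the same $S_B$ form is applied to $Y$ with parameters $\gamma_Y$, $\eta_Y$, $\varphi_Y$ and $\lambda_Y$; it follows from (9) and (10) and is not quoted from the paper.

```latex
\begin{aligned}
Z_Y &= \gamma_Y + \eta_Y \ln\frac{Y - \varphi_Y}{\varphi_Y + \lambda_Y - Y}
\;\Longrightarrow\;
\frac{Y - \varphi_Y}{\varphi_Y + \lambda_Y - Y} = e^{(Z_Y - \gamma_Y)/\eta_Y}, \\
Y &= \varphi_Y + \frac{\lambda_Y}{1 + e^{-(Z_Y - \gamma_Y)/\eta_Y}},
\qquad \varphi_Y < Y < \varphi_Y + \lambda_Y .
\end{aligned}
```

Substituting $\hat{Z}_Y$ from Eq. (11) for $Z_Y$ yields a prediction that always lies inside the bounded support of the $S_B$ family, so it cannot become negative when $\varphi_Y > 0$.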

The prediction results by Eq. (12) for values of components of the vector $X = \{X_1, X_2, X_3\}$ from Table I are shown in Table II for two cases: univariate and multivariate normalizing transformations.

For univariate normalizing transformations (10) of the $S_B$ family, the estimators for parameters are: $\hat{\gamma}_Y = 0.77502$, $\hat{\gamma}_1 = 0.59473$, $\hat{\gamma}_2 = 0.57140$, $\hat{\gamma}_3 = 0.68734$, $\hat{\eta}_Y = 0.44395$, $\hat{\eta}_1 = 0.48171$, $\hat{\eta}_2 = 0.49553$, $\hat{\eta}_3 = 0.51970$, $\hat{\varphi}_Y = 2.063$, $\hat{\varphi}_1 = 2.900$, $\hat{\varphi}_2 = 0.900$, $\hat{\varphi}_3 = 3.304$, $\hat{\lambda}_Y = 83.059$, $\hat{\lambda}_1 = 36.695$, $\hat{\lambda}_2 = 23.525$ and $\hat{\lambda}_3 = 13.660$. In the case of univariate normalizing transformations, the estimators for parameters of the Eq. (11) are: $\hat{b}_0 = 3.11 \cdot 10^{-7}$, $\hat{b}_1 = 0.43519$, $\hat{b}_2 = 0.52239$ and $\hat{b}_3 = 0.08546$.
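The univariate estimates above are enough to trace one prediction end to end. The sketch below normalizes the three metrics with the $S_B$ transformation, applies Eq. (11) and maps the result back to KLOC. It assumes that the final step is the inverse $S_B$ translation of $\hat{Z}_Y$, as Eq. (4) prescribes; the function names and the input values in the usage line are hypothetical and are not taken from Table I.

```python
import math

# Univariate S_B estimates reported above: (gamma, eta, phi, lambda).
PARAMS = {
    "Y": (0.77502, 0.44395, 2.063, 83.059),
    "X1": (0.59473, 0.48171, 2.900, 36.695),
    "X2": (0.57140, 0.49553, 0.900, 23.525),
    "X3": (0.68734, 0.51970, 3.304, 13.660),
}
# Estimates b0..b3 of Eq. (11) for the univariate case.
B = (3.11e-7, 0.43519, 0.52239, 0.08546)


def sb_forward(x, gamma, eta, phi, lam):
    """S_B normalization; requires phi < x < phi + lam."""
    if not phi < x < phi + lam:
        raise ValueError("x is outside the S_B support (phi, phi + lam)")
    return gamma + eta * math.log((x - phi) / (phi + lam - x))


def sb_inverse(z, gamma, eta, phi, lam):
    """Inverse S_B translation back to the original scale (assumed form)."""
    return phi + lam / (1.0 + math.exp(-(z - gamma) / eta))


def predict_kloc(x1, x2, x3):
    """Predict size in KLOC: normalize X, apply Eq. (11), invert for Y."""
    inputs = (("X1", x1), ("X2", x2), ("X3", x3))
    z = [sb_forward(x, *PARAMS[name]) for name, x in inputs]
    z_hat = B[0] + sum(b * zj for b, zj in zip(B[1:], z))
    return sb_inverse(z_hat, *PARAMS["Y"])


# Hypothetical system: 10 classes, 8 relationships, 6 attributes per class.
y_hat = predict_kloc(10, 8, 6)
print(y_hat)
```

Inputs outside the estimated $S_B$ bounds raise an error rather than extrapolate, since the logarithm in the $S_B$ transformation is undefined there.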

The confidence and prediction intervals of non-linear regression are defined by (6) and (8), respectively, for the data from Table I on the basis of univariate and multivariate transformations for the 0.05 significance level.

Note that the lower bounds of the prediction interval of linear regression from [1] are negative for thirteen rows of data: 1, 8, 9, 14, 15, 17, 19, 23, 24, 26, 27, 29 and 31. All the lower bounds of the prediction intervals of the non-linear regressions are positive. The widths of the prediction interval of non-linear regression on the basis of the Johnson multivariate transformation are smaller than those for linear regression from [1] for twenty rows of data: 1, 6, 8, 9, 14-20, 22-27, 29, 31 and 32. Also, the widths of the prediction interval of non-linear regression on the basis of the Johnson multivariate transformation are smaller than those based on the Johnson univariate transformation for twenty-three rows of data: 1-4, 6, 8-10, 15-18, 20-26, 28, 29, 31 and 32. Approximately the same results are obtained for the confidence interval of non-linear regression.

[...] data on metrics of software from Table I and the normalized data on the basis of the Johnson univariate and multivariate transformations for the $S_B$ family. It is known that $\beta_2 = m(m + 2)$ holds under multivariate normality. The given equality is a necessary condition for multivariate normality. In our case $\beta_2 = 24$. The estimators of multivariate kurtosis equal 28.66, 37.29 and 23.08 for the data from Table I and for the normalized data on the basis of the Johnson univariate and multivariate transformations, respectively. The values of these estimators indicate that the necessary condition for multivariate normality practically holds only for the data normalized on the basis of the Johnson multivariate transformation and does not hold for the other data.

V. CONCLUSION

The non-linear regression equation to estimate the software size of open-source PHP-based systems is improved on the basis of the Johnson multivariate transformation for the $S_B$ family. This equation, in comparison with other regression equations (both linear and non-linear), has a larger multiple coefficient of determination and a smaller value of MMRE.

When building the equations, confidence and prediction intervals of non-linear regressions for multivariate non-Gaussian data, one should use multivariate transformations.

Usually, poor normalization of multivariate non-Gaussian data, or the application of univariate transformations instead of multivariate ones to normalize such data, may lead to an increase in the width of the confidence and prediction intervals of regressions, both linear and non-linear.

[1] Hee Beng Kuan Tan, Yuan Zhao, and Hongyu Zhang, “Estimating LOC for information systems from their conceptual data models”, in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), May 20-28, 2006, Shanghai, China, pp. 321-330.

[2] Matinee Kiewkanya and Suttipong Surak, “Constructing C++ software size estimation model from class diagram”, in 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), July 13-15, 2016, Khon Kaen, Thailand, pp. 1-6.

[3] D. M. Bates and D. G. Watts, Nonlinear regression analysis and its applications. Wiley, 1988.

[4] T. P. Ryan, Modern regression methods. Wiley, 1997.

[5] S. B. Prykhodko, “Developing the software defect prediction models using regression analysis based on normalizing transformations”, in Abstracts of the Research and Practice Seminar on Modern Problems in Testing of the Applied Software (PTTAS-2016), May 25-26, 2016, Poltava, Ukraine, pp. 6-7.

[6] Prykhodko, Makarova, and Pukhalevych, “Application of the squared Mahalanobis distance for detecting outliers in multivariate non-Gaussian data”, in Proceedings of 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, February 20-24, 2018, pp. 962-965.

[7] K. V. Mardia, “Measures of multivariate skewness and kurtosis with applications”, Biometrika, 57, 1970, pp. 519-530.