Estimating the Software Size of Open-Source PHP-Based Systems Using Non-Linear Regression Analysis

Sergiy Prykhodko^1, Natalia Prykhodko^2, Lidiia Makarova^1

1. Department of Software of Automated Systems, Admiral Makarov National University of Shipbuilding, UKRAINE, Mykolaiv, Heroes of Ukraine ave. 9, email: sergiy.prykhodko@nuos.edu.ua
2. Finance Department, Admiral Makarov National University of Shipbuilding, UKRAINE, Mykolaiv, Heroes of Ukraine ave. 9, email: natalia.prykhodko@nuos.edu.ua

ACIT 2018, June 1-3, 2018, Ceske Budejovice, Czech Republic

Abstract: The equation, confidence and prediction intervals of multivariate non-linear regression for estimating the software size of open-source PHP-based systems are constructed on the basis of the Johnson multivariate normalizing transformation. The constructed equation is compared with the linear regression equation and with the non-linear regression equation based on the Johnson univariate transformation.

Keywords: software size estimation, PHP-based system, multivariate non-linear regression analysis, normalizing transformation, non-Gaussian data.

I. INTRODUCTION

Software size is one of the most important internal metrics of software. The information obtained from estimating the software size is useful for predicting the software development effort by models such as COCOMO II. The papers [1, 2] proposed linear regression equations for estimating the software size for some programming languages, such as VBA, PHP, Java and C++. The proposed equations are constructed by multiple linear regression analysis on the basis of metrics that can be measured from a class diagram. However, there are four basic assumptions that justify the use of linear regression models, one of which is normality of the error distribution. This assumption is valid only in particular cases. This leads to the need to use non-linear regression equations, including for estimating the software size of open-source PHP-based systems.

A normalizing transformation is often a good way to build the equations, confidence and prediction intervals of multiple non-linear regressions [3-5]. According to [4], transformations are used for essentially four purposes, two of which are: first, to obtain approximate normality for the distribution of the error term (residuals); second, to transform the response and/or the predictor in such a way that the strength of the linear relationship between the new (normalized) variables is better than the linear relationship between the dependent and independent random variables. Well-known techniques for building the equations, confidence and prediction intervals of multivariate non-linear regressions are based on univariate normalizing transformations, which do not take into account the correlation between random variables when multivariate non-Gaussian data are normalized. This leads to the need to use multivariate normalizing transformations.

In this paper, we build the equation, confidence and prediction intervals of multivariate non-linear regression for estimating the software size of open-source PHP-based systems on the basis of the Johnson multivariate normalizing transformation (the Johnson normalizing translation) with the help of appropriate techniques proposed in [5].

II. THE TECHNIQUES

The techniques to build the equations, confidence and prediction intervals of non-linear regressions are based on multiple non-linear regression analysis using multivariate normalizing transformations. A multivariate normalizing transformation of the non-Gaussian random vector $P = \{Y, X_1, X_2, \ldots, X_k\}^T$ to the Gaussian random vector $T = \{Z_Y, Z_1, Z_2, \ldots, Z_k\}^T$ is given by

$$T = \psi(P) \tag{1}$$

and the inverse transformation for (1) is

$$P = \psi^{-1}(T). \tag{2}$$

The linear regression equation for the data normalized according to (1) has the form [4]

$$\hat{Z}_Y = \bar{Z}_Y + Z_X^{+}\hat{b}, \tag{3}$$

where $\hat{Z}_Y$ is the prediction result of the linear regression equation for values of the components of the vector $z_X = \{Z_1, Z_2, \ldots, Z_k\}$; $Z_X^{+}$ is the matrix of centered regressors that contains the values $Z_{1_i} - \bar{Z}_1, Z_{2_i} - \bar{Z}_2, \ldots, Z_{k_i} - \bar{Z}_k$; $\hat{b}$ is the estimator for the vector of linear regression equation parameters, $b = \{b_1, b_2, \ldots, b_k\}^T$.

The non-linear regression equation has the form

$$\hat{Y} = \psi_1^{-1}\left[\bar{Z}_Y + Z_X^{+}\hat{b}\right], \tag{4}$$

where $\hat{Y}$ is the prediction result of the non-linear regression equation.

The technique to build a confidence interval of non-linear regression is based on transformations (1) and (2), Eq. (3) and a confidence interval of linear regression for normalized data

$$\hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ \frac{1}{N} + \left(z_X^{+}\right)^T \left[\left(Z_X^{+}\right)^T Z_X^{+}\right]^{-1} z_X^{+} \right\}^{1/2}, \tag{5}$$

where $t_{\alpha/2,\nu}$ is a quantile of Student's t-distribution with $\nu$ degrees of freedom and $\alpha/2$ significance level; $\left(z_X^{+}\right)^T$ is one of the rows of $Z_X^{+}$;

$$S_{Z_Y}^2 = \frac{1}{\nu}\sum_{i=1}^{N}\left(Z_{Y_i} - \hat{Z}_{Y_i}\right)^2, \quad \nu = N - k - 1;$$

$\left(Z_X^{+}\right)^T Z_X^{+}$ is the $k \times k$ matrix

$$\left(Z_X^{+}\right)^T Z_X^{+} = \begin{pmatrix} S_{Z_1 Z_1} & S_{Z_1 Z_2} & \cdots & S_{Z_1 Z_k} \\ S_{Z_1 Z_2} & S_{Z_2 Z_2} & \cdots & S_{Z_2 Z_k} \\ \vdots & \vdots & \ddots & \vdots \\ S_{Z_1 Z_k} & S_{Z_2 Z_k} & \cdots & S_{Z_k Z_k} \end{pmatrix},$$

where $S_{Z_q Z_r} = \sum_{i=1}^{N}\left[Z_{q_i} - \bar{Z}_q\right]\left[Z_{r_i} - \bar{Z}_r\right]$, $q, r = 1, 2, \ldots, k$.

The confidence interval for non-linear regression is built on the basis of the interval (5) and the inverse transformation (2)

$$\psi_1^{-1}\left( \hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ \frac{1}{N} + \left(z_X^{+}\right)^T \left[\left(Z_X^{+}\right)^T Z_X^{+}\right]^{-1} z_X^{+} \right\}^{1/2} \right). \tag{6}$$

The technique to build a prediction interval is based on the multivariate transformation (1), the inverse transformation (2), the linear regression equation for normalized data (3) and a prediction interval for normalized data

$$\hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + \left(z_X^{+}\right)^T \left[\left(Z_X^{+}\right)^T Z_X^{+}\right]^{-1} z_X^{+} \right\}^{1/2}. \tag{7}$$

The prediction interval for non-linear regression is built on the basis of the interval (7) and the inverse transformation (2)

$$\psi_1^{-1}\left( \hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + \left(z_X^{+}\right)^T \left[\left(Z_X^{+}\right)^T Z_X^{+}\right]^{-1} z_X^{+} \right\}^{1/2} \right). \tag{8}$$

III. THE JOHNSON NORMALIZING TRANSLATION

For normalizing the multivariate non-Gaussian data, we use the Johnson translation system. The Johnson normalizing translation is given by

$$Z = \gamma + \eta\, h\left[\lambda^{-1}(X - \varphi)\right] \sim N_m(0_m, \Sigma), \tag{9}$$

where $\Sigma$ is the covariance matrix; $m = k + 1$; $\gamma$, $\eta$, $\varphi$ and $\lambda$ are parameters of the translation (9); $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_m)^T$; $\eta = \mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_m)$; $\lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$; $\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_m)^T$; $h[(y_1, \ldots, y_m)] = \{h_1(y_1), \ldots, h_m(y_m)\}^T$; $h_i(\cdot)$ is one of the translation functions

$$h = \begin{cases} \ln(y), & \text{for } S_L \text{ (log normal) family;} \\ \ln\left[y/(1 - y)\right], & \text{for } S_B \text{ (bounded) family;} \\ \mathrm{Arsh}(y), & \text{for } S_U \text{ (unbounded) family;} \\ y, & \text{for } S_N \text{ (normal) family.} \end{cases} \tag{10}$$

Here $y = (x - \varphi)/\lambda$; $\mathrm{Arsh}(y) = \ln\left[y + \sqrt{y^2 + 1}\right]$.

IV. THE EQUATION, CONFIDENCE AND PREDICTION INTERVALS OF NON-LINEAR REGRESSION TO ESTIMATE THE SOFTWARE SIZE

The equation, confidence and prediction intervals of non-linear regression to estimate the software size of open-source PHP-based systems are constructed on the basis of the Johnson multivariate normalizing transformation for a four-dimensional non-Gaussian data set: the actual software size in thousand lines of code (KLOC) $Y$, the total number of classes $X_1$, the total number of relationships $X_2$ and the average number of attributes per class $X_3$ in the conceptual data model, taken from 32 information systems developed using the PHP programming language with HTML and SQL. Table I contains the data from [1] on these four software metrics for the 32 open-source PHP-based systems.

TABLE I. THE DATA ON SOFTWARE METRICS

| i | Y | X1 | X2 | X3 |
|---|---|---|---|---|
| 1 | 3.038 | 5 | 2 | 10.6 |
| 2 | 22.599 | 17 | 7 | 7 |
| 3 | 32.243 | 21 | 13 | 4.524 |
| 4 | 16.164 | 13 | 11 | 7.077 |
| 5 | 83.862 | 35 | 24 | 6.571 |
| 6 | 24.22 | 13 | 9 | 8.077 |
| 7 | 63.929 | 35 | 19 | 8.029 |
| 8 | 2.543 | 5 | 3 | 9.4 |
| 9 | 6.697 | 5 | 5 | 7 |
| 10 | 55.537 | 25 | 14 | 8.64 |
| 11 | 55.752 | 39 | 10 | 9.077 |
| 12 | 62.602 | 30 | 17 | 7 |
| 13 | 67.111 | 23 | 22 | 14.957 |
| 14 | 2.552 | 3 | 1 | 8.333 |
| 15 | 12.17 | 10 | 5 | 3.7 |
| 16 | 12.757 | 13 | 9 | 5 |
| 17 | 5.695 | 7 | 3 | 8.429 |
| 18 | 7.744 | 9 | 6 | 9.222 |
| 19 | 7.514 | 4 | 1 | 8 |
| 20 | 11.054 | 9 | 9 | 3.667 |
| 21 | 29.77 | 17 | 15 | 3.412 |
| 22 | 11.653 | 9 | 8 | 8.778 |
| 23 | 6.847 | 5 | 4 | 3.6 |
| 24 | 13.389 | 7 | 5 | 11.714 |
| 25 | 14.45 | 12 | 6 | 16.583 |
| 26 | 4.414 | 6 | 3 | 3.667 |
| 27 | 2.102 | 3 | 1 | 3.333 |
| 28 | 42.819 | 20 | 18 | 3.5 |
| 29 | 4.077 | 4 | 2 | 9 |
| 30 | 57.408 | 33 | 14 | 9.242 |
| 31 | 7.428 | 7 | 3 | 7 |
| 32 | 8.947 | 15 | 5 | 4 |

For detecting outliers in the data from Table I we use the technique based on multivariate normalizing transformations and the squared Mahalanobis distance [6]. There are no outliers in the data from Table I for the 0.005 significance level and the Johnson multivariate transformation (9) for the $S_B$ family. The same result was obtained in [6] for the transformation (9) for the $S_U$ family. In [1] it was also assumed that the data contain no outliers.

Parameters of the multivariate transformation (9) for the $S_B$ family were estimated by the maximum likelihood method. The estimators for the parameters of the transformation (9) are: $\hat{\gamma}_Y = 9.63091$, $\hat{\gamma}_1 = 15.5355$, $\hat{\gamma}_2 = 25.4294$, $\hat{\gamma}_3 = 0.72801$, $\hat{\eta}_Y = 1.05243$, $\hat{\eta}_1 = 1.58306$, $\hat{\eta}_2 = 2.54714$, $\hat{\eta}_3 = 0.54312$, $\hat{\varphi}_Y = -1.4568$, $\hat{\varphi}_1 = -1.8884$, $\hat{\varphi}_2 = -6.9746$, $\hat{\varphi}_3 = 3.2925$, $\hat{\lambda}_Y = 153102.605$, $\hat{\lambda}_1 = 243051.0$, $\hat{\lambda}_2 = 311229.5$ and $\hat{\lambda}_3 = 13.900$. The sample covariance matrix $S_N$ of $T$ is used as the approximate moment-matching estimator of $\Sigma$

$$S_N = \begin{pmatrix} 1.0000 & 0.9514 & 0.9333 & 0.1574 \\ 0.9514 & 1.0000 & 0.9006 & 0.1345 \\ 0.9333 & 0.9006 & 1.0000 & 0.0554 \\ 0.1574 & 0.1345 & 0.0554 & 1.0000 \end{pmatrix}.$$

After normalizing the non-Gaussian data by the multivariate transformation (9) for the $S_B$ family, the linear regression equation (3) is built for the normalized data

$$\hat{Z}_Y = \hat{b}_0 + \hat{b}_1 Z_1 + \hat{b}_2 Z_2 + \hat{b}_3 Z_3. \tag{11}$$

The estimators for the parameters of Eq. (11) are: $\hat{b}_0 = 1.02 \cdot 10^{-5}$, $\hat{b}_1 = 0.56085$, $\hat{b}_2 = 0.42491$, $\hat{b}_3 = 0.05846$.

After that the non-linear regression equation (4) is built

$$\hat{Y} = \hat{\varphi}_Y + \hat{\lambda}_Y \left[1 + e^{-\left(\hat{Z}_Y - \hat{\gamma}_Y\right)/\hat{\eta}_Y}\right]^{-1}, \tag{12}$$

where $\hat{Z}_Y$ is the prediction result by Eq. (11),

$$Z_j = \gamma_j + \eta_j \ln\frac{X_j - \varphi_j}{\varphi_j + \lambda_j - X_j}, \quad \varphi_j < X_j < \varphi_j + \lambda_j, \quad j = 1, 2, 3.$$

The prediction results by Eq. (12) for values of the components of the vector $X = \{X_1, X_2, X_3\}$ from Table I are shown in Table II for two cases: univariate and multivariate normalizing transformations.

For the univariate normalizing transformations (10) of the $S_B$ family the estimators for the parameters are: $\hat{\gamma}_Y = 0.77502$, $\hat{\gamma}_1 = 0.59473$, $\hat{\gamma}_2 = 0.57140$, $\hat{\gamma}_3 = 0.68734$, $\hat{\eta}_Y = 0.44395$, $\hat{\eta}_1 = 0.48171$, $\hat{\eta}_2 = 0.49553$, $\hat{\eta}_3 = 0.51970$, $\hat{\varphi}_Y = 2.063$, $\hat{\varphi}_1 = 2.900$, $\hat{\varphi}_2 = 0.900$, $\hat{\varphi}_3 = 3.304$, $\hat{\lambda}_Y = 83.059$, $\hat{\lambda}_1 = 36.695$, $\hat{\lambda}_2 = 23.525$ and $\hat{\lambda}_3 = 13.660$. In the case of univariate normalizing transformations the estimators for the parameters of Eq. (11) are: $\hat{b}_0 = 3.11 \cdot 10^{-7}$, $\hat{b}_1 = 0.43519$, $\hat{b}_2 = 0.52239$ and $\hat{b}_3 = 0.08546$.

Table II also contains the prediction results by the linear regression equation from [1] for values of the components of the vector $X = \{X_1, X_2, X_3\}$ from Table I. Note that the prediction results by the linear regression equation from [1] are negative for three rows of data: 14, 19 and 27. All prediction results by the non-linear regression equation (12) are positive.

Magnitude of relative error (MRE), mean magnitude of relative error (MMRE) and percentage of prediction (PRED(0.25)) are accepted as standard evaluations of prediction results by regression equations. The values of MRE for the linear regression equation from [1] and for the non-linear regression equation (12) in two cases (univariate and multivariate normalizing transformations) are shown in Table II. The acceptable values of MMRE and PRED(0.25) are not more than 0.25 and not less than 0.75 respectively. The values of MMRE in Table III indicate that only the value for Eq. (12) on the basis of the multivariate normalizing transformation is less than 0.25. All values of PRED(0.25) in Table III are less than 0.75, but the values for Eq. (12) are greater than the value for the linear regression equation. All values of the multiple coefficient of determination $R^2$ in Table III are greater than 0.75, and the value of $R^2$ is greatest for Eq. (12) on the basis of the multivariate transformation.

TABLE II. PREDICTION RESULTS AND MRE OF REGRESSION EQUATIONS

| i | Linear regression equation: Ŷ | Linear regression equation: MRE | Non-linear regression equation, univariate transformation: Ŷ | Non-linear regression equation, univariate transformation: MRE | Non-linear regression equation, multivariate transformation: Ŷ | Non-linear regression equation, multivariate transformation: MRE |
|---|---|---|---|---|---|---|
| 1 | 3.237 | 0.0656 | 4.675 | 0.5388 | 4.550 | 0.4976 |
| 2 | 24.142 | 0.0683 | 19.965 | 0.1166 | 19.990 | 0.1154 |
| 3 | 37.524 | 0.1638 | 32.098 | 0.0045 | 33.535 | 0.0401 |
| 4 | 25.916 | 0.6033 | 23.171 | 0.4335 | 21.292 | 0.3173 |
| 5 | 74.624 | 0.1102 | 80.265 | 0.0429 | 83.618 | 0.0029 |
| 6 | 23.224 | 0.0411 | 20.524 | 0.1526 | 18.901 | 0.2196 |
| 7 | 67.215 | 0.0514 | 65.913 | 0.0310 | 70.647 | 0.1051 |
| 8 | 4.127 | 0.6228 | 5.789 | 1.2764 | 5.169 | 1.0328 |
| 9 | 5.906 | 0.1181 | 7.353 | 0.0980 | 6.356 | 0.0509 |
| 10 | 46.843 | 0.1565 | 42.098 | 0.2420 | 43.126 | 0.2235 |
| 11 | 57.814 | 0.0370 | 67.070 | 0.2030 | 49.823 | 0.1064 |
| 12 | 56.995 | 0.0896 | 53.497 | 0.1454 | 56.651 | 0.0951 |
| 13 | 61.856 | 0.0783 | 65.500 | 0.0240 | 60.617 | 0.0968 |
| 14 | -2.395 | 1.9384 | 2.202 | 0.1370 | 2.447 | 0.0412 |
| 15 | 9.959 | 0.1816 | 9.693 | 0.2035 | 10.029 | 0.1759 |
| 16 | 21.218 | 0.6632 | 18.682 | 0.4644 | 18.105 | 0.4192 |
| 17 | 5.976 | 0.0493 | 7.083 | 0.2438 | 6.687 | 0.1743 |
| 18 | 13.991 | 0.8067 | 12.911 | 0.6673 | 11.301 | 0.4593 |
| 19 | -1.371 | 1.1825 | 2.496 | 0.6678 | 3.096 | 0.5880 |
| 20 | 15.385 | 0.3918 | 13.301 | 0.2032 | 12.850 | 0.1625 |
| 21 | 35.179 | 0.1817 | 27.321 | 0.0823 | 29.061 | 0.0238 |
| 22 | 17.045 | 0.4627 | 15.461 | 0.3268 | 13.268 | 0.1386 |
| 23 | 2.017 | 0.7054 | 5.435 | 0.2062 | 5.112 | 0.2534 |
| 24 | 11.462 | 0.1440 | 10.367 | 0.2257 | 8.661 | 0.3531 |
| 25 | 22.513 | 0.5580 | 20.191 | 0.3973 | 15.888 | 0.0995 |
| 26 | 1.630 | 0.6307 | 5.318 | 0.2048 | 5.260 | 0.1916 |
| 27 | -5.655 | 3.6902 | 2.142 | 0.0192 | 1.873 | 0.1090 |
| 28 | 43.975 | 0.0270 | 37.967 | 0.1133 | 38.631 | 0.0978 |
| 29 | 0.953 | 0.7662 | 3.892 | 0.0454 | 3.732 | 0.0846 |
| 30 | 57.164 | 0.0043 | 53.121 | 0.0747 | 54.381 | 0.0527 |
| 31 | 5.044 | 0.3209 | 6.861 | 0.0764 | 6.571 | 0.1154 |
| 32 | 16.360 | 0.8285 | 12.934 | 0.4456 | 14.258 | 0.5936 |

The confidence and prediction intervals of non-linear regression are defined by (6) and (8) respectively for the data from Table I.

TABLE III. VALUES OF R², MMRE AND PRED(0.25)

| Coefficients | Linear regression equation | Non-linear regression equation, univariate transformation | Non-linear regression equation, multivariate transformation |
|---|---|---|---|
| R² | 0.9491 | 0.9591 | 0.9692 |
| MMRE | 0.4919 | 0.2535 | 0.2199 |
| PRED(0.25) | 0.5313 | 0.7188 | 0.7188 |

Table IV contains the lower (LB) and upper (UB) bounds of the prediction intervals of the linear regression and of the non-linear regressions on the basis of univariate and multivariate transformations for the 0.05 significance level.

Note that the lower bounds of the prediction interval of the linear regression from [1] are negative for thirteen rows of data: 1, 8, 9, 14, 15, 17, 19, 23, 24, 26, 27, 29 and 31. All the lower bounds of the prediction intervals of the non-linear regressions are positive. The prediction interval of the non-linear regression on the basis of the Johnson multivariate transformation is narrower than that of the linear regression from [1] for twenty rows of data: 1, 6, 8, 9, 14-20, 22-27, 29, 31 and 32. It is also narrower than the prediction interval based on the Johnson univariate transformation for twenty-three rows of data: 1-4, 6, 8-10, 15-18, 20-26, 28, 29, 31 and 32. Approximately the same results are obtained for the confidence interval of non-linear regression.

TABLE IV. BOUNDS OF THE PREDICTION INTERVALS

| i | Linear regression: LB | Linear regression: UB | Non-linear regression, univariate transformation: LB | Non-linear regression, univariate transformation: UB | Non-linear regression, multivariate transformation: LB | Non-linear regression, multivariate transformation: UB |
|---|---|---|---|---|---|---|
| 1 | -8.886 | 15.361 | 2.507 | 15.664 | 2.053 | 8.822 |
| 2 | 12.260 | 36.024 | 5.800 | 53.204 | 11.088 | 35.207 |
| 3 | 25.530 | 49.517 | 9.341 | 65.987 | 19.149 | 57.962 |
| 4 | 14.031 | 37.802 | 6.642 | 57.342 | 11.955 | 37.129 |
| 5 | 61.845 | 87.403 | 59.920 | 84.392 | 47.603 | 146.045 |
| 6 | 11.451 | 34.998 | 5.956 | 53.906 | 10.617 | 32.866 |
| 7 | 54.797 | 79.633 | 31.210 | 81.247 | 40.528 | 122.355 |
| 8 | -7.849 | 16.103 | 2.713 | 20.215 | 2.431 | 9.838 |
| 9 | -5.998 | 17.810 | 2.996 | 26.099 | 3.097 | 11.949 |
| 10 | 34.901 | 58.785 | 13.397 | 72.304 | 24.761 | 74.346 |
| 11 | 43.606 | 72.022 | 26.251 | 82.571 | 26.759 | 91.726 |
| 12 | 44.844 | 69.146 | 19.861 | 77.358 | 32.563 | 97.782 |
| 13 | 47.957 | 75.755 | 28.542 | 81.562 | 33.153 | 109.857 |
| 14 | -14.415 | 9.625 | 2.084 | 2.994 | 0.811 | 5.262 |
| 15 | -2.080 | 21.999 | 3.441 | 33.425 | 5.255 | 18.197 |
| 16 | 9.355 | 33.081 | 5.492 | 51.258 | 10.150 | 31.513 |
| 17 | -5.925 | 17.877 | 2.964 | 24.822 | 3.336 | 12.381 |
| 18 | 2.136 | 25.846 | 4.145 | 40.894 | 6.095 | 20.095 |
| 19 | -13.374 | 10.632 | 2.127 | 4.916 | 1.198 | 6.351 |
| 20 | 3.243 | 27.527 | 4.154 | 42.480 | 6.867 | 23.133 |
| 21 | 22.801 | 47.556 | 7.324 | 63.400 | 15.978 | 51.960 |
| 22 | 5.148 | 28.943 | 4.693 | 46.152 | 7.200 | 23.590 |
| 23 | -10.093 | 14.128 | 2.635 | 19.103 | 2.367 | 9.829 |
| 24 | -0.715 | 23.638 | 3.576 | 35.238 | 4.477 | 15.796 |
| 25 | 9.337 | 35.689 | 5.323 | 56.560 | 8.396 | 29.076 |
| 26 | -10.481 | 13.741 | 2.621 | 18.450 | 2.464 | 10.048 |
| 27 | -17.916 | 6.606 | 2.073 | 2.648 | 0.410 | 4.484 |
| 28 | 31.335 | 56.615 | 10.895 | 70.978 | 21.432 | 68.748 |
| 29 | -11.043 | 12.949 | 2.371 | 12.014 | 1.575 | 7.423 |
| 30 | 44.632 | 69.696 | 19.170 | 77.441 | 30.902 | 94.883 |
| 31 | -6.838 | 16.926 | 2.926 | 23.959 | 3.273 | 12.168 |
| 32 | 4.173 | 28.547 | 4.090 | 41.560 | 7.530 | 26.021 |

Following [7], the multivariate kurtosis $\beta_2$ is estimated for the data on software metrics from Table I and for the data normalized on the basis of the Johnson univariate and multivariate transformations for the $S_B$ family. It is known that $\beta_2 = m(m + 2)$ holds under multivariate normality. This equality is a necessary condition for multivariate normality. In our case $m = 4$, so $\beta_2 = 24$. The estimators of multivariate kurtosis equal 28.66, 37.29 and 23.08 for the data from Table I and for the data normalized on the basis of the Johnson univariate and multivariate transformations respectively. These values indicate that the necessary condition for multivariate normality is practically satisfied only for the data normalized on the basis of the Johnson multivariate transformation and does not hold for the other data.

V. CONCLUSION

The non-linear regression equation to estimate the software size of open-source PHP-based systems is improved on the basis of the Johnson multivariate transformation for the $S_B$ family. This equation, in comparison with the other regression equations (both linear and non-linear), has a larger multiple coefficient of determination and a smaller value of MMRE.

When building the equations, confidence and prediction intervals of non-linear regressions for multivariate non-Gaussian data, one should use multivariate transformations. Poor normalization of multivariate non-Gaussian data, or the application of univariate transformations instead of multivariate ones to normalize such data, usually may lead to an increase in the width of the confidence and prediction intervals of regressions, both linear and non-linear.

REFERENCES

[1] Hee Beng Kuan Tan, Yuan Zhao, and Hongyu Zhang, "Estimating LOC for information systems from their conceptual data models", in Proceedings of the 28th International Conference on Software Engineering (ICSE '06), May 20-28, 2006, Shanghai, China, pp. 321-330.
[2] Matinee Kiewkanya, and Suttipong Surak, "Constructing C++ software size estimation model from class diagram", in 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), July 13-15, 2016, Khon Kaen, Thailand, pp. 1-6.
[3] D. M. Bates, and D. G. Watts, Nonlinear regression analysis and its applications. Wiley, 1988.
[4] T. P. Ryan, Modern regression methods. Wiley, 1997.
[5] S. B. Prykhodko, "Developing the software defect prediction models using regression analysis based on normalizing transformations", in Abstracts of the Research and Practice Seminar on Modern Problems in Testing of the Applied Software (PTTAS-2016), May 25-26, 2016, Poltava, Ukraine, pp. 6-7.
[6] S. Prykhodko, N. Prykhodko, L. Makarova, and A. Pukhalevych, "Application of the squared Mahalanobis distance for detecting outliers in multivariate non-Gaussian data", in Proceedings of 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), Lviv-Slavske, Ukraine, February 20-24, 2018, pp. 962-965.
[7] K. V. Mardia, "Measures of multivariate skewness and kurtosis with applications", Biometrika, 57, 1970, pp. 519-530.
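
As an illustration of how Eqs. (11) and (12) are applied, the following minimal Python sketch normalizes the regressors with the S_B translation, evaluates the linear equation for normalized data and maps the result back to KLOC. The numeric constants are the multivariate estimators reported in Section IV; the function and variable names (`johnson_sb`, `johnson_sb_inverse`, `predict_size`) are illustrative assumptions and not part of the paper.

```python
import math

# Estimators reported in Section IV for the multivariate S_B transformation (9).
# Keys: "Y" is software size in KLOC; 1, 2, 3 index the regressors X1, X2, X3.
GAMMA = {"Y": 9.63091, 1: 15.5355, 2: 25.4294, 3: 0.72801}
ETA = {"Y": 1.05243, 1: 1.58306, 2: 2.54714, 3: 0.54312}
PHI = {"Y": -1.4568, 1: -1.8884, 2: -6.9746, 3: 3.2925}
LAM = {"Y": 153102.605, 1: 243051.0, 2: 311229.5, 3: 13.900}
# Estimators of Eq. (11): intercept b0 followed by slopes b1, b2, b3.
B = [1.02e-5, 0.56085, 0.42491, 0.05846]


def johnson_sb(x, gamma, eta, phi, lam):
    """S_B translation: z = gamma + eta * ln((x - phi) / (phi + lam - x))."""
    if not phi < x < phi + lam:
        raise ValueError("x must lie strictly between phi and phi + lam")
    return gamma + eta * math.log((x - phi) / (phi + lam - x))


def johnson_sb_inverse(z, gamma, eta, phi, lam):
    """Inverse S_B translation, i.e. the right-hand side of Eq. (12)."""
    return phi + lam / (1.0 + math.exp(-(z - gamma) / eta))


def predict_size(x1, x2, x3):
    """Predict software size in KLOC from X1, X2, X3 via Eqs. (11) and (12)."""
    z = [johnson_sb(x, GAMMA[j], ETA[j], PHI[j], LAM[j])
         for j, x in zip((1, 2, 3), (x1, x2, x3))]
    z_hat = B[0] + sum(b * zj for b, zj in zip(B[1:], z))
    return johnson_sb_inverse(z_hat, GAMMA["Y"], ETA["Y"], PHI["Y"], LAM["Y"])


if __name__ == "__main__":
    # Row 1 of Table I: X1 = 5, X2 = 2, X3 = 10.6.
    print(round(predict_size(5, 2, 10.6), 3))
```

For row 1 of Table I the sketch can be compared with the value 4.550 listed in the multivariate column of Table II; small differences in the last digit are possible because the published estimators are rounded.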
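
The evaluation measures used in Tables II and III can be sketched as follows. The paper does not spell out the formulas, so this sketch assumes the usual definitions: MRE is |Y - Ŷ| / Y, MMRE is the mean of MRE, and PRED(0.25) is the share of predictions with MRE not exceeding 0.25. The function names are illustrative, and the sample values are the first five rows of Table I together with the multivariate predictions from Table II.

```python
def mre(actual, predicted):
    """Magnitude of relative error for a single observation."""
    return abs(actual - predicted) / abs(actual)


def mmre(actuals, predictions):
    """Mean magnitude of relative error over all observations."""
    errors = [mre(a, p) for a, p in zip(actuals, predictions)]
    return sum(errors) / len(errors)


def pred(actuals, predictions, level=0.25):
    """Share of observations whose MRE does not exceed the given level."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= level)
    return hits / len(actuals)


if __name__ == "__main__":
    # First five rows: actual Y (Table I) and multivariate predictions (Table II).
    actual = [3.038, 22.599, 32.243, 16.164, 83.862]
    predicted = [4.550, 19.990, 33.535, 21.292, 83.618]
    print([round(mre(a, p), 4) for a, p in zip(actual, predicted)])
    print(round(mmre(actual, predicted), 4), pred(actual, predicted))
```

Under this definition, the value PRED(0.25) = 0.7188 in Table III corresponds to 23 of the 32 rows having MRE of at most 0.25.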
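
The normality check based on multivariate kurtosis [7] can be sketched in plain Python. The sketch assumes Mardia's sample estimator, the mean of the squared values of the squared Mahalanobis distances computed with the covariance matrix that uses the divisor N; the paper does not state which divisor was used, so this choice is an assumption. The helper names `invert` and `mardia_kurtosis` are illustrative.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity on the right.
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mardia_kurtosis(data):
    """Sample multivariate kurtosis: mean of squared Mahalanobis distances squared."""
    n, m = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in data]
    # Covariance with divisor N (assumed here, following Mardia's definition).
    cov = [[sum(c[q] * c[r] for c in centered) / n for r in range(m)]
           for q in range(m)]
    inv = invert(cov)
    total = 0.0
    for c in centered:
        d2 = sum(c[q] * inv[q][r] * c[r] for q in range(m) for r in range(m))
        total += d2 ** 2
    return total / n


if __name__ == "__main__":
    # Tiny two-dimensional example; under normality the target is m(m + 2) = 8.
    points = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    print(mardia_kurtosis(points))
```

Applied to the 32 normalized observations with m = 4, the estimate would be compared with m(m + 2) = 24, as in Section IV.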
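
The construction of the prediction interval in Eqs. (7) and (8) for the S_B family can be sketched as below. The quantile of Student's t-distribution, the residual standard deviation S_ZY and the quadratic form (z_X^+)^T [(Z_X^+)^T Z_X^+]^{-1} z_X^+ (called `leverage` here) are passed in by the caller, because the paper does not report S_ZY or the matrix (Z_X^+)^T Z_X^+. The values of `s_zy` and `leverage` in the usage example are hypothetical placeholders, not results from the paper, and the function name is illustrative.

```python
import math


def prediction_interval_sb(z_hat, t_quantile, s_zy, n, leverage,
                           gamma, eta, phi, lam):
    """Back-transform the normalized prediction interval (7) as in Eq. (8).

    leverage is the quadratic form (z_X^+)^T [(Z_X^+)^T Z_X^+]^{-1} z_X^+.
    Returns (lower, upper) bounds on the original scale.
    """
    half_width = t_quantile * s_zy * math.sqrt(1.0 + 1.0 / n + leverage)

    def inverse(z):
        # Inverse S_B translation for the response Y, cf. Eq. (12).
        return phi + lam / (1.0 + math.exp(-(z - gamma) / eta))

    return inverse(z_hat - half_width), inverse(z_hat + half_width)


if __name__ == "__main__":
    # Response parameters from Section IV (multivariate S_B transformation).
    gamma_y, eta_y, phi_y, lam_y = 9.63091, 1.05243, -1.4568, 153102.605
    # t quantile for alpha = 0.05 and nu = 32 - 3 - 1 = 28 degrees of freedom.
    t_quantile = 2.048
    # HYPOTHETICAL placeholders: s_zy and leverage are not given in the paper.
    low, high = prediction_interval_sb(z_hat=-1.047, t_quantile=t_quantile,
                                       s_zy=0.25, n=32, leverage=0.1,
                                       gamma=gamma_y, eta=eta_y,
                                       phi=phi_y, lam=lam_y)
    print(low, high)
```

Because the interval is symmetric in the normalized space and the inverse translation is monotone, the back-transformed bounds are asymmetric around the point prediction and always stay inside the range (φ_Y, φ_Y + λ_Y) of the S_B family.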