<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimating the Software Size of Open-Source PHP-Based Systems Using Non-Linear Regression Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergiy Prykhodko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalia Prykhodko</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lidiia Makarova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software of Automated Systems, Admiral Makarov National University of Shipbuilding, UKRAINE</institution>
          ,
          <addr-line>Mykolaiv, Heroes of</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>1</fpage>
      <lpage>3</lpage>
      <abstract>
        <p>The equation, confidence and prediction intervals of multivariate non-linear regression for estimating the software size of open-source PHP-based systems are constructed on the basis of the Johnson multivariate normalizing transformation. The constructed equation is compared with the linear regression equation and with the non-linear regression equation based on the Johnson univariate transformation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Software size is one of the most important internal metrics
of software. The information obtained from estimating the
software size is useful for predicting the software
development effort with models such as COCOMO II. The
papers [1, 2] proposed linear regression equations for
estimating the software size for several programming languages,
such as VBA, PHP, Java and C++. The proposed equations
are constructed by multiple linear regression analysis on the
basis of metrics that can be measured from a class diagram.
However, there are four basic assumptions that justify the use
of linear regression models, one of which is normality of the
error distribution. This assumption is valid only in
particular cases, which leads to the need to use non-linear
regression equations, including for estimating the software
size of open-source PHP-based systems.</p>
      <p>A normalizing transformation is often a good way to build
the equations, confidence and prediction intervals of multiple
non-linear regressions [3-5]. According to [4], transformations
are used for essentially four purposes, two of which are: first,
to obtain approximate normality for the distribution of the
error term (residuals); second, to transform the response
and/or the predictor in such a way that the strength of the
linear relationship between the new variables (normalized
variables) is better than the linear relationship between the
dependent and independent random variables. Well-known
techniques for building the equations, confidence and
prediction intervals of multivariate non-linear regressions are
based on univariate normalizing transformations, which
do not take into account the correlation between random
variables when normalizing multivariate
non-Gaussian data. This leads to the need to use multivariate
normalizing transformations.</p>
      <p>In this paper, we build the equation, confidence and
prediction intervals of multivariate non-linear regression for
estimating the software size of open-source PHP-based
systems on the basis of the Johnson multivariate normalizing
transformation (the Johnson normalizing translation) with the
help of appropriate techniques proposed in [5].</p>
    </sec>
    <sec id="sec-2">
      <title>II. THE TECHNIQUES</title>
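      <p>The techniques to build the equations, confidence and
prediction intervals of non-linear regressions are based on
multiple non-linear regression analysis using multivariate
normalizing transformations. A multivariate normalizing
transformation of the non-Gaussian random vector
<inline-formula><tex-math>\mathbf{P} = \{Y, X_1, X_2, \ldots, X_k\}^T</tex-math></inline-formula>
to the Gaussian random vector
<inline-formula><tex-math>\mathbf{T} = \{Z_Y, Z_1, Z_2, \ldots, Z_k\}^T</tex-math></inline-formula>
is given by</p>
      <disp-formula><label>(1)</label><tex-math>\mathbf{T} = \psi(\mathbf{P})</tex-math></disp-formula>
      <p>and the inverse transformation for (1) by</p>
      <disp-formula><label>(2)</label><tex-math>\mathbf{P} = \psi^{-1}(\mathbf{T}).</tex-math></disp-formula>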
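      <p>The linear regression equation for normalized data
according to (1) will have the form [4]</p>
      <disp-formula><label>(3)</label><tex-math>\hat{Z}_Y = \bar{Z}_Y + (\mathbf{Z}_X^{+})\hat{\mathbf{b}},</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>\hat{Z}_Y</tex-math></inline-formula> is the prediction result of the linear regression equation
for the values of the components of the vector <inline-formula><tex-math>\mathbf{z}_X = \{Z_1, Z_2, \ldots, Z_k\}</tex-math></inline-formula>;
<inline-formula><tex-math>\mathbf{Z}_X^{+}</tex-math></inline-formula> is the matrix of centered regressors that contains the
values <inline-formula><tex-math>Z_{1i} - \bar{Z}_1,\ Z_{2i} - \bar{Z}_2,\ \ldots,\ Z_{ki} - \bar{Z}_k</tex-math></inline-formula>;
<inline-formula><tex-math>\hat{\mathbf{b}}</tex-math></inline-formula> is the estimator for the
vector of linear regression equation parameters,
<inline-formula><tex-math>\mathbf{b} = \{b_1, b_2, \ldots, b_k\}^T</tex-math></inline-formula>.</p>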
      <p>The non-linear regression equation will have the form</p>
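      <disp-formula><label>(4)</label><tex-math>\hat{Y} = \psi_1^{-1}\left[\bar{Z}_Y + (\mathbf{Z}_X^{+})\hat{\mathbf{b}}\right],</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>\hat{Y}</tex-math></inline-formula> is the prediction result of the non-linear regression equation.</p>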
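      <p>The technique to build a non-linear regression equation is
based on transformations (1) and (2), Eq. (3) and a
confidence interval of linear regression for normalized data</p>
      <disp-formula><label>(5)</label><tex-math>\hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ \frac{1}{N} + (\mathbf{z}_X^{+})^T \left[ (\mathbf{Z}_X^{+})^T \mathbf{Z}_X^{+} \right]^{-1} (\mathbf{z}_X^{+}) \right\}^{1/2},</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>t_{\alpha/2,\nu}</tex-math></inline-formula> is a quantile of Student's t-distribution with
<inline-formula><tex-math>\nu</tex-math></inline-formula> degrees of freedom and <inline-formula><tex-math>\alpha/2</tex-math></inline-formula> significance level;
<inline-formula><tex-math>(\mathbf{z}_X^{+})^T</tex-math></inline-formula> is one of the rows of <inline-formula><tex-math>\mathbf{Z}_X^{+}</tex-math></inline-formula>;
<inline-formula><tex-math>S_{Z_Y}^2 = \frac{1}{\nu} \sum_{i=1}^{N} (Z_{Y_i} - \hat{Z}_{Y_i})^2</tex-math></inline-formula>,
<inline-formula><tex-math>\nu = N - k - 1</tex-math></inline-formula>;
<inline-formula><tex-math>(\mathbf{Z}_X^{+})^T \mathbf{Z}_X^{+}</tex-math></inline-formula> is the
<inline-formula><tex-math>k \times k</tex-math></inline-formula> matrix</p>
      <disp-formula><tex-math><![CDATA[(\mathbf{Z}_X^{+})^T \mathbf{Z}_X^{+} =
\begin{pmatrix}
S_{Z_1 Z_1} & S_{Z_1 Z_2} & \cdots & S_{Z_1 Z_k} \\
S_{Z_1 Z_2} & S_{Z_2 Z_2} & \cdots & S_{Z_2 Z_k} \\
\vdots & \vdots & \ddots & \vdots \\
S_{Z_1 Z_k} & S_{Z_2 Z_k} & \cdots & S_{Z_k Z_k}
\end{pmatrix},]]></tex-math></disp-formula>
      <p>where <inline-formula><tex-math>S_{Z_q Z_r} = \sum_{i=1}^{N} [Z_{qi} - \bar{Z}_q][Z_{ri} - \bar{Z}_r]</tex-math></inline-formula>,
<inline-formula><tex-math>q, r = 1, 2, \ldots, k</tex-math></inline-formula>.</p>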
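      <p>The quantities entering interval (5) can be computed directly from the normalized data. The following Python sketch is illustrative only and is not the authors' implementation: the function names (<monospace>cross_product_matrix</monospace>, <monospace>solve</monospace>, <monospace>leverage</monospace>, <monospace>fit_normalized</monospace>) are hypothetical, the regressors are assumed to be supplied as a list of k columns of normalized values, and ordinary least squares is assumed for the estimator of b.</p>
      <preformat><![CDATA[
```python
def centered_columns(columns):
    """Center each regressor column Z_q about its sample mean."""
    return [[v - sum(col) / len(col) for v in col] for col in columns]


def cross_product_matrix(columns):
    """k x k matrix with entries S_ZqZr = sum_i (Z_qi - mean_q)(Z_ri - mean_r)."""
    c = centered_columns(columns)
    return [[sum(a * b for a, b in zip(cq, cr)) for cr in c] for cq in c]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting.

    Assumes a is square and non-singular (a zero pivot raises ZeroDivisionError).
    """
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverage(centered_row, s):
    """Quadratic form z^T [(Z^T Z)]^{-1} z for one centered regressor row z."""
    x = solve(s, centered_row)
    return sum(a * b for a, b in zip(centered_row, x))


def fit_normalized(z_y, columns):
    """Return (b_hat, S^2_ZY, nu) for the regression of Z_Y on centered regressors."""
    n, k = len(z_y), len(columns)
    mean_y = sum(z_y) / n
    c = centered_columns(columns)
    s = cross_product_matrix(columns)
    rhs = [sum(cq[i] * (z_y[i] - mean_y) for i in range(n)) for cq in c]
    b_hat = solve(s, rhs)
    fitted = [mean_y + sum(b * cq[i] for b, cq in zip(b_hat, c)) for i in range(n)]
    nu = n - k - 1  # degrees of freedom as defined for (5)
    s2 = sum((y - f) ** 2 for y, f in zip(z_y, fitted)) / nu
    return b_hat, s2, nu
```
]]></preformat>
      <p>Given these values, the half-width in (5) is the t quantile times the square root of S<sup>2</sup> times the sum of 1/N and the leverage of the row of interest.</p>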
      <p>The confidence interval for non-linear regression is built
on the basis of the interval (5) and inverse transformation (2)</p>
      <disp-formula><label>(6)</label><tex-math>\psi_1^{-1}\left[ \hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ \frac{1}{N} + (\mathbf{z}_X^{+})^T \left[ (\mathbf{Z}_X^{+})^T \mathbf{Z}_X^{+} \right]^{-1} (\mathbf{z}_X^{+}) \right\}^{1/2} \right].</tex-math></disp-formula>
      <p>The technique to build a prediction interval is based on
multivariate transformation (1), the inverse transformation
(2), the linear regression equation for normalized data (3) and a
prediction interval for normalized data</p>
      <disp-formula><label>(7)</label><tex-math>\hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + (\mathbf{z}_X^{+})^T \left[ (\mathbf{Z}_X^{+})^T \mathbf{Z}_X^{+} \right]^{-1} (\mathbf{z}_X^{+}) \right\}^{1/2}.</tex-math></disp-formula>
      <p>The prediction interval for non-linear regression is built on
the basis of the interval (7) and inverse transformation (2)</p>
      <disp-formula><label>(8)</label><tex-math>\psi_1^{-1}\left[ \hat{Z}_Y \pm t_{\alpha/2,\nu}\, S_{Z_Y} \left\{ 1 + \frac{1}{N} + (\mathbf{z}_X^{+})^T \left[ (\mathbf{Z}_X^{+})^T \mathbf{Z}_X^{+} \right]^{-1} (\mathbf{z}_X^{+}) \right\}^{1/2} \right].</tex-math></disp-formula>
      <p>III. THE JOHNSON NORMALIZING TRANSLATION</p>
      <p>For normalizing the multivariate non-Gaussian data, we
use the Johnson translation system. The Johnson normalizing
translation is given by</p>
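      <disp-formula><label>(9)</label><tex-math>\mathbf{Z} = \boldsymbol{\gamma} + \boldsymbol{\eta}\,\mathbf{h}\left[\boldsymbol{\lambda}^{-1}(\mathbf{X} - \boldsymbol{\varphi})\right] \sim N_m(\mathbf{0}_m, \boldsymbol{\Sigma}),</tex-math></disp-formula>
      <p>where <inline-formula><tex-math>\boldsymbol{\Sigma}</tex-math></inline-formula> is the covariance matrix;
<inline-formula><tex-math>m = k + 1</tex-math></inline-formula>;
<inline-formula><tex-math>\boldsymbol{\gamma}</tex-math></inline-formula>, <inline-formula><tex-math>\boldsymbol{\eta}</tex-math></inline-formula>,
<inline-formula><tex-math>\boldsymbol{\varphi}</tex-math></inline-formula> and <inline-formula><tex-math>\boldsymbol{\lambda}</tex-math></inline-formula>
are parameters of translation (9);
<inline-formula><tex-math>\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_m)^T</tex-math></inline-formula>;
<inline-formula><tex-math>\boldsymbol{\eta} = \mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_m)</tex-math></inline-formula>;
<inline-formula><tex-math>\boldsymbol{\varphi} = (\varphi_1, \varphi_2, \ldots, \varphi_m)^T</tex-math></inline-formula>;
<inline-formula><tex-math>\boldsymbol{\lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)</tex-math></inline-formula>;
<inline-formula><tex-math>\mathbf{h}[(y_1, \ldots, y_m)] = \{h_1(y_1), \ldots, h_m(y_m)\}^T</tex-math></inline-formula>;
<inline-formula><tex-math>h_i(\cdot)</tex-math></inline-formula> is one of the translation functions</p>
      <disp-formula><label>(10)</label><tex-math><![CDATA[h =
\begin{cases}
\ln(y), & \text{for the } S_L \text{ (log normal) family;} \\
\ln[y/(1 - y)], & \text{for the } S_B \text{ (bounded) family;} \\
\mathrm{Arsh}(y), & \text{for the } S_U \text{ (unbounded) family;} \\
y, & \text{for the } S_N \text{ (normal) family.}
\end{cases}]]></tex-math></disp-formula>
      <p>Here <inline-formula><tex-math>y = (x - \varphi)/\lambda</tex-math></inline-formula>;
<inline-formula><tex-math>\mathrm{Arsh}(y) = \ln\left(y + \sqrt{y^2 + 1}\right)</tex-math></inline-formula>.</p>
      <p>IV. THE EQUATION, CONFIDENCE AND PREDICTION INTERVALS OF NON-LINEAR REGRESSION TO ESTIMATE THE SOFTWARE SIZE</p>
      <p>The equation, confidence and prediction intervals of
non-linear regression to estimate the software size of open-source
PHP-based systems are constructed on the basis of the
Johnson multivariate normalizing transformation for the
four-dimensional non-Gaussian data set: actual software size in
thousand lines of code (KLOC) <inline-formula><tex-math>Y</tex-math></inline-formula>,
the total number of classes <inline-formula><tex-math>X_1</tex-math></inline-formula>,
the total number of relationships <inline-formula><tex-math>X_2</tex-math></inline-formula> and
the average number of attributes per class <inline-formula><tex-math>X_3</tex-math></inline-formula>
in the conceptual data model, from 32 information systems developed using the PHP
programming language with HTML and SQL. Table I
contains the data from [1] on four metrics of software for 32
open-source PHP-based systems.</p>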
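      <p>The translation functions in (10), applied to one component of (9), can be sketched in Python as follows. This is a minimal illustration rather than the authors' code; the function names <monospace>johnson_h</monospace> and <monospace>johnson_z</monospace> and the string codes used to select a family are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import math


def johnson_h(y, family):
    """Translation function h(y) for one Johnson family, as listed in (10)."""
    if family == "SL":  # log normal family, defined for y > 0
        if y <= 0.0:
            raise ValueError("SL requires y > 0")
        return math.log(y)
    if family == "SB":  # bounded family, defined for 0 < y < 1
        if not 0.0 < y < 1.0:
            raise ValueError("SB requires 0 < y < 1")
        return math.log(y / (1.0 - y))
    if family == "SU":  # unbounded family: Arsh(y) = ln(y + sqrt(y^2 + 1))
        return math.asinh(y)
    if family == "SN":  # normal family: identity
        return y
    raise ValueError("unknown family: " + family)


def johnson_z(x, gamma, eta, phi, lam, family):
    """One component of (9): z = gamma + eta * h((x - phi) / lam)."""
    return gamma + eta * johnson_h((x - phi) / lam, family)
```
]]></preformat>
      <p>For the SB family the argument must satisfy 0 &lt; y &lt; 1, which is why every original variable has to lie strictly between its location parameter and the sum of its location and scale parameters.</p>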
      <p>To detect outliers in the data from Table I, we use
the technique based on multivariate normalizing
transformations and the squared Mahalanobis distance [6].
There are no outliers in the data from Table I at the 0.005
significance level for the Johnson multivariate
transformation (9) of the SB family. The same result was
obtained in [6] for the transformation (9) of the SU family. In
[1] it was also assumed that the data contain no outliers.</p>
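      <p>Parameters of the multivariate transformation (9) for the SB
family were estimated by the maximum likelihood method.
The estimators for the parameters of transformation (9) are:
<inline-formula><tex-math>\hat{\gamma}_Y = 9.63091</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\gamma}_1 = 15.5355</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\gamma}_2 = 25.4294</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\gamma}_3 = 0.72801</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_Y = 1.05243</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_1 = 1.58306</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_2 = 2.54714</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_3 = 0.54312</tex-math></inline-formula>, …</p>
      <p>… transformation is less than 0.25. Although all values of
PRED(0.25) in Table III are less than 0.75, the values
are greater for Eq. (12). All values of the multiple
coefficient of determination R<sup>2</sup> in Table III are greater
than 0.75, but the value of R<sup>2</sup> is greater for Eq. (12) on the
basis of the multivariate transformation.</p>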
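      <p>The accuracy measures compared here can be computed as in the sketch below. It assumes the usual definitions, which are not spelled out in the text: the magnitude of relative error of one observation is |Y − Ŷ|/Y, MMRE is its mean, and PRED(0.25) is the share of observations whose relative error does not exceed 0.25. The function names and the sample numbers in the usage note are hypothetical and do not come from Table I.</p>
      <preformat><![CDATA[
```python
def mre(actual, predicted):
    """Magnitude of relative error |y - y_hat| / y for each observation."""
    return [abs(a - p) / a for a, p in zip(actual, predicted)]


def mmre(actual, predicted):
    """Mean magnitude of relative error."""
    errors = mre(actual, predicted)
    return sum(errors) / len(errors)


def pred(actual, predicted, level=0.25):
    """Share of observations whose relative error is at most `level`."""
    errors = mre(actual, predicted)
    return sum(1 for e in errors if e <= level) / len(errors)
```
]]></preformat>
      <p>For example, with hypothetical actual sizes 10, 20, 40, 50 and predictions 11, 15, 50, 30, the relative errors are 0.1, 0.25, 0.25 and 0.4, so MMRE is 0.25 and PRED(0.25) is 0.75.</p>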
      <p> 0.1574 0.1345 0.0554 1.0000 </p>
      <p>After normalizing the non-Gaussian data by the
multivariate transformation (9) for the SB family, the linear
regression equation (3) is built for the normalized data</p>
      <disp-formula><label>(11)</label><tex-math>\hat{Z}_Y = \hat{b}_0 + \hat{b}_1 Z_1 + \hat{b}_2 Z_2 + \hat{b}_3 Z_3 .</tex-math></disp-formula>
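      <p>Estimators for the parameters of Eq. (11) are such: …</p>
      <p>where <inline-formula><tex-math>\hat{Z}_Y</tex-math></inline-formula> is the prediction result by Eq. (11),</p>
      <disp-formula><tex-math><![CDATA[Z_j = \gamma_j + \eta_j \ln\frac{X_j - \varphi_j}{\varphi_j + \lambda_j - X_j}, \quad \varphi_j < X_j < \varphi_j + \lambda_j, \quad j = 1, 2, 3 .]]></tex-math></disp-formula>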
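      <p>For the SB family the inverse <inline-formula><tex-math>\psi_1^{-1}</tex-math></inline-formula> in (4) has a closed form. As a sketch, assuming the response <inline-formula><tex-math>Y</tex-math></inline-formula> is translated with its own parameters <inline-formula><tex-math>\gamma_Y, \eta_Y, \varphi_Y, \lambda_Y</tex-math></inline-formula> in the same way as the regressors, solving the SB relation for the original variable gives</p>
      <disp-formula><tex-math><![CDATA[\begin{aligned}
\frac{\hat{Z}_Y - \gamma_Y}{\eta_Y} &= \ln\frac{\hat{Y} - \varphi_Y}{\varphi_Y + \lambda_Y - \hat{Y}}
\;\Longrightarrow\;
(\hat{Y} - \varphi_Y)\left(1 + e^{(\hat{Z}_Y - \gamma_Y)/\eta_Y}\right) = \lambda_Y\, e^{(\hat{Z}_Y - \gamma_Y)/\eta_Y}, \\
\hat{Y} &= \varphi_Y + \frac{\lambda_Y}{1 + e^{-(\hat{Z}_Y - \gamma_Y)/\eta_Y}} .
\end{aligned}]]></tex-math></disp-formula>
      <p>The result always lies strictly between <inline-formula><tex-math>\varphi_Y</tex-math></inline-formula> and <inline-formula><tex-math>\varphi_Y + \lambda_Y</tex-math></inline-formula>, whatever the value of <inline-formula><tex-math>\hat{Z}_Y</tex-math></inline-formula>.</p>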
      <p>The prediction results by Eq. (12) for the values of the
components of the vector <inline-formula><tex-math>\mathbf{X} = \{X_1, X_2, X_3\}</tex-math></inline-formula> from Table I are
shown in Table II for two cases: univariate and
multivariate normalizing transformations.</p>
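      <p>For the univariate normalizing transformations (10) of the SB
family the estimators for the parameters are:
<inline-formula><tex-math>\hat{\gamma}_Y = 0.77502</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\gamma}_1 = 0.59473</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\gamma}_2 = 0.57140</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\gamma}_3 = 0.68734</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_Y = 0.44395</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_1 = 0.48171</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_2 = 0.49553</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\eta}_3 = 0.51970</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\varphi}_Y = 2.063</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\varphi}_1 = 2.900</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\varphi}_2 = 0.900</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\varphi}_3 = 3.304</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\lambda}_Y = 83.059</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\lambda}_1 = 36.695</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{\lambda}_2 = 23.525</tex-math></inline-formula> and
<inline-formula><tex-math>\hat{\lambda}_3 = 13.660</tex-math></inline-formula>.
In the case of univariate normalizing transformations the estimators for
the parameters of Eq. (11) are:
<inline-formula><tex-math>\hat{b}_0 = 3.11 \cdot 10^{-7}</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{b}_1 = 0.43519</tex-math></inline-formula>,
<inline-formula><tex-math>\hat{b}_2 = 0.52239</tex-math></inline-formula> and
<inline-formula><tex-math>\hat{b}_3 = 0.08546</tex-math></inline-formula>.</p>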
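      <p>With the univariate estimates quoted above, the whole prediction chain (normalize each regressor, apply Eq. (11), map back to KLOC) can be sketched as follows. The parameter values are those given in the text for the univariate case; the function names, the dictionary layout and the input values in the usage note are hypothetical, and the inverse translation is the algebraic inverse of the SB function, assumed to be applied to the response with its own parameters.</p>
      <preformat><![CDATA[
```python
import math

# Univariate SB estimates quoted in the text: (gamma, eta, phi, lambda).
PARAMS = {
    "Y": (0.77502, 0.44395, 2.063, 83.059),
    "X1": (0.59473, 0.48171, 2.900, 36.695),
    "X2": (0.57140, 0.49553, 0.900, 23.525),
    "X3": (0.68734, 0.51970, 3.304, 13.660),
}
# Estimates b0..b3 of Eq. (11) for the univariate case.
B = (3.11e-7, 0.43519, 0.52239, 0.08546)


def sb_forward(x, gamma, eta, phi, lam):
    """SB translation z = gamma + eta * ln((x - phi) / (phi + lam - x))."""
    if not phi < x < phi + lam:
        raise ValueError("x must lie strictly between phi and phi + lam")
    return gamma + eta * math.log((x - phi) / (phi + lam - x))


def sb_inverse(z, gamma, eta, phi, lam):
    """Algebraic inverse of sb_forward; the result lies in (phi, phi + lam)."""
    return phi + lam / (1.0 + math.exp(-(z - gamma) / eta))


def predict_kloc(x1, x2, x3):
    """Predicted size in KLOC for classes x1, relationships x2, attributes per class x3."""
    z = [sb_forward(x, *PARAMS[name]) for name, x in (("X1", x1), ("X2", x2), ("X3", x3))]
    z_hat = B[0] + sum(b * zj for b, zj in zip(B[1:], z))
    return sb_inverse(z_hat, *PARAMS["Y"])
```
]]></preformat>
      <p>A call such as <monospace>predict_kloc(10.0, 8.0, 6.0)</monospace> uses made-up inputs inside the admissible ranges; inputs outside a regressor's bounded range raise an error, because the SB translation is undefined there.</p>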
      <p>The confidence and prediction intervals of non-linear
regression are defined by (6) and (8) respectively for the data
from Table I.</p>
      <p>… on the basis of univariate and multivariate transformations
respectively for the 0.05 significance level.</p>
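      <p>Intervals (6) and (8) are obtained by mapping the end points of the normalized intervals (5) and (7) back through the inverse translation. The sketch below assumes the SB family for the response and takes the t quantile, the residual standard deviation and the leverage term as inputs computed elsewhere; the function names and all numeric inputs in the usage note, other than the univariate SB estimates for Y quoted in the text, are hypothetical.</p>
      <preformat><![CDATA[
```python
import math

# Univariate SB estimates for Y quoted in the text: (gamma, eta, phi, lambda).
Y_PARAMS = (0.77502, 0.44395, 2.063, 83.059)


def sb_inverse(z, gamma, eta, phi, lam):
    """Inverse SB translation; the result lies strictly inside (phi, phi + lam)."""
    return phi + lam / (1.0 + math.exp(-(z - gamma) / eta))


def intervals(z_hat, t_quantile, s_zy, n, leverage, y_params):
    """Return (confidence interval, prediction interval) on the original scale.

    leverage is the quadratic form z^T [(Z^T Z)]^{-1} z of the centered row.
    """
    half_ci = t_quantile * s_zy * math.sqrt(1.0 / n + leverage)        # as in (5)
    half_pi = t_quantile * s_zy * math.sqrt(1.0 + 1.0 / n + leverage)  # as in (7)

    def back(z):
        return sb_inverse(z, *y_params)

    # The inverse is monotonically increasing, so end points map to end points.
    return ((back(z_hat - half_ci), back(z_hat + half_ci)),
            (back(z_hat - half_pi), back(z_hat + half_pi)))
```
]]></preformat>
      <p>For instance, <monospace>intervals(0.3, 2.048, 0.4, 32, 0.1, Y_PARAMS)</monospace> uses N = 32 and an assumed two-sided 0.05-level t quantile for 28 degrees of freedom, with made-up values for the other inputs. Because the inverse SB translation cannot return a value below its location parameter, the lower bounds stay positive whenever that parameter is positive.</p>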
      <p>Note that the lower bounds of the prediction interval of linear
regression from [1] are negative for thirteen rows of data:
1, 8, 9, 14, 15, 17, 19, 23, 24, 26, 27, 29 and 31. All the lower
bounds of the prediction intervals of the non-linear regressions are
positive. The widths of the prediction interval of non-linear
regression on the basis of the Johnson multivariate
transformation are less than those for linear regression from [1] for
twenty rows of data: 1, 6, 8, 9, 14-20, 22-27, 29, 31 and
32. Also, the widths of the prediction interval of non-linear
regression on the basis of the Johnson multivariate
transformation are less than those based on the Johnson univariate
transformation for twenty-three rows of data: 1-4, 6, 8-10,
15-18, 20-26, 28, 29, 31 and 32. Approximately the same
results are obtained for the confidence interval of non-linear
regression.</p>
      <p>… data on metrics of software from Table I and the normalized
data on the basis of the Johnson univariate and multivariate
transformations for the SB family. It is known that
<inline-formula><tex-math>\beta_2 = m(m + 2)</tex-math></inline-formula> holds under multivariate normality. The given
equality is a necessary condition for multivariate normality.
In our case <inline-formula><tex-math>\beta_2 = 24</tex-math></inline-formula>. The estimators of multivariate kurtosis
equal 28.66, 37.29 and 23.08 for the data from Table I, the
normalized data on the basis of the Johnson univariate and
multivariate transformations respectively. These values
indicate that the necessary condition for
multivariate normality is practically satisfied only for the
normalized data on the basis of the Johnson multivariate
transformation and does not hold for the other data.</p>
    </sec>
    <sec id="sec-3">
      <title>V. CONCLUSION</title>
      <p>The non-linear regression equation to estimate the software
size of open-source PHP-based systems is improved on the
basis of the Johnson multivariate transformation for the SB
family. This equation, in comparison with other regression
equations (both linear and non-linear), has a larger multiple
coefficient of determination and a smaller value of MMRE.</p>
      <p>When building the equations, confidence and prediction
intervals of non-linear regressions for multivariate
non-Gaussian data, one should use multivariate transformations.</p>
      <p>Poor normalization of multivariate non-Gaussian
data, or the application of univariate transformations instead of
multivariate ones to normalize such data, may increase
the width of the confidence and prediction intervals of
regressions, both linear and non-linear.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Hee Beng Kuan</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yuan</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hongyu</given-names>
            <surname>Zhang</surname>
          </string-name>
          , “
          <article-title>Estimating LOC for information systems from their conceptual data models</article-title>
          ”, in
          <source>Proceedings of the 28th International Conference on Software Engineering (ICSE '06)</source>
          , May 20-28,
          <year>2006</year>
          , Shanghai, China, pp.
          <fpage>321</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Matinee</given-names>
            <surname>Kiewkanya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Suttipong</given-names>
            <surname>Surak</surname>
          </string-name>
          , “
          <article-title>Constructing C++ software size estimation model from class diagram</article-title>
          ”, in
          <source>13th International Joint Conference on Computer Science and Software Engineering (JCSSE)</source>
          , July 13-15,
          <year>2016</year>
          , Khon Kaen, Thailand, pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Bates</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Watts</surname>
          </string-name>
          ,
          <article-title>Nonlinear regression analysis and its applications</article-title>
          . Wiley,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Ryan</surname>
          </string-name>
          ,
          <article-title>Modern regression methods</article-title>
          . Wiley,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Prykhodko</surname>
          </string-name>
          , “
          <article-title>Developing the software defect prediction models using regression analysis based on normalizing transformations</article-title>
          ”, in
          <source>Abstracts of the Research and Practice Seminar on Modern Problems in Testing of the Applied Software (PTTAS-2016)</source>
          , May 25-26,
          <year>2016</year>
          , Poltava, Ukraine, pp.
          <fpage>6</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Prykhodko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prykhodko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Makarova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pukhalevych</surname>
          </string-name>
          , “
          <article-title>Application of the squared Mahalanobis distance for detecting outliers in multivariate non-Gaussian data</article-title>
          ”, in
          <source>Proceedings of 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET)</source>
          , Lviv-Slavske, Ukraine, February 20-24,
          <year>2018</year>
          , pp.
          <fpage>962</fpage>
          -
          <lpage>965</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K. V.</given-names>
            <surname>Mardia</surname>
          </string-name>
          , “
          <article-title>Measures of multivariate skewness and kurtosis with applications”</article-title>
          , Biometrika,
          <volume>57</volume>
          ,
          <year>1970</year>
          , pp.
          <fpage>519</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>