
A Five-Factor Nonlinear Regression Model for JAVA Applications Size Estimation

Oleksandr Oriekhov

oleksandr.oriekhov@nuos.edu.ua

Tetyana Farionova

tetyana.farionova@nuos.edu.ua

Liubava Chernova

liubava.chernova@nuos.edu.ua

Mykhailo Vorona

mykhailo.vorona@nuos.edu.ua

Admiral Makarov National University of Shipbuilding, Heroes of Ukraine avenue, 9, Mykolaiv, 54007, Ukraine

The research proposes a five-factor nonlinear regression model for JAVA applications size estimation at the early stages of project planning, for further usage in parametric models for software development effort estimation. Accurate software development effort estimation is necessary for project planning: it supports risk assessment, identification of potential planning gaps, resource allocation, cost control, and the efficiency of the software development process. JAVA is one of the most widely used programming languages in the world and is actively used in the development of various software projects. The aim of the research is to improve the accuracy and reliability of JAVA-applications size estimation at the early stages of software project planning. To achieve this goal, existing equations and models for JAVA-application KLOC estimation were reviewed and compared. A dataset of code metrics of 571 open-source JAVA applications was collected using the CK static code analysis tool and split into learning and validation samples for model construction and validation. For the first time, a five-factor regression model is constructed using the total number of actual classes, the total number of interfaces, and the per-class averages of VMQ, TFQ, and CBO on the basis of a multivariate Box-Cox normalizing function. The model is constructed through an iterative process of detecting and removing anomalies from the sample. The constructed model is compared to the existing models by standard regression model quality criteria: the coefficient of determination $R^2$, the mean magnitude of relative error MMRE, and the percentage of prediction PRED(0.25). The estimates of $R^2$, MMRE, and PRED(0.25) for the latest iteration on the learning sample are 0.9759, 0.1276, and 0.9008, respectively, and the estimates on the initial learning and validation samples satisfy the thresholds, which indicates good accuracy and reliability of the constructed regression model.
The forecast interval was constructed for the regression and compared with that of the four-factor nonlinear regression. The study confirms that the accuracy and reliability of KLOC estimation for JAVA applications have been improved.

Keywords: software project management, application size estimation, nonlinear regression model, JAVA application, normalizing function, non-Gaussian data, multivariate Box-Cox normalizing function, multicollinearity, code metrics

1. Introduction

Software development effort estimation (SDEE) is one of the critical factors in effective project management. Accurate and reliable SDEE is necessary for project planning: it supports risk assessment, identification of potential planning gaps, resource allocation, cost control, and the efficiency of the software development process. Valid estimates reduce uncertainty and enable project managers to allocate resources optimally and limit unforeseen challenges. When estimates are close to the actual software development effort, the risk to the project goals is reduced, which leads to more predictable and controlled software development lifecycles. The application size can be represented in function points (FP) or in lines of code (KLOC, kilo lines of code). Both variants have their own pros and cons. Compared to function points, the KLOC application size reflects important environmental factors, including the programming language and the software category. Moreover, the KLOC metric is widely used in parametric models for SDEE, for example COCOMO, COCOMO II, SLIM, etc. [1,2].
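To illustrate why a size estimate in KLOC matters for parametric models, the following minimal sketch evaluates the Basic COCOMO effort equation, effort = a · KLOC^b (person-months). The coefficients are the published Basic COCOMO values for the three project modes; the function name and the sample sizes are illustrative assumptions and are not part of this study.

```python
# Basic COCOMO coefficients (a, b) for each project mode.
BASIC_COCOMO = {
    "organic": (2.4, 1.05),
    "semi-detached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def basic_cocomo_effort(kloc: float, mode: str = "organic") -> float:
    """Return the estimated effort in person-months for a size given in KLOC."""
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    a, b = BASIC_COCOMO[mode]
    return a * kloc ** b


if __name__ == "__main__":
    # Hypothetical sizes, only to show how a size error propagates to effort.
    for size in (10.0, 32.0, 100.0):
        print(size, round(basic_cocomo_effort(size), 1))
```

Because the exponent b is greater than one, a relative error in the KLOC estimate is slightly amplified in the effort estimate, which is one reason an accurate size model is useful upstream of such equations.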

Over the past 25 years, statistical data on software project success, reported by [3], indicate a moderate positive trend in the share of successfully completed projects. In 1994, only 16% of software projects were successfully implemented, meeting deadlines, staying within budget, and achieving all requirements. By 2020, the share of successfully implemented projects had risen to 35%. The percentage of failed projects decreased from 31% to 19% over the period from 1994 to 2020. Meanwhile, the share of challenged projects, which exceeded deadlines or budgets or failed to meet the declared requirements, has shown a minor downward trend of approximately 4%. In addition, the research [3] demonstrates a strong correlation between project size and the probability of failure, challenges, and risks during software project development. Larger software projects are more difficult to manage because they require more resources and wider coordination, and they have a higher probability of unforeseen obstacles. The CHAOS Report findings suggest that large software projects require more accurate estimation and risk management strategies to improve their chances of success.

The programming language JAVA is one of the most popular in software development [4] for diverse domains, including web-platforms, utility software, enterprise solutions, and information systems. The KLOC estimation of JAVA applications is necessary for effective project management to use resources efficiently, control costs, reduce risks, and increase confidence that JAVA applications are delivered on time and within budget.

Despite the rapid growth of the information technology industry, research on software project success rates shows that accurate SDEE remains a challenge. Recent studies indicate that the reliability and accuracy of application size estimation can be improved by considering environmental factors such as the programming language when using parametric models such as COCOMO II, COSYSMO, and others [1,2], which take the number of code lines as an input parameter. To ensure a high level of accuracy in application size estimation, it is essential to develop mathematical models that consider the specific characteristics of programming languages, including JAVA.

2. Literature Review

Regression equations and nonlinear regression (NR) models [5-12] have been constructed to estimate the size of JAVA applications on the basis of samples of open-source code metrics. The models depend on code metrics that can be obtained from a UML class diagram, such as the total number of classes (CLASS), the total number of methods (visual, public, static, etc.), the total number of class fields (private, public, protected, visual), coupling between objects (CBO), lack of cohesion (LCOM), response for class (RFC), etc. The combination of code metrics, the sample size and quality, and how well the sample represents the general population all affect the accuracy, reliability, and robustness of application size estimation. Overall, the estimation process takes code metrics from UML class diagrams and applies regression models to them for KLOC estimation.

A detailed review of previously constructed models for JAVA-application size estimation is given in [10-12] on the total sample of JAVA application metrics with 571 rows. The models [5-9] were compared by statistical quality assessment criteria for regression models: the coefficient of determination $R^2$, the mean magnitude of relative error MMRE, and the percentage of prediction for the magnitude of relative error (MRE) level 0.25, PRED(0.25) [13]. The review shows that the three-factor linear regression (LR) [5, 6] cannot be used because of the outdated learning sample, the non-Gaussian distribution of the learning and validation samples, and the heteroscedastic nature of the sample. The three-factor NR model [7], which also uses the sample of software code metrics from [5, 6], the one-factor NR model for JAVA web applications on the basis of the CLASS metric [8], and the four-factor NR model on the basis of CLASS, SMQ (static methods quantity), LCOM, and TFC [9] either fail to reach satisfactory accuracy according to the estimates of the regression model quality criteria, or cannot be evaluated on an extended sample of the metrics because of the restrictions of their normalizing functions.
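The three quality criteria named above can be sketched as follows. This is a minimal illustration using their usual definitions (MRE = |actual − predicted| / actual); the function names and the sample values are assumptions made for the example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination R^2."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


def mre(actual, predicted):
    """Magnitude of relative error for every observation."""
    return [abs(y - p) / y for y, p in zip(actual, predicted)]


def mmre(actual, predicted):
    """Mean magnitude of relative error."""
    errors = mre(actual, predicted)
    return sum(errors) / len(errors)


def pred(actual, predicted, level=0.25):
    """Share of observations whose MRE does not exceed `level`."""
    errors = mre(actual, predicted)
    return sum(1 for e in errors if e <= level) / len(errors)


if __name__ == "__main__":
    # Hypothetical KLOC values, for illustration only.
    actual = [10.0, 20.0, 40.0]
    predicted = [11.0, 25.0, 40.0]
    print(r_squared(actual, predicted), mmre(actual, predicted), pred(actual, predicted))
```

A higher $R^2$ and PRED(0.25) and a lower MMRE indicate a better model, which is how the estimates reported later in the paper should be read.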

Improved regression models for JAVA-application KLOC estimation are presented in [10-12]. The models from [10] use a restricted set of independent factors, which does not allow them to reach the required accuracy thresholds. The three-factor NR model on the basis of CLASS, RFC, and aVMQ (average visual methods quantity) [11] and the four-factor NR model on the basis of CLASS, RFC, aCBO (average CBO), and aVMQ [12] have good accuracy by the $R^2$, MMRE, and PRED(0.25) criteria, but both models rely on the RFC metric. The drawback of RFC is that it requires knowledge of all the individual messages initiated by a given class, since the metric calls for the computation of all methods potentially executed in response to a message received by an object of that class [14]. This requires information about internal class logic and algorithms, which a UML class diagram does not provide, and the metric counts all types of methods: public, protected, and private. In contrast to public and protected methods, which are defined as part of the class interface, private methods and their relationships with other methods are typically determined during development rather than at the system design stage.

The literature analysis shows that estimating the size of JAVA applications at the early stages of software project development remains a demanding scientific and practical task. It is therefore necessary to construct an NR model that improves the size estimation of JAVA applications using appropriate normalizing functions and quantitative software metrics that are available in UML class diagrams. The constructed model must be compared with the existing models [11, 12] using the $R^2$, MMRE, and PRED(0.25) quality criteria to confirm the accuracy and reliability of JAVA application size estimation on an independent validation sample.

3. Objectives of the Research

The aim of the research is to improve the reliability and accuracy of JAVA-applications size estimation at the early stages of software development project planning on the basis of quantitative software metrics from a conceptual data model by constructing an NR model.

The object of the research is the process of JAVA-applications size estimation.

The subject of the research is NR models for estimating the size of JAVA applications.

4. Materials and Methods

4.1. Nonlinear Regression Model Building Methodology and Methods

JAVA code metrics generally have a non-Gaussian distribution, which restricts the possibility of using linear regression models for accurate estimation. A fundamental requirement for applying linear regression models is that the regression residuals ε must have a Gaussian distribution, and the sample must be homoscedastic.

One of the most efficient methodologies for processing non-Gaussian data is based on applying an invertible normalizing transformation that converts the data into a normalized sample for further processing. Various methods can be used for this purpose, including the decimal logarithm, the square root transformation, univariate and multivariate Box-Cox transformations, and univariate or multivariate Johnson transformations. These normalizing transformations allow linear regression models to be constructed on the normalized data and then converted into NR models through the inverse transformation.
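As an illustration of an invertible normalizing transformation, the univariate Box-Cox transform and its inverse can be sketched as below. The function names are assumptions made for this example; the multivariate case applies the same formula to each component with its own parameter.

```python
import math


def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires a positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inverse_box_cox(z: float, lam: float) -> float:
    """Inverse Box-Cox transform; returns the value x with box_cox(x, lam) == z."""
    if lam == 0:
        return math.exp(z)
    base = 1.0 + lam * z
    if base <= 0:
        raise ValueError("z is outside the range of the transform")
    return base ** (1.0 / lam)


if __name__ == "__main__":
    # Round trip for a hypothetical metric value and several parameters.
    for lam in (-0.5, 0.0, 0.5):
        z = box_cox(120.0, lam)
        print(lam, z, inverse_box_cox(z, lam))
```

The positivity requirement on the input is the reason a metric that can be zero has to be shifted before the transform is applied.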

The NR model construction methodology is based on statistical analysis methods [15], namely on detecting and removing anomalies in the NR analysis of a non-Gaussian sample. It includes invertible normalizing transformation functions, the squared Mahalanobis distance (SMD) anomaly detection method, verification of the Gaussian distribution of the regression residuals ε, and construction of the forecast interval. The methodology recommends removing only one anomaly per iteration, as soon as it is detected. If an anomaly is detected, the methodology restarts from the first stage using the modified sample without the anomalies detected in the previous iterations. Otherwise, the NR model is considered successfully built. The stages of NR model construction are as follows.

The methodology begins with applying a normalizing transformation function to a non-Gaussian sample. The invertible normalizing function of a non-Gaussian random vector of the sample $P = \{Y, X_1, X_2, \ldots, X_k\}^T$ into a Gaussian random vector $T = \{Z_Y, Z_1, Z_2, \ldots, Z_k\}^T$ is given by

$$T = \psi(P), \tag{1}$$

where $k$ is the number of regression factors, and the inverse function of (1) is given by

$$P = \psi^{-1}(T), \tag{2}$$

where $\psi$ is a vector of invertible normalizing functions, $\psi = \{\psi_Y, \psi_1, \psi_2, \ldots, \psi_k\}^T$.

To normalize the multivariate non-Gaussian sample, we selected the multivariate invertible Box-Cox transformation, which is given by

$$Z_j = x(\lambda_j) = \begin{cases} (X_j^{\lambda_j} - 1)/\lambda_j, & \lambda_j \neq 0, \\ \ln X_j, & \lambda_j = 0, \end{cases} \tag{3}$$

where $\lambda_j$ is the Box-Cox transformation function parameter; $j$ changes from 1 to $k$; $X_j$ is the non-Gaussian random variable that is normalized; $Z_j$ is the Gaussian random variable. The dependent variable $Y$ is transformed in the same way with its own parameter $\lambda_Y$.

The maximum likelihood method with a logarithmic likelihood function was chosen to estimate the parameters of the normalizing function (1) [16].

In the 2nd stage, the Gaussian distribution of the normalized data is checked with the Mardia test [16], which uses measures of multivariate skewness ($\beta_1$) and kurtosis ($\beta_2$) of the multivariate sample. The sample is considered Gaussian if

$$N\beta_1/6 \le \chi^2_{\alpha,\nu}, \tag{7}$$

$$\beta_2 \le u_{1-\alpha}(\mu, \sigma^2), \tag{8}$$

where $\chi^2_{\alpha,\nu}$ is the quantile of the approximating Chi-Square distribution with $\nu = m(m + 1)(m + 2)/6$ degrees of freedom, $m = k + 1$ is the dimension of the normalized vector, $N$ is the sample size, and $\alpha$ is the significance level, which is 0.005 for the test.

For $\beta_2$, the threshold $u_{1-\alpha}(\mu, \sigma^2)$ is the $1 - \alpha$ quantile of the normal distribution with the mathematical expectation $\mu = m(m + 2)$ and variance $\sigma^2 = 8m(m + 2)/N$.

If both test conditions hold, the multivariate sample is considered Gaussian. Otherwise, anomalies in the sample are detected using the SMD method [15]. The SMD values are the elements on the main diagonal of the $N \times N$ matrix $d^2$:

$$d_i^2 = (Z_i - \bar{Z})^T S_N^{-1} (Z_i - \bar{Z}), \quad i = 1, 2, \ldots, N, \tag{9}$$

where $S_N$ is the biased covariance matrix of the sample (6), $Z_i$ is the $i$-th observation of the Gaussian random vector, and $\bar{Z}$ is the vector of sample means. An observation is detected as an anomaly if its $d_i^2$ exceeds the quantile of the Chi-Square distribution for the significance level $\alpha$.
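The SMD anomaly check described above can be sketched as follows. This is a minimal illustration under stated assumptions: NumPy is used for the linear algebra, the function names are invented for the example, and the threshold is passed in as a number (for instance, the Chi-Square quantile 18.55 that the experiment section reports for the six-dimensional sample at significance level 0.005) rather than computed.

```python
import numpy as np


def squared_mahalanobis(Z):
    """Squared Mahalanobis distance of every row of Z from the sample mean,
    using the biased (divide-by-N) covariance matrix."""
    Z = np.asarray(Z, dtype=float)
    centered = Z - Z.mean(axis=0)
    S = centered.T @ centered / Z.shape[0]
    S_inv = np.linalg.inv(S)
    # Row-wise quadratic form c_i^T S^{-1} c_i, i.e. the diagonal of the N x N matrix.
    return np.einsum("ij,jk,ik->i", centered, S_inv, centered)


def find_anomaly(Z, threshold):
    """Return the index of the row with the largest distance if it exceeds
    the threshold, otherwise None (one anomaly per iteration)."""
    d2 = squared_mahalanobis(Z)
    worst = int(np.argmax(d2))
    return worst if d2[worst] > threshold else None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=(200, 3))             # synthetic normalized data
    sample = np.vstack([sample, [25.0, 25.0, 25.0]])  # one planted outlier
    print(find_anomaly(sample, threshold=18.55))
```

Returning a single index matches the recommendation to remove one anomaly per iteration and then restart from the normalization stage, since removing a row changes both the estimated transformation parameters and the covariance matrix.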

The 3rd stage is dedicated to constructing a linear regression model for the Gaussian sample, which is given by the formula

$$Z_Y = \hat{Z}_Y + \varepsilon = \hat{b}_0 + \hat{b}_1 Z_1 + \hat{b}_2 Z_2 + \ldots + \hat{b}_k Z_k + \varepsilon, \tag{10}$$

where $\varepsilon$ is the Gaussian random variable, $\varepsilon \sim N(0, \sigma^2)$; $\hat{b}_0, \hat{b}_1, \hat{b}_2, \ldots, \hat{b}_k$ are the estimators of the parameters of the LR model (10). The estimated values are calculated by the method of least squares.
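A least-squares fit of a linear model on normalized data can be sketched as below. The use of `numpy.linalg.lstsq` and the function name are choices made for this illustration; the paper does not state which software was used.

```python
import numpy as np


def fit_linear_regression(Z, zy):
    """Fit zy = b0 + b1*Z[:, 0] + ... + bk*Z[:, k-1] by least squares.

    Returns the coefficient vector (b0 first) and the residuals.
    """
    Z = np.asarray(Z, dtype=float)
    zy = np.asarray(zy, dtype=float)
    design = np.column_stack([np.ones(len(zy)), Z])  # intercept column first
    coef, *_ = np.linalg.lstsq(design, zy, rcond=None)
    residuals = zy - design @ coef
    return coef, residuals


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Z = rng.normal(size=(100, 2))                  # synthetic normalized factors
    zy = 1.0 + 2.0 * Z[:, 0] - 3.0 * Z[:, 1]       # noise-free synthetic response
    coef, residuals = fit_linear_regression(Z, zy)
    print(coef)
```

The residuals returned here are the quantities whose normality is checked in the next stage.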

In the 4th stage, the normality of the distribution of the LR residuals is checked with the Pearson $\chi^2$ criterion at the significance level $\alpha = 0.01$. If the residuals are not normally distributed, the row $i$ with the largest absolute value of the residual $\varepsilon_i$ should be removed.

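In the 5th stage, the NR model is obtained by applying the inverse transformation function (2) to the LR model (10):

$$Y = \psi_Y^{-1}(\hat{Z}_Y + \varepsilon) = \psi_Y^{-1}\left(\hat{b}_0 + \hat{b}_1 \psi_1(X_1) + \hat{b}_2 \psi_2(X_2) + \ldots + \hat{b}_k \psi_k(X_k) + \varepsilon\right), \tag{11}$$

where $\psi_Y^{-1}$ is the inverse Box-Cox transformation function.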

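The 6th stage is dedicated to constructing the forecast interval $\hat{Y}_{PI}$ for the NR model (11). The forecast interval is based on the LR model (10) and is given by

$$\hat{Y}_{PI} = \psi_Y^{-1}\left(\hat{Z}_Y \pm t_{\alpha/2,\nu} S_{Z_Y} \left\{1 + \frac{1}{N} + (z_X^+)^T S_Z^{-1} (z_X^+)\right\}^{1/2}\right), \tag{12}$$

where $t_{\alpha/2,\nu}$ is a quantile of the Student's t-distribution with $\nu = N - k - 1$ degrees of freedom and significance level $\alpha/2$; $S_{Z_Y}^2 = \frac{1}{\nu} \sum_{i=1}^{N} (Z_{Y_i} - \hat{Z}_{Y_i})^2$; $z_X^+$ is the vector of centered values of the predictor variables of the sample, which is given by $\{Z_{1_i} - \bar{Z}_1, Z_{2_i} - \bar{Z}_2, \ldots, Z_{k_i} - \bar{Z}_k\}$; $S_Z$ is the $k \times k$ matrix

$$S_Z = \left[S_{Z_q Z_r}\right], \tag{13}$$

where $S_{Z_q Z_r} = \sum_{i=1}^{N} (Z_{q_i} - \bar{Z}_q)(Z_{r_i} - \bar{Z}_r)$, $q, r = 1, 2, \ldots, k$.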
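The forecast interval construction can be sketched as follows: the interval is computed in the normalized space with the usual prediction-interval formula for linear regression and then mapped back with the inverse Box-Cox transform of the dependent variable. This is a minimal illustration under stated assumptions: the function names are invented, NumPy is used for the linear algebra, and the Student quantile is passed in as a number (for example, 2.5960 as reported in the experiment section) instead of being computed.

```python
import numpy as np


def inverse_box_cox(z, lam):
    """Inverse Box-Cox transform for a scalar z."""
    return float(np.exp(z)) if lam == 0 else float((1.0 + lam * z) ** (1.0 / lam))


def forecast_interval(Z, zy, z_new, lam_y, t_quantile):
    """Return (lower, point, upper) in the original scale of the dependent variable.

    Z: N x k normalized factors, zy: N normalized responses,
    z_new: k normalized factor values of the new observation.
    """
    Z = np.asarray(Z, dtype=float)
    zy = np.asarray(zy, dtype=float)
    z_new = np.asarray(z_new, dtype=float)
    n, k = Z.shape
    design = np.column_stack([np.ones(n), Z])
    coef, *_ = np.linalg.lstsq(design, zy, rcond=None)
    residuals = zy - design @ coef
    nu = n - k - 1                                  # degrees of freedom
    s_zy = np.sqrt(residuals @ residuals / nu)      # residual standard deviation
    centered = Z - Z.mean(axis=0)
    S_Z = centered.T @ centered                     # sums of cross-products
    z_plus = z_new - Z.mean(axis=0)                 # centered new observation
    spread = np.sqrt(1.0 + 1.0 / n + z_plus @ np.linalg.inv(S_Z) @ z_plus)
    half_width = t_quantile * s_zy * spread
    center = coef[0] + coef[1:] @ z_new
    return tuple(inverse_box_cox(v, lam_y)
                 for v in (center - half_width, center, center + half_width))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(50, 2))                    # synthetic data
    zy = 3.0 + 0.5 * Z[:, 0] - 0.2 * Z[:, 1] + rng.normal(scale=0.1, size=50)
    print(forecast_interval(Z, zy, [0.1, -0.3], lam_y=-0.009, t_quantile=2.0))
```

Because the inverse transform is nonlinear, the resulting interval is generally not symmetric around the point estimate in the original scale.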

After the forecast interval is constructed, each actual value of the dependent variable is checked against it. If a value falls outside the forecast interval, it is considered an anomaly and must be removed from the learning sample.

4.2. Collecting and Processing of JAVA-Applications Code Metrics

The authors have collected a dataset containing code metrics from 571 open-source JAVA applications hosted on the GitHub platform (https://github.com). The code metrics were extracted using the CK static code analysis tool (https://github.com/mauricioaniche/ck). The dataset includes the following metrics: lines of code per project (KLOC), total number of application classes (CLASS), string value of class types (TYPE), VMQ, TFQ, and CBO. The data were processed to obtain the average values of the metrics per class, such as aVMQ, aCBO, aTFQ, etc. The CK tool reports abstract, open, final, and inner classes as well as interfaces under the single code metric name CLASS. None of the existing models [5-12] uses the actual quantities of classes and interfaces separately for model building. Therefore, the general CLASS metric is split by the TYPE metric into 2 different metrics: the total number of actual classes (CLS) and the total number of interfaces (INFC). All values of the INFC metric are incremented to make it suitable for the Box-Cox transformation function. For the construction of the mathematical model of JAVA-application code size estimation (KLOC), the CLS, INFC, VMQ, TFQ, and CBO metrics are chosen as independent factors. The multivariate sample is randomly split into learning and validation samples with sizes of 286 and 285 multivariate points, respectively. The considered metrics, except KLOC, can be obtained at the early stages of project planning from the conceptual data model of the application.
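The data preparation described above can be sketched as follows. This is an illustration only: the row keys (`type`, `vmq`, `tfq`, `cbo`) and the type label `interface` are hypothetical names chosen for the example and are not claimed to match the CK output columns, and the shift of INFC by exactly one is an assumption (the text only says that the values are incremented).

```python
import random


def project_metrics(class_rows):
    """Aggregate the per-class rows of one project into the model's factors.

    Each row is a dict with the hypothetical keys 'type', 'vmq', 'tfq', 'cbo'.
    """
    if not class_rows:
        raise ValueError("a project needs at least one class entity")
    total = len(class_rows)  # CLASS: classes and interfaces together
    interfaces = sum(1 for row in class_rows if row["type"] == "interface")
    return {
        "CLS": total - interfaces,
        "INFC": interfaces + 1,  # assumed shift by one, keeps the Box-Cox input positive
        "aVMQ": sum(row["vmq"] for row in class_rows) / total,
        "aTFQ": sum(row["tfq"] for row in class_rows) / total,
        "aCBO": sum(row["cbo"] for row in class_rows) / total,
    }


def split_sample(projects, n_learning, seed=0):
    """Randomly split the projects into learning and validation samples."""
    shuffled = list(projects)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_learning], shuffled[n_learning:]


if __name__ == "__main__":
    rows = [
        {"type": "class", "vmq": 4, "tfq": 2, "cbo": 3},
        {"type": "class", "vmq": 2, "tfq": 1, "cbo": 5},
        {"type": "interface", "vmq": 3, "tfq": 0, "cbo": 1},
    ]
    print(project_metrics(rows))
    learning, validation = split_sample(range(571), 286)
    print(len(learning), len(validation))
```

Fixing the random seed keeps the learning/validation split reproducible, which matters when models are compared on the same validation sample.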

In the next stage, the multivariate learning sample was analyzed for multicollinearity using variance inflation factors (VIFs) to assess the relationships between independent variables. For multivariate data with independent factors $X_i$, where $i = 1, 2, \ldots, k$, the VIFs are the diagonal elements of the inverse of the $k \times k$ correlation matrix of the factors. A VIF value that exceeds the threshold of 10 indicates a significant multicollinearity issue, while a value close to 10 suggests a potential risk of multicollinearity in mathematical model construction when anomalies are removed iteratively [17]. For the CLS, INFC, VMQ, TFQ, and CBO factors, the calculated VIFs were 9.3, 2.4, 4.9, 7.2, and 18.6, respectively, confirming the presence of multicollinearity among certain independent factors.

Using the NR model construction methodology described in Section 4.1 on the basis of the Box-Cox normalizing function, the sample was normalized, and the anomalies were iteratively removed. The anomaly removal process was completed in 32 iterations. At the last iteration, the VIFs were 22.1, 4.3, 15.1, 11.6, and 31.2, which confirms that the VIFs for CLS, INFC, VMQ, TFQ, and CBO increase during the iterative anomaly removal process.

To avoid problems with multicollinearity, the absolute values of the metrics VMQ, TFQ, and CBO were replaced with their average per CLASS values: aVMQ, aTFQ, and aCBO. For the first iteration, the VIFs were 2.1, 2.3, 1.7, 1.1, and 1.6 for the metrics CLS, INFC, aVMQ, aTFQ, and aCBO, respectively, confirming the absence of multicollinearity between the independent factors. Similarly, the sample was iteratively cleaned of anomalies in 35 iterations. For the last iteration, the VIFs were 3.3, 3.3, 1.3, 1.3, and 1.3, confirming the absence of multicollinearity during iterative anomaly removal.
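A common way to compute VIFs is to take the diagonal of the inverse correlation matrix of the factors. The sketch below illustrates this with NumPy on synthetic data; the function name and the data are assumptions made for the example.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)   # k x k correlation matrix of the factors
    return np.diag(np.linalg.inv(corr))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=1000)
    b = rng.normal(size=1000)
    c = a + b + 0.05 * rng.normal(size=1000)   # almost a linear combination of a and b
    print(variance_inflation_factors(np.column_stack([a, b])))
    print(variance_inflation_factors(np.column_stack([a, b, c])))
```

In the synthetic example the third column is nearly determined by the first two, so all three VIFs become large, which mirrors the effect of using absolute metric totals that grow together with the number of classes.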

5. Experiment

5.1. Constructing the Five-Factor Nonlinear Regression Model

To enhance the accuracy of early JAVA-application KLOC estimation, a five-factor NR model is built by the above methodology from the learning sample of 286 applications hosted on GitHub. The model is constructed in 35 iterations from the multivariate learning sample of the KLOC, CLS, INFC, aVMQ, aTFQ, and aCBO metrics using the multivariate Box-Cox normalizing function.

Before analyzing the six-dimensional learning sample for multivariate anomalies, we checked whether the multivariate data have a Gaussian distribution. The Mardia multivariate normality test, which is based on multivariate skewness ($\beta_1$) and kurtosis ($\beta_2$), was applied. The test shows that the distribution of the six-dimensional data $X_1$ (CLS), $X_2$ (INFC), $X_3$ (aVMQ), $X_4$ (aTFQ), $X_5$ (aCBO), and $Y$ (KLOC) of the learning sample is not Gaussian: the multivariate skewness estimate $N\beta_1/6 = 12393.52$ exceeds the Chi-Square quantile value 86.99 for 56 degrees of freedom and significance level $\alpha = 0.005$, and the multivariate kurtosis estimate $\beta_2 = 393.95$ exceeds the Gaussian distribution quantile, which is equal to 50.77 for mean 48 and variance 1.34.
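The Mardia statistics used in this check can be sketched as follows, using the standard definitions of multivariate skewness and kurtosis. The function name is an assumption made for the example, NumPy is used for the linear algebra, and the quantiles against which the statistics are compared are left to the caller.

```python
import numpy as np


def mardia_statistics(Z):
    """Return the skewness statistic N*b1/6, the kurtosis b2, and the
    parameters of their reference distributions for an N x m sample."""
    Z = np.asarray(Z, dtype=float)
    n, m = Z.shape
    centered = Z - Z.mean(axis=0)
    S = centered.T @ centered / n                      # biased covariance matrix
    G = centered @ np.linalg.inv(S) @ centered.T       # N x N matrix of scaled products
    b1 = (G ** 3).sum() / n ** 2                       # multivariate skewness
    b2 = (np.diag(G) ** 2).sum() / n                   # multivariate kurtosis
    return {
        "skewness_statistic": n * b1 / 6.0,
        "kurtosis": b2,
        "chi2_dof": m * (m + 1) * (m + 2) // 6,
        "kurtosis_mean": m * (m + 2),
        "kurtosis_variance": 8.0 * m * (m + 2) / n,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=(286, 6))   # synthetic six-dimensional Gaussian data
    print(mardia_statistics(sample))
```

For a six-dimensional sample the sketch gives 56 degrees of freedom and a kurtosis mean of 48, which agrees with the values quoted in the text.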

We used the statistical methodology based on the multivariate normalizing transformation and the SMD for normalized data. The six-variate Box-Cox transformation (3) was iteratively applied to the learning sample for normalizing. The parameter estimates of the six-variate Box-Cox transformation are calculated by the maximum likelihood method (4, 5). In total, 28 anomalies were iteratively detected and removed from the learning sample using the SMD method because their $d_i^2$ values were greater than the threshold value 18.55 of the Chi-Square distribution quantile for the significance level $\alpha = 0.005$. Once anomalies were found, the anomaly with the maximum $d_i^2$ value was removed, and the model construction process started from the beginning. The parameter estimates of the six-variate Box-Cox normalizing transformation for the latest iteration are $\hat{\lambda}_Y = -9.165027 \cdot 10^{-3}$, $\hat{\lambda}_1 = 7.384604 \cdot 10^{-3}$, $\hat{\lambda}_2 = 2.758452 \cdot 10^{-2}$, $\hat{\lambda}_3 = -7.881872 \cdot 10^{-2}$, $\hat{\lambda}_4 = 0.377377$, and $\hat{\lambda}_5 = 0.707875$.

At the latest iteration, the six-dimensional learning sample ($N = 252$) was checked for Gaussian distribution with the Mardia multivariate normality test. According to the test, the distribution of the six-dimensional data is Gaussian: the multivariate skewness estimate $N\beta_1/6 = 85.92$ does not exceed the Chi-Square quantile value 86.99 for 56 degrees of freedom and significance level $\alpha = 0.005$, and the multivariate kurtosis estimate $\beta_2 = 47.01$ does not exceed the Gaussian distribution quantile, which is equal to 50.86 for mean 48 and variance 1.23.

In the third stage, the five-factor linear regression (10) is constructed using the learning sample. The estimators $\hat{b}_0$, $\hat{b}_1$, $\hat{b}_2$, $\hat{b}_3$, $\hat{b}_4$, and $\hat{b}_5$ are calculated by the least squares method, and the estimates for the latest iteration are $-4.202982$, $0.921816$, $1.390376 \cdot 10^{-2}$, $0.933162$, $0.231490$, and $-2.686832 \cdot 10^{-2}$, respectively.

In the fourth stage, the normality of the distribution of the regression residuals of the LR model was verified for 258 rows of the learning sample. After 3 rows were removed from the sample, the observed frequency distribution of the residual values in (10) is consistent with the Gaussian distribution; with these 3 rows included, the values of the Chi-Square test were higher than the quantile 15.09 of the Chi-Square distribution for 6 degrees of freedom and significance level $\alpha = 0.01$.

In the fifth stage, the NR model (11) is constructed by applying the inverse transformation (2) to the LR model (10). Then, in the sixth stage, the forecast interval of the NR model is constructed by (12). In this stage, 3 anomalies were detected and removed from the learning sample. For the latest iteration, the inverse covariance matrix $S_Z^{-1}$ of (13) includes the entries $-8.7586 \cdot 10^{-4}$, $4.8178 \cdot 10^{-4}$, $-6.0266 \cdot 10^{-3}$, $-3.6828 \cdot 10^{-3}$, and $7.5035 \cdot 10^{-3}$. For the obtained forecast interval, the normalized sample means $\bar{Z}_1$, $\bar{Z}_2$, $\bar{Z}_3$, $\bar{Z}_4$, and $\bar{Z}_5$ are 6.59985, 4.12350, 1.46601, 0.90387, and 3.58194, respectively. The quantile $t_{\alpha/2,\nu} = 2.5960$ for significance level $\alpha = 0.01$ and 246 degrees of freedom; $S_{Z_Y} = 0.154424$.

The constructed NR model is limited to estimating the KLOC of JAVA applications with the following restrictions on the factors: $X_1$ is from 33 to 10610, $X_2$ is from 1 to 1562, $X_3$ is from 2.34 to 12.14, $X_4$ is from 0.5384 to 5.64, and $X_5$ is from 2.1 to 10.43.
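Putting the reported estimates together, the point estimate of the five-factor model (11) with ε set to zero can be sketched as below. The parameter values and factor ranges are those reported in this section; the function names, the input values in the usage example, and the assumption that INFC is the interface count shifted by one are illustrative and not confirmed by the source.

```python
import math

# Box-Cox parameter estimates for Y and the five factors (latest iteration).
LAMBDAS = {"Y": -9.165027e-3, "X1": 7.384604e-3, "X2": 2.758452e-2,
           "X3": -7.881872e-2, "X4": 0.377377, "X5": 0.707875}
# Linear regression estimates b0..b5 for the normalized data.
COEFFS = (-4.202982, 0.921816, 1.390376e-2, 0.933162, 0.231490, -2.686832e-2)
# Admissible ranges of the factors for which the model was built.
RANGES = {"X1": (33, 10610), "X2": (1, 1562), "X3": (2.34, 12.14),
          "X4": (0.5384, 5.64), "X5": (2.1, 10.43)}


def box_cox(x, lam):
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    return math.exp(z) if lam == 0 else (1.0 + lam * z) ** (1.0 / lam)


def predict_kloc(classes, interfaces, avmq, atfq, acbo):
    """Point estimate of KLOC from CLS, interface count, aVMQ, aTFQ, aCBO."""
    factors = {"X1": classes, "X2": interfaces + 1,  # assumed shift of INFC by one
               "X3": avmq, "X4": atfq, "X5": acbo}
    for name, value in factors.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the model range [{low}, {high}]")
    z_hat = COEFFS[0] + sum(
        b * box_cox(factors[name], LAMBDAS[name])
        for b, name in zip(COEFFS[1:], factors)
    )
    return inverse_box_cox(z_hat, LAMBDAS["Y"])


if __name__ == "__main__":
    # Hypothetical project, for illustration only.
    print(predict_kloc(classes=629, interfaces=49, avmq=4.75, atfq=2.18, acbo=5.95))
```

The range check reflects the restriction stated above: outside these intervals the model is extrapolating and its accuracy estimates no longer apply.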

The constructed model's accuracy is verified with the $R^2$, MMRE, and PRED(0.25) regression model quality criteria on the learning sample ($N = 252$) without the 34 anomalies that were detected and removed during the iterative model construction process. The estimates of $R^2$, MMRE, and PRED(0.25) are 0.9759, 0.1276, and 0.9008, respectively, which indicates a high level of prediction accuracy of the model.

Moreover, to estimate the size of JAVA applications, we constructed a four-factor NR model with the factors $X_1$ (CLASS, the total number of classes and interfaces), $X_2$ (aVMQ), $X_3$ (aTFQ), and $X_4$ (aCBO) on the basis of the multivariate Box-Cox normalizing transformation for the same learning sample. The four-factor NR model has the form (11) with the following estimates of the Box-Cox parameters: $\hat{\lambda}_Y = 4.072805 \cdot 10^{-3}$, $\hat{\lambda}_1 = 4.1431 \cdot 10^{-2}$, $\hat{\lambda}_2 = -0.143596$, $\hat{\lambda}_3 = 0.334377$, and $\hat{\lambda}_4 = 0.651922$. The parameters of the four-factor linear regression model for the normalized sample are $\hat{b}_0 = -3.918828$, $\hat{b}_1 = 0.782652$, $\hat{b}_2 = 1.037247$, $\hat{b}_3 = 0.283308$, and $\hat{b}_4 = -4.834321 \cdot 10^{-2}$.

For the latest iteration, the inverse covariance matrix $S_Z^{-1}$ of (13) is given by [matrix missing in the source].

The constructed model's accuracy was verified with the $R^2$, MMRE, and PRED(0.25) regression model quality criteria on the learning sample without the 30 anomalies that were detected and excluded during the model construction process. The estimates of $R^2$, MMRE, and PRED(0.25) are 0.9719, 0.1345, and 0.8750, respectively, which indicates a high prediction accuracy of the model.

5.2. Comparing the Quality and the Accuracy of Size Estimation of the Regression Models

The obtained five-factor NR model is verified with the regression model quality criteria to assess its predictive accuracy and reliability on the initial learning and validation samples, and is compared with the existing three-factor NR model [11] and the four-factor NR model on the basis of RFC [12]. The estimates of $R^2$, MMRE, and PRED(0.25) are displayed in Table 1.

The accuracy estimates of the constructed five-factor model (MMRE and PRED(0.25) for the initial learning sample and $R^2$ for the validation sample) demonstrate that using the average values of the VMQ, TFQ, and CBO metrics gives better results in estimating the KLOC of JAVA applications than all existing and previously constructed regression models that use the RFC metric. The remaining $R^2$ and PRED(0.25) estimates confirm the good quality of the constructed five-factor regression model.

The forecast interval of the five-factor regression is compared with those of the four-factor regressions. The comparison indicates that the five-factor NR forecast interval is 21.4% narrower than the forecast interval of the RFC-based four-factor NR and 9.8% narrower than the forecast interval (12) of the four-factor NR (CLASS, aVMQ, aTFQ, aCBO) on the basis of the learning sample. The forecast intervals are visualized in Figure 1. The chart confirms that the forecast interval of the constructed five-factor nonlinear regression is narrower than those of the existing models.

Figure 1: Comparison of the four-factor nonlinear model on the RFC basis, the four-factor nonlinear model (11), and the five-factor nonlinear model (11). LB and F5+UB are shown for the five-factor regression (11). The actual JAVA-application KLOC size is the solid red line.

6. Conclusion

The obtained five-factor nonlinear regression model on the basis of the multivariate Box-Cox normalizing transformation enhances the accuracy and reliability of JAVA-applications size estimation compared to existing models by using five independent factors: CLS, INFC, aVMQ, aTFQ, and aCBO, which are available in a UML class diagram. The model is built without the quality metric RFC, which cannot be accurately obtained from the UML class diagram at the early stage of project planning. Splitting the total number of classes and interfaces into separate metrics improves the accuracy of the model by the $R^2$, MMRE, and PRED(0.25) estimates on the learning and validation samples compared to the three- and four-factor regression models. The estimates of $R^2$, MMRE, and PRED(0.25) satisfy the thresholds: they are 0.9257, 0.1536, and 0.8427, respectively, for the initial learning sample and 0.9185, 0.1757, and 0.7719 for the validation sample. The forecast interval is 21.4% narrower than that of the four-factor nonlinear regression on the basis of the CLASS, RFC, aVMQ, and aCBO metrics and 9.8% narrower than that of the four-factor nonlinear regression on the basis of the CLASS, aVMQ, aTFQ, and aCBO metrics, which indicates good accuracy and reliability of the model.

The scientific novelty of the research is that, for the first time, a five-factor nonlinear regression model for multivariate non-Gaussian data is constructed using the six-variate Box-Cox transformation and quantitative metrics that are available at the early project planning stage from a UML class diagram. For the first time, the total number of class entities is split into the actual number of classes and the number of interfaces for regression model construction. For the first time, the multicollinearity issue is solved in the iterative regression model construction process on the basis of CLS, INFC, and the averages of VMQ, TFQ, and CBO. Compared to other nonlinear regression models, the model has a higher determination coefficient $R^2$, a lower mean magnitude of relative error MMRE, and a higher PRED(0.25), and its forecast interval is 21.4% narrower than that of the four-factor nonlinear regression on the basis of the CLASS, RFC, aVMQ, and aCBO metrics and 9.8% narrower than that of the four-factor nonlinear regression on the basis of the CLASS, aVMQ, aTFQ, and aCBO metrics. The findings demonstrate that splitting the combined classes-and-interfaces metric into 2 separate metrics allows nonlinear regression models to be built with higher accuracy and reliability, according to the estimates of the quality criteria and the forecast interval width.

The practical significance of the obtained results is that the model can be recommended for use in practice for further software development effort estimation with parametric models such as COCOMO II, COSYSMO, etc. The proposed model is implemented as a software service that can be used by project managers for JAVA-applications SDEE at the early stages of project planning to reduce costs and manage risks.

Prospects for further research. The obtained five-factor NR model could be improved by extending the learning sample with code metrics from commercial projects.

Declaration on Generative AI

During the preparation of this work, the authors used ChatGPT-4o and Grammarly tools in order to: Grammar and spelling check. After using these tool(s)/service(s), the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

References

[1] S. McConnell, Software Estimation: Demystifying the Black Art, Microsoft Press, Redmond, Washington, USA, 2006.

[2] S. W. Munialo, A Review of Agile Software Effort Estimation Methods, volume 5 of International Journal of Computer Applications Technology and Research, Association of Technology and Science, 2016, pp. 612-618. doi:10.7753/IJCATR0509.1009.

[3] The Standish Group, Chaos report 2015, 2015. URL: https://standishgroup.com/sample_research_files/CHAOSReport2015-Final.pdf

[4] TIOBE, TIOBE Index, 2024. URL: https://www.tiobe.com/tiobe-index/.

[5] H. B. K. Tan, Y. Zhao, H. Zhang, Estimating LOC for information systems from their conceptual data models, in: Proceedings of 28th. International Conference on Software Engineering, 2006, pp. 321-330. doi:10.1145/1134285.1134331.

[6] H. B. K. Tan, Y. Zhao, H. Zhang, Conceptual Data Model-Based Software Size Estimation for Information Systems, volume 19 of ACM Transactions of Software Engineering and Methodology, 2009. doi:10.1145/1571629.1571630.

[7] N. V. Prykhodko, S. B. Prykhodko, A non-linear regression model for estimation of the size of JAVA enterprise information systems software, volume 85 of Modeling and Information Technologies, 2018, pp. 81-88. URL: http://nbuv.gov.ua/UJRN/Mtit_2018_85_14.

[8] L. M. Makarova, N. V. Prykhodko, O. O. Kudin, Constructing the non-linear regression model for size estimation of WEB-applications implemented in JAVA, volume 69 of Herald (Kherson National Technical University), 2019, pp. 145-153.

[9] S. B. Prykhodko, N. V. Prykhodko, T. G. Smykodub, Four-factor nonlinear regression model to estimate the size of open source JAVA-based applications, volume 70 of Scientific Notes of Taurida National V.I. Vernadsky University. Series: Technical Sciences, 2020, pp. 157-162. doi:10.32838/2663-5941/2020.2-1/25.

[10] O. Oriekhov, T. Farionova, Mathematical models for the size estimating of JAVA applications, volume 89 of Herald (KNTU), 2024, pp. 196-203. doi:10.35546/kntu2078-4481.2024.2.28.

[11] O. Oriekhov, T. Farionova, L. Chernova, Three-factor nonlinear regression model of estimating the size of JAVA-software, in: Proceeding of 12th. Information Control Systems & Technologies, Odesa, Ukraine, 2024. URL: https://ceur-ws.org/Vol-3790/paper44.pdf.

[12] O. Oriekhov, The four-factor nonlinear regression model for early JAVA-applications size estimation, in: N. Aksak, D. Antonov, ICST-2024: Advances in Information Control Systems and Technologies, Liha Press, Lviv, Ukraine, 2024, pp. 360-379. doi:10.36059/978-966-397-422-4.

[13] D. Port, M. Korte, Comparative studies of the model evaluation criterions MMRE and PRED in software cost estimation research, Proceedings of the 2nd. ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, ACM, New York, 2008, pp. 51–60. doi:10.1145/1414004.1414015.

[14] R. Subramanyam, M. Krishnan, Empirical Analysis of CK Metrics for Object-Oriented Design Complexity: Implications for Software Defects, volume 29 of IEEE Transactions on Software Engineering, pp. 297-310. doi:10.1109/TSE.2003.1191795.

[15] S. Prykhodko, N. Prykhodko, Mathematical Modeling of Non-Gaussian Dependent Random Variables by nonlinear Regression Models Based on the Multivariate Normalizing Transformations, in: S. Shkarlet, A. Morozov, A. Palagin, volume 1265 of Mathematical Modeling and Simulation of Systems (MODS'2020). Advances in Intelligent Systems and Computing, 2021, pp. 166-174. doi:10.1007/978-3-030-58124-4_16.

[16] K. V. Mardia, Measures of multivariate skewness and kurtosis with applications, volume 57 of Biometrika, 1970, pp. 519–530. doi:10.1093/biomet/57.3.519.

[17] I. Olkin, A. R. Sampson, Multivariate Analysis: Overview, in: N. J. Smelser, P. B. Baltes, International encyclopedia of social & behavioral sciences, 1st. ed., Elsevier, Pergamon, 2001, pp. 10240–10247.