Predicting Software Maintainability using Ensemble Techniques and Stacked Generalization

Sara Elmidaoui1, Laila Cheikhi1, Ali Idri1 and Alain Abran2
1 SPM Team, ENSIAS, Mohammed V University in Rabat, Morocco
2 Department of Software Engineering & Information Technology, ETS, Montréal, Canada
sara.elmidaoui@um5s.net.ma, laila.cheikhi@um5.ac.ma, ali.idri@um5.ac.ma, alain.abran@etsmtl.ca

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The prediction of software maintainability has emerged as an important research topic to address industry expectations for reducing costs, in particular maintenance costs. In the last decades, many studies have used single techniques to predict software maintainability, but there is no agreement as to which technique can achieve the best prediction. Ensemble techniques, which combine two or more techniques, have been investigated in recent years. This study investigates ensemble techniques (homogeneous as well as heterogeneous) for predicting maintainability in terms of line code changes. To this end, well-known homogeneous ensembles such as Bagging, Boosting, Extra Trees, Gradient Boosting, and Random Forest are investigated first. Then the stacked generalization method is used to construct heterogeneous ensembles by combining the most accurate ones per dataset. The empirical results suggest that Gradient Boosting and Extra Trees are the best ensembles for all datasets, since they ranked first and second, respectively. Moreover, the findings of the evaluation of heterogeneous ensembles constructed using stacked generalization showed that they gave better prediction accuracy compared to all homogeneous ensembles.

Keywords: Software Maintainability Prediction, Machine Learning, Ensemble Techniques, Stacked Generalization, Stacking, Homogeneous, Heterogeneous.

1 Introduction

Maintenance of a software product is recognized as a very time-consuming activity, and several attempts have been made to reduce its high cost by improving maintainability [1], which is defined as "the degree of effectiveness and efficiency with which a product or system can be modified by the intended maintainers" [2]. Several attempts have been made to predict software product maintainability (SPM) through empirical studies. However, predicting maintainability remains an open research area since the maintenance behaviors of software systems are complex and difficult to predict [3]. In fact, many systematic literature reviews (SLRs) have been conducted on software product maintainability prediction (SPMP) to provide an up-to-date review of this topic, such as [4], [5], [6], [7], [8]. The results show that most studies have investigated single techniques to identify the most accurate ones, while ensembles have received less attention; more studies are therefore required in order to identify the best ones. Furthermore, researchers have been unable to identify the best guidelines for developing an accurate technique; indeed, all SPMP techniques are prone to error as they depend to some degree on the empirical context, and no single technique can give a correct result in all circumstances. To tackle this issue, researchers have begun to investigate the use of many single techniques together, known as Ensemble Techniques (ETs).
As stated in [9]–[11], these can take one of two forms: Heterogeneous (HT), which combines at least two different techniques, and Homogeneous (HM), in which the single techniques (two at least) are of the same type. Two types of HM technique are proposed: one that combines the same base techniques with at least two configurations, and one that combines one meta-model with one base technique. Since ETs have proved their usefulness in improving accuracy in many areas such as software effort estimation [11], software fault prediction [12], [13] and face recognition [14], this study investigates whether they can improve the accuracy of software maintainability prediction.

In this context, a set of empirical studies that investigated ETs for SPMP were selected in [7] and summarized in Table 1, with the corresponding techniques used (ML techniques or base learners), the type of ensembles used (HM or HT), the rules used to combine the ML techniques, the datasets, and the accuracy criteria. As seen in the table, HM ensembles were used by means of Bagging [10], [15], [16], RF [17], [18], [16] and Boosting (i.e., AdaBoost, LogitBoost) [19]. HT ensembles were used with different variants, such as combining multilayer perceptron (MLP), radial basis function (RBF), support vector machine (SVM), and M5 for inducing trees of regression models (M5P) using the best in training (BTE) [9], average (AVG) and weighted average (WT) [10] rules. The ensembles were also evaluated by means of different datasets (e.g., User Interface Management System (UIMS) and QUality Evaluation System (QUES)) and a variety of accuracy criteria such as Mean Magnitude of Relative Error (MMRE), Standard deviation of Magnitude of Relative Error (Std.MRE), Percentage of Relative Error Deviation (Pred), True Positive Rate (TPR), and False Positive Rate (FPR). Moreover, from the above studies, it was found that ensembles provide greater or at least similar prediction accuracy compared to single techniques [9], and one study suggested investigating other combination rules for constructing HT ensembles [10]. None of the studies reported evidence on the best combination rules, or used Stacked Generalization (SG) [20], or investigated HM ensembles such as Extra Trees (ExTrees) and Gradient Boosting (GradBoost) for predicting software maintainability. This study is the first work that applies those variants of HM ensembles and combines the best of them using SG to construct HT ensembles.

The objective of this study is twofold: (1) investigate the use of five variants of HM ensembles, Bagging, Adaptive Boosting (AdaBoost), GradBoost, RF, and ExTrees, with their default base techniques, and (2) investigate the use of HT ensembles that are constructed from the most accurate HM ensembles, per dataset, using the SG method. This objective is achieved by addressing the following three research questions (RQs):

• RQ1: Among the five HM ensembles, which one generates the best SPMP accuracy?
• RQ2: Do the HT ensembles constructed with SG improve SPMP accuracy?
• RQ3: Which ensemble gives the best performance regardless of the dataset used?

The rest of the paper is structured as follows: Section 2 gives an overview of the five HM ensembles used. Section 3 presents the method used to construct the HT ensembles. Section 4 presents the empirical design of this study. Section 5 presents and discusses the empirical results obtained for the HM as well as the HT ensembles. Section 6 presents threats to the validity of this study.
Section 7 contains the conclusion and suggestions for future work.

Table 1. Related studies on ETs for SPMP

[17] Techniques: AODE, SVM with linear kernel, Naïve Bayes (NB), Bayesian Networks (BN), RF, K Nearest Neighbor, C4.5, OneR, RBF; ET type: HM; Combination rule: AVG; Datasets: Medical imaging system; Accuracy criteria: Weighted Average Precision (WAP), Weighted Average Recall (WARec)
[15] Techniques: DT, Back Propagation Neural Network (BPNN), SVM, Bagging; ET type: HM; Combination rule: AVG; Datasets: Open source system; Accuracy criteria: Precision, Recall, F1 score, TPR, FPR, Area Under Curve (AUC)
[18] Techniques: NB, BN, Logistic Regression (LgR), MLP, RF; ET type: HM; Combination rule: AVG; Datasets: Lucene, Hotdraw, JEdit, JTreeview; Accuracy criteria: Recall, Precision, Receiver Operating Characteristic
[16] Techniques: LgR, RF, Bagging, AdaBoost, NB, J48, MLP, LogitBoost, BN, Nearest-Neighbor-like that uses non-nested generalized exemplars; ET type: HM; Combination rule: AVG; Datasets: Art-of-Illusion, SweetHome-3D; Accuracy criteria: Sensitivity, Specificity, AUC, Cut-off point
[19] Techniques: TreeNet; ET type: HM; Combination rule: AVG; Datasets: QUES, UIMS; Accuracy criteria: MMRE, Pred(25), Pred(30)
[10] Techniques: SVM, MLP, LgR, Genetic Programming, K-means; ET type: HT; Combination rules: BTE, majority voting, decision tree forest; Datasets: VSSPLUGIN, PeerSim; Accuracy criteria: Correct classification rate (CCR), AUC
     Techniques: Bagging, Boosting (AdaBoost); ET type: HM; Combination rule: AVG; Datasets: VSSPLUGIN, PeerSim; Accuracy criteria: CCR, AUC
     Techniques: MLP, RBF, SVM, M5P; ET type: HT; Combination rules: AVG, BTE, WT; Datasets: QUES, UIMS; Accuracy criteria: MMRE, Pred(30), Std.MRE
[9] Techniques: MLP, RBF, SVM, M5P; ET type: HT; Combination rule: BTE; Datasets: QUES, UIMS; Accuracy criteria: MMRE, Std.MRE, Pred(30)

2 Homogeneous Ensemble Techniques

This section provides an overview of the five HM ensembles investigated in this empirical study, namely Bagging, AdaBoost, RF, ExTrees, and GradBoost.

2.1 Bagging

Bagging (also known as Bootstrap aggregation) is a well-known HM ensemble proposed by Breiman [21]. Bagging is based on the bootstrap method for creating a distribution of datasets with replacement from an original dataset [22]. Bagging trains each regression model (i.e., base learner) with the different training sets generated by sampling with replacement from the training data, then it averages the predictions of each constructed regression model to perform the final prediction [23]. To build a model based on bagging, the following steps are performed: "(1) Split the dataset into training set and test set, (2) get a bootstrap sample from the training data and train a predictor using the sample. Repeat the steps a random number of times. The models from the samples are combined (i.e., aggregated) by averaging the output for regression or voting for classification" [24]. Through this process, bagging turns weak learners into strong ones [25], reduces variance, helps to avoid overfitting [24], and improves regression models in terms of stability and accuracy [24]. Moreover, bagging has presented good results whenever the learning technique is unstable [21].
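To make the procedure concrete, the following minimal sketch shows bootstrap aggregation for regression with scikit-learn, the library on which the prototypes of Section 5 are built. The synthetic data, the number of estimators, and the variable names are illustrative assumptions, not the configuration used in this study; by default the base learner is a regression tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Illustrative stand-in for a maintainability dataset
# (rows = classes, columns = OO metrics, target = changed lines).
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=42)

# Bagging: draw bootstrap samples with replacement, fit one base learner
# (a regression tree by default) per sample, then average their predictions.
bagging = BaggingRegressor(n_estimators=30, bootstrap=True, random_state=42)
bagging.fit(X, y)
print(bagging.predict(X[:3]))  # mean of the 30 per-tree predictions
```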
2.2 Adaptive Boosting (AdaBoost)

AdaBoost is one of the first practical boosting methods, introduced by Freund and Schapire [26]. In AdaBoost, weak learners are combined and "boosted" to improve ensemble accuracy and produce a strong technique [27]. AdaBoost works as follows: "(…) creates a sample from the training data using the sampling weight vector. The base learner uses this sample to create a hypothesis that links the input to the output data. This hypothesis is applied to all the data to create predictions. The absolute relative error is calculated for each prediction and compared against a threshold, which is used to classify the predicted values as correct or incorrect. The sampling weight of incorrectly predicted samples is increased for the next iteration" [27]. AdaBoost is also "adaptive in that it adapts to the error rates of the individual weak hypotheses" [28] and "tends not to over-fit; on many problems, even after hundreds of rounds of boosting, the generalization error continues to drop, or at least does not increase" [26].

2.3 Random Forest (RF)

RF is an ensemble learner proposed by Breiman in 2001 [29] which "adds an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, RFs change how the regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node" [30]. In other words, RF uses both bagging, a successful approach for combining unstable learners [31], and random variable selection for tree building [17]. This strategy performs very well compared to many other ML techniques, including SVR and ANN [30]. RF is robust against overfitting [29], it can achieve both low bias and low variance [32], and "it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values" [30].

2.4 Extra Trees (ExTrees)

ExTrees (short for Extremely Randomized Trees) is a tree-based ensemble method for supervised classification and regression problems, proposed by Geurts et al. [33]. ExTrees "builds an ensemble of unpruned decision or regression trees according to the classical top-down procedure. Its two main differences with other tree-based ensemble methods are that it splits nodes by choosing cut-points fully at random and that it uses the whole learning sample to grow the trees" [33]. The technique for splitting "consists of choosing randomly a number of inputs at each node and the minimum sample size for splitting a node with the full original data to generate a number of trees that construct the ensembles. Then, the predictions of the trees are aggregated to yield the final prediction, by arithmetic average in the regression problems" [33]. The advantage of ExTrees is its capability "to reduce variance and minimize bias due to the use of the full original sample rather than bootstrap replicas" [33].

2.5 Gradient Boosting (GradBoost)

GradBoost (or gradient boosted trees) is an ensemble learner proposed by Friedman [34]. GradBoost constructs "additive regression models by sequentially fitting a simple parameterized function (base learner) to current 'pseudo' residuals by least-squares at each iteration. The pseudo residuals are the gradient of the loss functional being minimized, with respect to the model values at each training data point, evaluated at the current step. It uses a gradient descent algorithm for the shortcomings of weak learners instead of using a re-weighting mechanism. This algorithm is used to minimize the loss function (also called the error function) by moving in the opposite direction of the gradient and finding a local minimum" [35]. The main advantage is that "both the approximation accuracy and execution speed of Gradient Boosting can be substantially improved by incorporating randomization into the procedure" [35].
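All five HM ensembles above are available as regressors in scikit-learn, which is the API the prototypes in Section 5 are built on. The sketch below simply instantiates them with their default base techniques; the synthetic data, the dictionary name, and the in-sample MAE printed at the end are illustrative only and do not reproduce the experimental protocol of Section 5.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# The five homogeneous ensembles investigated in this study,
# each left with its default (tree-based) base technique.
hm_ensembles = {
    "Bagging":   BaggingRegressor(random_state=0),
    "AdaBoost":  AdaBoostRegressor(random_state=0),
    "RF":        RandomForestRegressor(random_state=0),
    "ExTrees":   ExtraTreesRegressor(random_state=0),
    "GradBoost": GradientBoostingRegressor(random_state=0),
}

for name, model in hm_ensembles.items():
    model.fit(X, y)  # the study itself uses LOOCV, see Section 5
    mae = mean_absolute_error(y, model.predict(X))
    print(f"{name}: in-sample MAE = {mae:.2f}")
```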
3 Heterogeneous Ensemble Techniques

As mentioned earlier, none of the SPMP models used the SG method to construct HT ensembles. SG is a different way of combining multiple models that introduces the concept of a meta-learner; it was proposed in 1992 by Wolpert [36]. The learning procedure is illustrated in the following algorithm and consists of the following steps: "(1) learn first-level classifiers based on the original training dataset, (2) construct a new dataset based on the output of base classifiers. Here, the output predicted labels of the first-level classifiers are regarded as new features, and the original class labels are kept as the labels in the new dataset, (3) learn a second-level classifier based on the newly constructed dataset. Any learning method could be applied to learn the second-level classifier" [37].

Algorithm Stacking [37]
Input: Training data D = {(x_i, y_i)}_{i=1}^{m}, with x_i ∈ R^n and y_i ∈ Y
Output: An ensemble classifier H
Step 1: Learn first-level classifiers
  for t ← 1 to T do
    Learn a base classifier h_t based on D
  end for
Step 2: Construct a new dataset from D
  for i ← 1 to m do
    Construct a new dataset that contains {x'_i, y_i}, where x'_i = {h_1(x_i), h_2(x_i), ..., h_T(x_i)}
  end for
Step 3: Learn a second-level classifier
  Learn a new classifier h' based on the newly constructed dataset
return H(x) = h'(h_1(x), h_2(x), ..., h_T(x))

In this study, we investigate the use of HT ensembles in SPMP by combining the most accurate HM ensembles using SG. The process adopted for SG consists of taking, at the first level, the predictions of each HM ensemble that appeared in the best cluster generated by the SK test per dataset, and using them as input variables for the second-level learning. At the second level, an algorithm (in our case, linear regression, since it is a simple technique for modeling the relationship between dependent and independent variables [38], [39]) is trained to optimally combine the models' predictions to form a new set of predictions [20]. Note that the second-level modeling is not restricted to any simple technique; the relationship between the predictions can be more complex, opening the door to other ML techniques [20]. The main advantages of SG are that it correctly classifies the target, thereby correcting any mistakes made by the models constructed at the first level [20], and it offers robust performance [37]. Moreover, "it is flexible and does not require particular expertise in deployment, due to its robust performance" [40].
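A minimal sketch of this construction is given below using scikit-learn's StackingRegressor with linear regression as the second-level learner. It assumes, purely for illustration, that GradBoost and ExTrees form the best cluster (as happens for the QUES and Xalan datasets in Section 5); note also that the library trains the meta-learner with its own internal cross-validation defaults rather than the exact LOOCV protocol of this study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

# First level: the HM ensembles retained in the best SK cluster
# (GradBoost and ExTrees here, as for QUES and Xalan).
first_level = [
    ("GradBoost", GradientBoostingRegressor(random_state=0)),
    ("ExTrees", ExtraTreesRegressor(random_state=0)),
]

# Second level: linear regression learned on the first-level predictions,
# i.e. the stacked generalization (SG) meta-learner used in this study.
heterogeneous = StackingRegressor(estimators=first_level,
                                  final_estimator=LinearRegression())
heterogeneous.fit(X, y)
print(heterogeneous.predict(X[:3]))
```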
4 Empirical Design

This section presents the empirical design used. Note that the same strategy has been adopted previously in [11], [41] for software development effort estimation, but using different datasets and ML techniques.

4.1 Datasets

Five publicly available datasets were used in this study, including two popular public datasets (UIMS and QUES [42], provided by researchers from the software engineering community) and three open-source software datasets (Log4j, Xalan, and JEdit, provided in the PROMISE repository).

The UIMS and QUES datasets are object-oriented (OO) software maintainability datasets from Li and Henry (L&H) [42], developed using Ada. UIMS contains class-level metrics data collected from 39 classes of a user interface management system, whereas QUES contains the same metrics collected from 71 classes of a quality evaluation system. Both datasets contain 10 independent variables (five OO metrics from Chidamber and Kemerer (C&K) [43], four OO metrics from L&H [42], and the traditional lines of code). Maintainability is expressed in terms of maintenance effort, measured by the number of lines changed per class; a line change could be an addition or a deletion, and a change of the content of a line is counted as a deletion and an addition [42].

The JEdit, Log4j, and Xalan datasets are based on open-source software projects written in Java and of different sizes: 161, 153, and 756, respectively. These three datasets each contain 20 features, including the six C&K metrics and other OO metrics. The dependent variable was expressed in terms of bugs in the existing datasets. Since the focus of this study is to evaluate maintainability in terms of code line changes, we calculated the change between the same classes from two versions of the software projects JEdit, Log4j, and Xalan using the Beyond Compare tool. It should be noted that the same methodology was used in [18], [44]–[46].

(Project homepages: Log4j, https://logging.apache.org/log4j/2.x/; Xalan, http://xalan.apache.org/index.html; JEdit, http://www.jedit.org/; Beyond Compare, https://www.scootersoftware.com/.)

4.2 Accuracy criteria

Accuracy criteria are essential in SPMP techniques. This study focused on the prediction data mining task, where the most commonly used criteria (see Table 1) were MMRE and Pred(0.25), which are based on the Magnitude of Relative Error (MRE) and are defined in (1), (2), and (3):

MRE_i = \frac{|y_i - \hat{y}_i|}{y_i}    (1)

MMRE = \frac{1}{N} \sum_{i=1}^{N} MRE_i    (2)

Pred(0.25) = \frac{100}{N} \sum_{i=1}^{N} [MRE_i \le 0.25]    (3)

where y_i and \hat{y}_i are the actual and predicted change for the i-th project, N is the number of data points (examples), and [·] equals 1 if the condition holds and 0 otherwise.

Since the MRE has been criticized for being biased and unbalanced [47], [48], [49], in this study we applied some unbiased accuracy criteria that do not present an asymmetric distribution [50], namely: Mean Balanced Relative Error (MBRE) [48], [51], Mean Inverted Balanced Relative Error (MIBRE) [48], [51], and Mean Absolute Error (MAE), which is based on the Absolute Error (AE). These accuracy criteria are defined in (4)–(7):

MBRE = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{\min(y_i, \hat{y}_i)}    (4)

MIBRE = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{\max(y_i, \hat{y}_i)}    (5)

AE_i = |y_i - \hat{y}_i|    (6)

MAE = \frac{1}{N} \sum_{i=1}^{N} AE_i    (7)

We also used the Logarithmic Standard Deviation (LSD) [48] (8) to evaluate the accuracy of prediction techniques, and Standardized Accuracy (SA) [52] (9) to solve the problem inherent in MRE-based accuracy criteria. To verify whether the predictions given by a technique are generated by chance and whether there is an improvement over random guessing, we used the effect size (∆) defined by (10). The ratio in SA represents "how much better p_i is than random guessing. Clearly, a value close to zero is discouraging and a negative value would be worrisome" [52]. It is recommended to use the 5% quantile of random guessing: "the interpretation of the 5% quantile for p_0 is similar to the use of α for conventional statistical inference, that is, any accuracy value that is better than this threshold has a less than one in twenty chance of being a random occurrence" [52].

LSD = \sqrt{ \frac{ \sum_{i=1}^{n} \left( \lambda_i + \frac{S^2}{2} \right)^2 }{ n - 1 } }    (8)

SA = 1 - \frac{MAE_{p_i}}{\overline{MAE}_{p_0}}    (9)

\Delta = \frac{MAE_{p_i} - \overline{MAE}_{p_0}}{S_{p_0}}    (10)

where \lambda_i = \ln(y_i) - \ln(\hat{y}_i), S^2 is an estimator of the variance of the residuals \lambda_i, S_{p_0} is the sample standard deviation of the random guessing strategy, MAE_{p_i} is the MAE of the prediction technique p_i, and \overline{MAE}_{p_0} is the mean MAE of a large number (typically 1000) of runs of random guessing.

Note that the above accuracy criteria were all used together previously in [11], [41] for software development effort estimation, but have never been used in previously published SPMP studies [6] (see Table 1).
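The criteria above are straightforward to compute from the actual and predicted values. The following sketch is one possible NumPy implementation of equations (1)–(10); the random-guessing baseline used for SA and the effect size is approximated here by 1000 shuffles of the actual values, and all function names are our own rather than part of any standard library.

```python
import numpy as np

def mae(y, yhat):
    return float(np.mean(np.abs(y - yhat)))            # eq. (6)-(7)

def pred(y, yhat, level=0.25):
    mre = np.abs(y - yhat) / y                          # eq. (1)
    return 100.0 * float(np.mean(mre <= level))         # eq. (3), in percent

def mbre(y, yhat):
    return float(np.mean(np.abs(y - yhat) / np.minimum(y, yhat)))  # eq. (4)

def mibre(y, yhat):
    return float(np.mean(np.abs(y - yhat) / np.maximum(y, yhat)))  # eq. (5)

def lsd(y, yhat):
    lam = np.log(y) - np.log(yhat)                      # residuals lambda_i
    s2 = np.var(lam, ddof=1)                            # estimator of their variance
    return float(np.sqrt(np.sum((lam + s2 / 2.0) ** 2) / (len(y) - 1)))  # eq. (8)

def sa_and_effect_size(y, yhat, runs=1000, seed=0):
    # Random guessing approximated by shuffling the actual values:
    # each case is "predicted" by another case's actual change.
    rng = np.random.default_rng(seed)
    mae_p0 = np.array([mae(y, rng.permutation(y)) for _ in range(runs)])
    sa = 1.0 - mae(y, yhat) / mae_p0.mean()                         # eq. (9)
    delta = (mae(y, yhat) - mae_p0.mean()) / mae_p0.std(ddof=1)     # eq. (10)
    return sa, delta
```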
4.3 Scott–Knott significance statistical test

Scott–Knott (SK) [53] is a hierarchical clustering algorithm used as an exploratory data analysis tool for the analysis of variance (ANOVA). It was developed by Scott and Knott as a way to partition treatment means into distinct, internally homogeneous groups while considering type I error [53]. In this study, the SK test was used to statistically investigate whether there is a significant difference between the ensembles based on MAE, i.e., to cluster the techniques that have similar prediction capability. Note that several studies in the software engineering field have used the SK test to rank the investigated techniques [11], [41], [54].

5 Results and Discussion

This section evaluates and discusses the prediction performance of the HM and HT ensembles used in this study on the five datasets. Fig. 1 presents the process followed to perform this empirical study. Two main steps are identified:

• Step 1: Identify the most accurate HM ensembles (Bagging, AdaBoost, GradBoost, RF, and ExTrees). This step includes:
  – Use of the five historical datasets, namely UIMS, QUES, JEdit, Log4j, and Xalan.
  – Application of the LOOCV method, which consists in using one instance from a dataset as the test set and the remaining N-1 instances as the training set, N times, where N is the number of instances in a specific dataset (a code sketch is given below).
  – Evaluation of the performance of the HM ensembles in terms of the SA and effect size accuracy criteria, eliminating those whose SA value is less than the 5% quantile of random guessing.
  – Use of the SK test to statistically compare the HM ensembles.
  – Use of the Borda count voting method to rank the HM ensembles that appeared in the best cluster using MAE, Pred(25), MBRE, MIBRE, and LSD.
• Step 2: The most accurate HM ensembles (from Step 1) are combined using SG to construct HT ensembles per dataset.

Fig. 1. Process of the empirical study

To perform this empirical study, different tools were used. A software prototype for each ML technique and each HM ensemble was developed in Python on top of the Scikit-learn API. The statistical tests (SK and Kolmogorov–Smirnov, and the Box–Cox transformation) were performed with the R software.
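As an illustration of Step 1, the sketch below runs LOOCV for a single ensemble and collects one absolute error per instance, from which MAE (and then SA, the SK test input, and the Borda count ranks) can be derived. The dataset and the chosen ensemble are placeholders, not the actual experimental setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneOut

X, y = make_regression(n_samples=60, n_features=10, noise=10.0, random_state=0)

# LOOCV: each instance is predicted once by a model trained on the
# remaining N-1 instances, yielding one absolute error per instance.
model = GradientBoostingRegressor(random_state=0)
abs_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model.fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])[0]
    abs_errors.append(abs(y[test_idx][0] - y_hat))

print(f"MAE over LOOCV: {np.mean(abs_errors):.2f}")
```

The same loop is repeated for every HM ensemble and every dataset before the SA filter, the SK test, and the Borda count are applied.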
5.1 Results and discussion – HM

Table 2 reports the performance of the HM ensembles based on the SA and effect size accuracy criteria over the five datasets. The second row of Table 2 shows the 5% quantile SA of random guessing for each dataset (SA5%). It can be seen that all HM ensembles generated better predictions than random guessing, since their SA values were higher than SA5%. Moreover, all HM ensembles across all datasets showed a large improvement over guessing, since their effect size test results are larger than 0.8. The following summarizes the results for each dataset and then for all datasets. For the UIMS dataset, GradBoost achieved good accuracy compared to the other ensembles with an SA value of 0.883, followed by ExTrees with 0.863; Bagging and AdaBoost also gave good results compared to RF, which ranked last. For the QUES dataset, GradBoost provided the best prediction results with an SA value of 0.957, followed by ExTrees (SA = 0.935); the other ensembles, Bagging, RF, and AdaBoost, were more or less equal in terms of SA values (≈ 0.91). For the JEdit dataset, GradBoost ranked first with an SA value of 0.961, followed by ExTrees (0.959); Bagging and RF gave nearly the same results, 0.943 and 0.942, respectively, and AdaBoost ranked last with an SA value of 0.890. For the Xalan dataset, GradBoost achieved the best results (SA = 0.981) compared to the other ensembles; Bagging and RF were more or less equal with SA values of 0.972 and 0.971, respectively, followed by ExTrees (SA = 0.963), while AdaBoost scored lowest with an SA value of 0.831. For the Log4j dataset, GradBoost came first with an SA value of 0.932, followed by ExTrees (SA = 0.927), then RF and Bagging, which were more or less equal with SA values of 0.904 and 0.903, respectively; AdaBoost ranked last with an SA value of 0.865. Across all datasets, the results are as follows: all HM ensembles achieved better predictions than random guessing; GradBoost ranked first in all datasets; Bagging and RF achieved generally the same results in all datasets, except that in the UIMS dataset RF ranked last; and AdaBoost ranked last in four datasets and 4th in one dataset.

Table 2. SA and effect size of the five HM ensembles

Dataset:      UIMS        QUES        JEdit       Xalan       Log4j
SA5%:         0.0200      0.0189      0.0184      0.0003      0.0188
HM technique  SA   |∆|    SA   |∆|    SA   |∆|    SA   |∆|    SA   |∆|
Bagging       0.80  77    0.91 140    0.94 207    0.97 159    0.90 176
ExTrees       0.86 160    0.93  88    0.95 165    0.96 169    0.92 262
RF            0.78  46    0.91 174    0.94 170    0.97 209    0.90 100
GradBoost     0.88  82    0.95 136    0.96 240    0.98 218    0.93  95
AdaBoost      0.80  80    0.91 149    0.89 115    0.83 203    0.86 117

Furthermore, Fig. 2 shows the plot of the SK test of the constructed ensembles per dataset. The x-axis represents the selected ensembles and the y-axis represents the transformed AEs. Every vertical line shows the variation of the transformed AEs for each technique, and the small circle represents the mean of the transformed AEs. The farther to the right a technique is positioned, the better its performance, and the box in the right-hand group for each dataset indicates the best cluster. Note that we transformed the AE values using the Box–Cox method [55], since the data (i.e., the AEs) do not follow a normal distribution in all cases. As can be seen from Fig. 2, two groups were generated for all datasets except Xalan, for which four groups were generated by the SK test. ExTrees, GradBoost, Bagging, and RF appear in the best cluster in three datasets (JEdit, Log4j, and UIMS), which means that these ensembles are statistically indistinguishable there. For the QUES and Xalan datasets, only GradBoost and ExTrees appear in the best cluster. The AdaBoost technique appears in the last cluster in all datasets.

Fig. 2. Plot of the SK test of HM ensembles in each dataset

Table 3 presents the ranks of the HM ensembles that appear in the best cluster, calculated by the Borda count method over five accuracy criteria (MAE, Pred(25), MBRE, MIBRE, and LSD). GradBoost ranked first and ExTrees second in all datasets. This leads to the conclusion that some ensembles are solid and scalable to datasets of different software systems and can be used across software systems for predicting maintainability [3]. As for Bagging and RF, they ranked third or fourth on three datasets, leading us to conclude that these two ensembles generate similar results in predicting software maintainability.
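For reference, the Borda count aggregation used here can be sketched as follows; it assumes that the techniques have already been ordered from best to worst for each accuracy criterion, and the per-criterion orderings shown are hypothetical.

```python
from collections import defaultdict

def borda_count(rankings):
    """rankings: one list per accuracy criterion (e.g. MAE, Pred(25), MBRE,
    MIBRE, LSD), each ordered from best to worst technique."""
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, technique in enumerate(ranking):
            scores[technique] += k - 1 - position  # best technique gets k-1 points
    # Final rank: highest total Borda score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-criterion orderings for one dataset's best cluster.
per_criterion = [
    ["GradBoost", "ExTrees", "Bagging", "RF"],   # e.g. by MAE
    ["GradBoost", "ExTrees", "RF", "Bagging"],   # e.g. by Pred(25)
    ["ExTrees", "GradBoost", "Bagging", "RF"],   # e.g. by MBRE
]
print(borda_count(per_criterion))
```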
Table 3. Rank of HM ensembles of the best cluster using Borda count

Rank  UIMS       QUES       Log4j      JEdit      Xalan
1     GradBoost  GradBoost  GradBoost  GradBoost  GradBoost
2     ExTrees    ExTrees    ExTrees    ExTrees    ExTrees
3     Bagging    –          RF         Bagging    –
4     RF         –          Bagging    RF         –

5.2 Results and discussion – HT

The HM ensembles placed by the SK test in the best cluster (Table 3) for each dataset were combined using SG. Table 4 presents the constructed HT ensembles: Bagging, GradBoost, ExTrees, and RF were combined for the UIMS, JEdit, and Log4j datasets, while GradBoost and ExTrees were combined for the Xalan and QUES datasets.

The results of SG in Table 4 indicate that the HT ensembles give the best results when Bagging, GradBoost, ExTrees, and RF are combined for the UIMS, JEdit, and Log4j datasets. The same results were obtained for QUES and Xalan by combining GradBoost and ExTrees. We can conclude that SG with the proposed combination for each dataset achieves good results, with an SA value of 1.00. In fact, such a result was expected, since all the HM ensembles used to construct the HT ones belong to the best SK cluster (i.e., Bagging, GradBoost, ExTrees, and RF), with SA values ranging from 0.783 to 0.981, and the use of SG further improved the results to this level of accuracy. However, the result obtained in this step is too good to be true, and replicated studies are needed to confirm this finding.

Table 4. Results of HT ensembles over datasets

Dataset  Techniques combined            SA    |∆|
UIMS     Bagging+GradBoost+ExTrees+RF   1.00   57
QUES     GradBoost+ExTrees              1.00  191
JEdit    Bagging+GradBoost+ExTrees+RF   1.00  218
Log4j    Bagging+GradBoost+ExTrees+RF   1.00  192
Xalan    GradBoost+ExTrees              1.00  178

Table 5 compares the constructed HT and HM ensembles. The proposed ETs were ranked using the Borda count method based on five accuracy criteria: Pred(25), MBRE, MIBRE, MAE, and LSD. As shown in Table 5, the HT ensembles constructed using SG ranked first in all datasets compared to the HM ensembles, and thus, based on this study's findings, represent a good solution for predicting software maintainability in terms of line code changes.

Table 5. Rank of HM and HT using Borda count (ranks 1 to 6 per dataset)
UIMS:  1. Bagging+GradBoost+ExTrees+RF, 2. GradBoost, 3. ExTrees, 4. Bagging, 5. RF, 6. AdaBoost
QUES:  1. GradBoost+ExTrees, 2. GradBoost, 3. ExTrees, 4. RF, 5. Bagging, 6. AdaBoost
JEdit: 1. Bagging+GradBoost+ExTrees+RF, 2. GradBoost, 3. ExTrees, 4. RF, 5. Bagging, 6. AdaBoost
Log4j: 1. Bagging+GradBoost+ExTrees+RF, 2. GradBoost, 3. ExTrees, 4. Bagging, 5. RF, 6. AdaBoost
Xalan: 1. GradBoost+ExTrees, 2. GradBoost, 3. ExTrees, 4. RF, 5. Bagging, 6. AdaBoost

6 Threats to Validity

This section presents the threats to the three aspects of validity of this empirical study (i.e., internal, external, and construct validity), according to [11], [56], [57].

Internal validity is related to the use of a biased validation method to assess the performance of prediction techniques. Many SPMP studies use the whole dataset or the hold-out method, so that the evaluation focuses only on a unique subset of data, which can result in a biased and unreliable assessment of performance. To overcome this limitation, LOOCV is used as the validation method; it generates the same results when the empirical study is replicated on a particular dataset, which is not the case for cross-validation [58].

External validity is related to the degree of generalizability of the results. The proposed ensembles were evaluated over five datasets from different sources and different application domains.
The datasets used in this study vary in terms of size, number of features, and application domain, which makes them adequate for evaluating our techniques. In addition, this study deals only with numerical data; hence, its findings may differ from those of studies that use other types of data.

Construct validity is related to the reliability and credibility of the accuracy criteria used to assess performance. This study used SA, an unbiased accuracy criterion that is less vulnerable to asymmetry, proposed by Shepperd and MacDonell [52]. Moreover, five accuracy criteria were used (MAE, Pred(25), MBRE, MIBRE, and LSD), aggregated by means of the Borda count voting method. The most widely used accuracy criterion in the SPMP area, MMRE, was not used in this study because many researchers have criticized it for favoring underestimation [48], [52], [59].

7 Conclusion and Future Work

In this empirical study, the performance of five HM ensembles (Bagging, AdaBoost, GradBoost, RF, and ExTrees) for predicting software maintainability was assessed over five datasets using the LOOCV method, based on SA and effect size. The SK test and the Borda count were used to statistically compare and rank the HM ensembles based on five accuracy criteria: MAE, Pred(25), MBRE, MIBRE, and LSD. In addition, we took the HM ensembles that appeared in the best cluster of each dataset and combined them using SG to construct HT ensembles. The findings with respect to the research questions are the following:

• RQ1: Among the five HM ensembles, which one generates the best accuracy of software product maintainability prediction? The empirical evaluations showed that GradBoost achieved the best results compared to the other HM ensembles in all datasets.
• RQ2: Do the HT ensembles constructed with SG improve SPMP accuracy? The ensembles combined using SG outperformed the HM ensembles in all datasets and generally gave good results in terms of SA values (SA = 1.00 for all combinations).
• RQ3: Which ensemble gives the best performance regardless of the dataset used? For all datasets, GradBoost ranked first and ExTrees second, which shows that these two techniques perform well across different datasets.

Ongoing work in SPMP involves using other HM ensembles such as Light Gradient Boosting and eXtreme Gradient Boosting, as well as investigating HT ensembles built by combining base learners with other combination rules such as average, weighted average, best in training, etc. Moreover, replicating this empirical study with other datasets may help to confirm or refute our conclusions. Generalized results regarding the prediction of software product maintainability are therefore still not available to meet software industry needs and expectations for software maintainability prediction.

References

1. A. Abran and H. Nguyenkim, "Measurement of the maintenance process from a demand-based perspective," J. Softw. Maint. Res. Pract., pp. 63–90, 5(2), (1993).
2. "ISO/IEC 25010:2011 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — System and software quality models," Geneva, Switzerland, (2011).
3. A. Kaur and K. Kaur, "Statistical comparison of modelling methods for software maintainability prediction," Int. J. Softw. Eng. Knowl. Eng., pp. 743–774, 23(06), (2013).
4. M. Riaz, E. Mendes, and E. Tempero, "A systematic review of software maintainability prediction and metrics," in 3rd International Symposium on Empirical Software Engineering and Measurement, pp. 367–377, Lake Buena Vista, FL, USA (2009).
5. M. Riaz, "Maintainability prediction of relational database-driven applications: a systematic review," in International Conference on Evaluation & Assessment in Software Engineering, IET, Ciudad Real, Spain (2012).
6. S. Elmidaoui, L. Cheikhi, A. Idri, and A. Abran, "Empirical Studies on Software Product Maintainability Prediction: A Systematic Mapping and Review," e-Informatica Softw. Eng. J., pp. 141–202, 13(1), (2019).
7. S. Elmidaoui, L. Cheikhi, A. Idri, and A. Abran, "Machine Learning Techniques for Software Maintainability Prediction: Accuracy Analysis," J. Comput. Sci. Technol., (2020).
8. H. Alsolai and M. Roper, "A systematic literature review of machine learning techniques for software maintainability prediction," Inf. Softw. Technol., 106214, vol. 119, (2020).
9. H. Aljamaan, M. O. Elish, and I. Ahmad, "An ensemble of computational intelligence models for software maintenance effort prediction," in International Work-Conference on Artificial Neural Networks, pp. 592–603, (2013).
10. M. O. Elish, H. Aljamaan, and I. Ahmad, "Three empirical studies on predicting software maintainability using ensemble methods," Soft Comput., pp. 2511–2524, 19(9), (2015).
11. M. Hosni, A. Idri, A. Abran, and A. B. Nassif, "On the Value of Parameter Tuning in Heterogeneous Ensembles Effort Estimation," Soft Comput., pp. 5977–6010, 22(18), (2018).
12. P. L. Braga, A. L. I. Oliveira, G. H. T. Ribeiro, and S. R. L. Meira, "Bagging Predictors for Estimation of Software Project Effort," in 2007 International Joint Conference on Neural Networks, pp. 1595–1600, (2007).
13. H. I. Aljamaan and M. O. Elish, "An empirical study of bagging and boosting ensembles for identifying faulty classes in object-oriented software," in IEEE Symposium on Computational Intelligence and Data Mining, pp. 187–194, (2009).
14. S. Gutta and H. Wechsler, "Face recognition using hybrid classifier systems," in International Conference on Neural Networks (ICNN'96), pp. 1017–1022, vol. 2, (1996).
15. F. Ye, X. Zhu, and Y. Wang, "A new software maintainability evaluation model based on multiple classifiers combination," in International Conference on Quality, Reliability, Risk, Maintenance, and Safety Engineering (QR2MSE), pp. 1588–1591, (2013).
16. M. Ruchika and J. Ravi, "Prediction and Assessment of Change Prone Classes Using Statistical and Machine Learning Techniques," J. Inf. Process. Syst., pp. 778–804, 13(4), (2017).
17. Y. Tian, C. Chen, and C. Zhang, "AODE for source code metrics for improved software maintainability," in International Conference on Semantics, Knowledge and Grid, pp. 330–335, Beijing, China (2008).
18. A. Kaur, K. Kaur, and K. Pathak, "Software maintainability prediction by data mining of software code metrics," in International Conference on Data Mining and Intelligent Computing (ICDMIC), pp. 1–6, New Delhi, India (2014).
19. M. O. Elish and K. O. Elish, "Application of TreeNet in predicting object-oriented software maintainability: A comparative study," in European Conference on Software Maintenance and Reengineering, pp. 69–78, Kaiserslautern, Germany (2009).
20. C. Sammut and G. I. Webb, Eds., "Stacked Generalization," in Encyclopedia of Machine Learning, Boston, MA: Springer US, (2010).
21. L. Breiman, "Bagging Predictors," Mach. Learn., pp. 123–140, 24(2), (1996).
22. B. Efron, D. Rogosa, and R. Tibshirani, "Resampling Methods of Estimation," Int. Encycl. Soc. Behav. Sci., pp. 492–495, (2015).
23. S. B. Kotsiantis, D. Kanellopoulos, and I. D. Zaharakis, "Bagged Averaging of Regression Models," in Artificial Intelligence Applications and Innovations, pp. 53–60, (2006).
24. G. P. Kumari, "A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier," Eng. Sci. Technol. An Int. J., 2(5), (2012).
25. L. L. Minku and X. Yao, "A Principled Evaluation of Ensembles of Learning Machines for Software Effort Estimation," in International Conference on Predictive Models in Software Engineering, pp. 9:1–9:10, Banff, Alberta, Canada (2011).
26. Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Comput. Syst. Sci., pp. 119–139, 55(1), (1997).
27. N. Kummer and H. Najjaran, "AdaBoost.MRT: Boosting regression for multivariate estimation," Artif. Intell. Res., pp. 64–76, 3(4), (2014).
28. Y. Freund, R. Schapire, and N. Abe, "A short introduction to boosting," Journal-Japanese Soc. Artif. Intell., pp. 771–780, 14(1612), (1999).
29. L. Breiman, "Random Forests," Mach. Learn., pp. 5–32, 45(1), (2001).
30. A. Liaw, M. Wiener, and others, "Classification and regression by randomForest," R News, pp. 18–22, 2(3), (2002).
31. L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and regression trees," Wadsworth Int. Gr., pp. 237–251, 37(15), (1984).
32. R. Díaz-Uriarte and S. Alvarez de Andrés, "Gene selection and classification of microarray data using random forest," BMC Bioinformatics, 7(1), (2006).
33. P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Mach. Learn., pp. 3–42, 63(1), (2006).
34. J. H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," Ann. Stat., pp. 1189–1232, vol. 29, (2001).
35. R. Tugay and S. G. Ögüdücü, "Demand Prediction using Machine Learning Methods and Stacked Generalization," in International Conference on Data Science, Technology and Applications - Volume 1: DATA, pp. 216–222, Madrid, Spain (2017).
36. D. H. Wolpert, "Stacked generalization," Neural Networks, pp. 241–259, 5(2), (1992).
37. C. C. Aggarwal, Data Classification: Algorithms and Applications. CRC Press, (2014).
38. C. Catal, "Review: Software Fault Prediction: A Literature Review and Current Trends," Expert Syst. Appl., pp. 4626–4636, 38(4), (2011).
39. C. Catal and B. Diri, "A Systematic Review of Software Fault Prediction Studies," Expert Syst. Appl., pp. 7346–7354, 36(4), (2009).
40. L. Jonsson, Machine Learning-Based Bug Handling in Large-Scale Software Development, vol. 1936. Linköping University Electronic Press, (2018).
41. M. Azzeh, A. B. Nassif, and L. L. Minku, "An Empirical Evaluation of Ensemble Adjustment Methods for Analogy-based Effort Estimation," J. Syst. Softw., pp. 36–52, 103(C), (2015).
42. W. Li and S. Henry, "Object-oriented metrics that predict maintainability," J. Syst. Softw., pp. 111–122, 23(2), (1993).
43. S. R. Chidamber and C. F. Kemerer, "A metrics suite for object oriented design," IEEE Trans. Softw. Eng., pp. 476–493, 20(6), (1994).
44. L. Kumar and S. Ashish, "A Comparative Study of Different Source Code Metrics and Machine Learning Algorithms for Predicting Change Proneness of Object Oriented Systems," arXiv preprint arXiv:1712.07944, (2018).
45. L. Kumar and S. K. Rath, "Hybrid Functional Link Artificial Neural Network Approach for Predicting Maintainability of OO Software," J. Syst. Softw., pp. 170–190, 121(C), (2016).
46. A. Kaur, K. Kaur, and K. Pathak, "A proposed new model for maintainability index of open source software," in 3rd International Conference on Reliability, Infocom Technologies and Optimization, pp. 1–6, Noida, India (2014).
47. A. Idri, I. Abnane, and A. Abran, "Evaluating Pred(p) and standardized accuracy criteria in software development effort estimation," J. Softw. Evol. Process, 30(4), (2018).
48. T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit, "A simulation study of the model evaluation criterion MMRE," IEEE Trans. Softw. Eng., pp. 985–995, 29(11), (2003).
49. I. Myrtveit, E. Stensrud, and M. Shepperd, "Reliability and Validity in Comparative Studies of Software Prediction Models," IEEE Trans. Softw. Eng., pp. 380–391, 31(5), (2005).
50. L. L. Minku and X. Yao, "Ensembles and locality: Insight on improving software effort estimation," Inf. Softw. Technol., pp. 1512–1528, 55(8), (2013).
51. Y. Miyazaki, A. Takanou, H. Nozaki, N. Nakagawa, and K. Okada, "Method to estimate parameter values in software prediction models," Inf. Softw. Technol., pp. 239–243, 33(3), (1991).
52. M. Shepperd and S. MacDonell, "Evaluating Prediction Systems in Software Project Estimation," Inf. Softw. Technol., pp. 820–827, 54(8), (2012).
53. A. J. Scott and M. Knott, "A cluster analysis method for grouping means in the analysis of variance," Biometrics, pp. 507–512, (1974).
54. N. Mittas and L. Angelis, "Ranking & Clustering Software Cost Estimation Models through a Multiple Comparisons Algorithm," IEEE Trans. Softw. Eng., pp. 537–551, vol. 39, (2013).
55. G. E. P. Box and D. R. Cox, "An analysis of transformations," J. R. Stat. Soc. Ser. B, pp. 211–252, (1964).
56. S. Elmidaoui, L. Cheikhi, and A. Idri, "The Impact of SMOTE and Grid Search on Maintainability Prediction Models," in ACS/IEEE International Conference on Computer Systems and Applications, AICCSA, Abu Dhabi, United Arab Emirates, (2019).
57. M. Hosni, A. Idri, and A. Abran, "Investigating Heterogeneous Ensembles with Filter Feature Selection for Software Effort Estimation," in International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement, pp. 207–220, (2017).
58. E. Kocaguneli and T. Menzies, "Software effort models should be assessed via leave-one-out validation," J. Syst. Softw., pp. 1879–1890, vol. 86, (2013).
59. Y. Miyazaki, M. Terakado, K. Ozaki, and H. Nozaki, "Robust regression for developing software estimation models," J. Syst. Softw., pp. 3–16, 27(1), (1994).