                   6th International Workshop on Quantitative Approaches to Software Quality (QuASoQ 2018)



    Cross-Sub-Project Just-in-Time Defect Prediction
                on Multi-Repo Projects
                                        Yeongjun Cho, Jung-Hyun Kwon, In-Young Ko
                                                       School of Computing
                                        Korea Advanced Institute of Science and Technology
                                                   Daejeon, Republic of Korea
                                            {yj cho, junghyun.kwon, iko}@kaist.ac.kr


   Abstract—Just-in-time (JIT) defect prediction, which predicts defect-inducing code changes, can provide faster and more precise feedback to developers than traditional module-level defect prediction methods. We find that large-scale projects such as Google Android and Apache Maven divide their projects into multiple sub-projects, in which related source code is managed separately in different repositories. Although sub-projects tend to suffer from a lack of the historical data required to build a defect prediction model, the feasibility of applying cross-sub-project JIT defect prediction has not yet been studied. A cross-sub-project model that predicts bug-inducing commits in a target sub-project can be built with data from all other sub-projects within the same project as the target sub-project, or with data from the sub-projects of other projects, as in traditional project-level JIT defect prediction. Alternatively, we can rank sub-projects and select high-ranked sub-projects within the project to build a filtered-within-project model. In this work, we define a sub-project similarity measure, based on the number of developers who have contributed to both sub-projects, to rank sub-projects. We extract the commit data of 232 sub-projects across five different projects and evaluate the cost effectiveness of various cross-sub-project JIT defect prediction models. Based on the results of the experiments, we conclude that 1) cross-sub-project JIT defect prediction generally has better cost effectiveness than within-sub-project JIT defect prediction, especially when sub-projects from the same project are used as training data; 2) among filtered-within-project JIT defect prediction models, developer similarity-based ranking achieves higher cost effectiveness than the other ranking methods; and 3) although a developer similarity-based filtered-within-project model achieves lower cost effectiveness than a within-project model in general, there is room for further improvement that may allow the filtered-within-project model to outperform the within-project model.
   Index Terms—Just-in-time prediction, defect prediction, sub-project

                         I. INTRODUCTION

   The quality assurance (QA) process for large-scale software usually requires a large amount of computing resources, human resources, and time, which may not be affordable for organizations with limited testing resources. Because of these resource limits, an organization cannot investigate all modules—files or packages—in a project and therefore needs to prioritize the inspection of the modules. If it fails to prioritize the modules that actually contain defects and assigns its limited testing resources elsewhere, the released software has a higher chance of containing defects. Defect prediction helps developers identify modules that are likely to have defects and provides a list of the modules that need to be treated first so that limited resources can be assigned efficiently [1]. To predict the likelihood of defects in each module, most defect prediction techniques build a prediction model based on various metrics, such as complexity [2] and change-history measures [3].

   Although a defect prediction technique helps developers narrow down the modules to inspect, module inspection usually still requires many resources because it is difficult for developers to locate a defect inside a suspicious module. To overcome this drawback of file-level defect prediction, recent studies introduced just-in-time (JIT) defect prediction models that predict whether a code change, rather than a file, induces defects. Because the code-change level is more fine-grained than the file level, predicting defect-inducing code changes is known to be effective in providing precise feedback to developers. Moreover, a JIT defect prediction technique can provide feedback as soon as a change is made to the source code repository [4]–[6].

   In general, building a defect prediction model requires a large amount of historical data from a project; therefore, it is difficult to build a model for projects that have just started or for legacy systems in which past change-history data are not available. A cross-project defect prediction model can be built by utilizing the history data of other projects to predict defects in a project that lacks data. Cross-project defect prediction techniques have been actively studied [7], [8], and existing studies state that when building a cross-project defect prediction model, choosing a set of training data that is closely related to the target project is essential to ensure the performance of defect prediction [8].

   We found that many large-scale projects, such as Apache Maven, Google Android, and Samsung Tizen, divide their projects into multiple repositories, called sub-projects. Each sub-project contains files that are relevant to the sub-project's main concerns—such as the project core, independent artifacts, and plug-ins—and those files are managed in a separate source code repository.

   Each sub-project generally has fewer commits in its repository than a monolithic repository, in which all files and commits are managed within a single repository. For instance, Google Android in 2012 had 275 sub-projects, and 183 of those sub-projects contained less than 100 commits [9].




In addition, there might be changes in the design and architecture of the project that deprecate some sub-projects and introduce new ones. In the case of Google Android, there were 275 sub-projects in 2012 [9], and the number of sub-projects has grown by over one thousand as of 2018¹.

¹ https://android.googlesource.com

   Therefore, applying JIT defect prediction to sub-projects could be problematic because it is well known that building a prediction model on a small amount of data may increase the risk of overfitting and make the prediction less robust to outliers [10]. This motivated us to investigate the feasibility of applying cross-sub-project JIT defect prediction to multi-repo projects.

   As Kamei et al. [11] studied at the project level, it might be enough to use all available sub-projects to build a cross-sub-project JIT defect prediction model, without selecting only similar sub-projects. However, because sub-project-level change history is more fine-grained than project-level history, we may achieve higher cost effectiveness by filtering out irrelevant data based on fine-grained information about the data. For instance, we find that developers of a target sub-project usually contribute to multiple sub-projects of the target project rather than to a single sub-project. This inspired us to develop a new similarity metric that measures the similarity between two sub-projects based on the number of authors (developers) who made commits to both sub-projects. Because different developers usually have different defect patterns [12], we expect that a JIT defect prediction model built from sub-project repositories whose contributors are similar to those of the target sub-project can show better prediction performance than a model built from all available repositories.

   In this paper, we study ways of transferring JIT defect prediction models to models that are appropriate for multi-repo projects in terms of cost effectiveness. We establish two research questions (RQs) to check the effectiveness of cross-sub-project JIT defect prediction:

RQ 1: Is the cross-sub-project model trained with sub-projects from the same project more cost effective than the one trained with sub-projects from other projects?

   To build a cross-sub-project JIT defect prediction model, training data from other sub-projects are required. As training data are among the most important factors for improving a model's cost effectiveness, choosing proper sources of training data is the first concern when building cross-sub-project models. Before selecting sub-projects based on their similarity to the target sub-project, we want to compare the cost effectiveness of models built with sub-projects from the same project and models built with sub-projects from other projects. If the models built with the sub-projects from the same project perform better, an organization may not need to spend time collecting commit data from other projects.

RQ 2: Which sub-project ranking method performs best in filtered-within-project JIT defect prediction?

   For this research question, we build filtered-within-project JIT defect prediction models that filter out low-ranked sub-projects within the project, where sub-projects are ranked by a score calculated by a ranking method. We use four different sub-project ranking methods: our developer similarity-based ranking, the domain-agnostic similarity-based ranking by Kamei et al. [11], size-based ranking, and random ranking. We then compare the cost effectiveness of the four ranking methods with different numbers of selected sub-projects to find the best-performing ranking method for cross-sub-project JIT defect prediction.

   By answering the research questions above, we conclude that 1) a JIT defect prediction model built with training data from the same project is preferred over a model built with training data from other projects; 2) the amount of training data is not the only factor that affects the cost effectiveness of cross-sub-project models; 3) a developer similarity that counts the number of developers who contributed to both sub-projects is the most preferable way of filtering out irrelevant sub-projects when building filtered-within-project models; and 4) although filtered-within-project models have lower cost effectiveness than within-project models, there is room for further improvement of the cost effectiveness of filtered-within-project models.

   The main contributions of this paper are 1) proposing sub-project-level defect prediction that can achieve higher cost effectiveness than the traditional project-level within-project method by using developer similarity-based filtered-within-project models; and 2) evaluating the cost effectiveness of various cross-sub-project JIT defect prediction models on 232 sub-projects to check their feasibility.

   The rest of the paper is organized as follows: Section II describes the experiment settings, and Section III presents the design and results of the experiments. Section IV discusses related work, and Section V reports threats to validity. Finally, Section VI concludes this study.

                       II. EXPERIMENT SETTING

A. Studied Projects and Sub-projects

   We select five open-source projects that are divided into multiple sub-projects. We try to choose projects with different characteristics in terms of programming language, domain, and distribution of the number of commits.

   Table I shows the statistics of the five projects. For each project, we count the number of sub-projects used in the experiments and report the distribution of commit counts over its sub-projects. We excluded sub-projects with less than 50 commits because we regard such sub-projects as having too little data to evaluate. During the experiment, we split the commits of each sub-project into ten slices for cross-validation. If a slice contains only clean commits and no bug-inducing commits, we cannot evaluate the performance measure for that slice. If too many slices are discarded, performance measures with extreme values may occur due to the lack of valid evaluation results.




If a slice has five commits, the probability that the slice has at least one buggy commit is 1 − (1 − avg. defect ratio)^5. Because the average defect ratio across the five projects we collected is 0.14, this probability is about 0.52. Thus, we can expect that roughly half of the slices will give valid evaluation results for sub-projects with more than 50 commits. For the Tizen and Android projects, we use a subset of the sub-projects within those projects because of their large size. For the Tizen project, we use the sub-projects with the prefix platform/core/api. For the Android project, sub-projects with the prefix platform/package/apps are used. We also calculate the ratio of commits that induce any defect by using the SZZ algorithm [13], which is explained in Section II-C, and the average number of sub-projects contributed to per developer, which indicates the feasibility of using information about developers who contributed to multiple sub-projects within a project to calculate the developer similarity between sub-projects.
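   As an illustration of the slicing scheme and the estimate above, the following minimal Python sketch splits a chronologically ordered commit list into ten slices and computes the probability that a slice contains at least one bug-inducing commit. The commit list and the independence assumption are illustrative; this is not the script used in the experiments.

```python
import numpy as np

def split_into_slices(commits, n_slices=10):
    """Split a chronologically ordered sequence of commits into
    n_slices contiguous slices for the cross-validation."""
    return np.array_split(np.asarray(commits, dtype=object), n_slices)

def prob_slice_has_buggy_commit(avg_defect_ratio, slice_size):
    """Probability that a slice of slice_size commits contains at least one
    bug-inducing commit, assuming commits are buggy independently with
    probability avg_defect_ratio."""
    return 1.0 - (1.0 - avg_defect_ratio) ** slice_size

# With the rounded average defect ratio of 0.14 and five-commit slices,
# roughly half of the slices are expected to be usable for evaluation.
print(round(prob_slice_has_buggy_commit(0.14, 5), 2))  # ~0.53
```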
                                                                         [11], logistic regression (LR) [4] and naive Bayes (NB) [15]
B. Change Metrics

   We used 14 change metrics that are widely adopted in the JIT defect prediction field [4], [14]. Table II shows the descriptions of these change metrics. In the metric descriptions, the term subsystem refers to the directories that are directly accessible from the root directory.

   We apply some modifications to these metrics because our study is conducted at the sub-project level rather than the project level at which the metrics were originally defined. In particular, the developer experience-related metrics are originally defined under the scope of the project, but we changed this to the scope of the sub-project.

   To prevent the multicollinearity problem in prediction models, we exclude metric values that are strongly correlated with other metrics. We calculate the Pearson correlation coefficient between each pair of the 14 metrics across 344,005 commits. We then regard pairs with a Pearson correlation coefficient higher than 0.8 as correlated and exclude one of the metrics in each correlated pair. As a result, we excluded ND, NF, LT, NDEV, NUC, and SEXP, and the remaining eight metrics (NS, Entropy, LA, LD, FIX, AGE, EXP, and REXP) are used throughout our experiments; Table II describes all 14 metrics.
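   The correlation-based metric filtering described above can be sketched as follows. This is a minimal illustration assuming the metrics are held in a pandas DataFrame with one row per commit; using the absolute correlation and keeping the first metric of a correlated pair are assumptions, since the section does not state those details.

```python
import pandas as pd

def drop_correlated_metrics(metrics: pd.DataFrame, threshold: float = 0.8):
    """Return the metric table with one metric of every highly correlated
    pair removed, plus the list of dropped metric names."""
    corr = metrics.corr(method="pearson").abs()  # absolute correlation (assumed)
    dropped = set()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        if a in dropped:
            continue
        for b in cols[i + 1:]:
            if b not in dropped and corr.loc[a, b] > threshold:
                dropped.add(b)  # keep `a`, drop `b` (assumed tie-breaking rule)
    return metrics.drop(columns=sorted(dropped)), sorted(dropped)
```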
                                                                         the receiver operating characteristic curve (AU CROC) as
C. Labeling

   To build and evaluate a prediction model, a buggy or clean label must be assigned to each change. Instead of manual labeling, which would cost a lot of time, we use the SZZ algorithm [13] to label bug-inducing commits automatically. The SZZ algorithm first finds bug-fixing changes by inspecting the log message of each change. Then, the algorithm traces bug-inducing changes back from the bug-fixing changes by examining the change history of the files modified by those bug-fixing changes.
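   A simplified sketch of the two SZZ steps is shown below, using plain git commands through subprocess. It only scans commit subjects for fix-related keywords and blames, in the fixing commit's parent, the lines that the fix deleted; the keyword list, the lack of root-commit handling, and other details are assumptions and differ from a full SZZ implementation.

```python
import re
import subprocess

FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bug|defect|patch)\b", re.IGNORECASE)

def git(repo, *args):
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

def find_fix_commits(repo):
    """SZZ step 1 (simplified): commits whose message mentions a fix keyword."""
    fixes = []
    for line in git(repo, "log", "--pretty=format:%H%x09%s").splitlines():
        sha, _, subject = line.partition("\t")
        if FIX_KEYWORDS.search(subject):
            fixes.append(sha)
    return fixes

def bug_inducing_commits(repo, fix_commit):
    """SZZ step 2 (simplified): blame, in the fixing commit's parent, the lines
    deleted by the fix, and collect the commits that last touched them."""
    inducing, current_file = set(), None
    for line in git(repo, "diff", "-U0", f"{fix_commit}^", fix_commit).splitlines():
        if line.startswith("--- "):
            current_file = line[6:] if line.startswith("--- a/") else None
        hunk = re.match(r"^@@ -(\d+)(?:,(\d+))? ", line)
        if hunk and current_file:
            start, count = int(hunk.group(1)), int(hunk.group(2) or "1")
            for n in range(start, start + count):
                blame = git(repo, "blame", "-l", "-L", f"{n},{n}",
                            f"{fix_commit}^", "--", current_file)
                inducing.add(blame.split()[0].lstrip("^"))
    return inducing
```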
                                                                            As developers examine the commits that are ordered by the
D. Pre-processing

   1) Sampling Training Data: The number of buggy changes in a source code repository is smaller than the number of clean ones, as shown in the Defect Ratio column in Table I. This can be a serious problem because imbalanced training data could lead to biased prediction results. To deal with this problem, we apply an under-sampling method to the training data. This sampling method randomly removes instances with the majority label until the numbers of instances with the buggy and clean labels are the same.

   2) Log Transformation: When investigating the extracted metric values, which are all non-negative, we noticed that most of them are highly skewed. To make the distribution of the metric values closer to a normal distribution, we apply a logarithmic transformation, log2(x + 1), to all metric values.
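   Both pre-processing steps can be sketched in a few lines of NumPy; the buggy/clean labels are assumed to be encoded as 1/0, and the random seed is a placeholder.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-label instances until the buggy (1) and
    clean (0) classes have the same number of training instances."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    buggy, clean = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (buggy, clean) if len(buggy) <= len(clean) else (clean, buggy)
    kept = np.concatenate([minority, rng.choice(majority, size=len(minority), replace=False)])
    rng.shuffle(kept)
    return X[kept], y[kept]

def log_transform(X):
    """Apply log2(x + 1) to the non-negative metric values to reduce skewness."""
    return np.log2(np.asarray(X, dtype=float) + 1.0)
```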
E. Prediction Model

   More than 30 classification model learners have been used in cross-project defect prediction research [7]. Because different model learners work better for different datasets, we choose three popular classification model learners that have been used in other defect prediction papers: random forest (RF) [11], logistic regression (LR) [4], and naive Bayes (NB) [15].

   These model learners build a classification model from training instances. An instance consists of the change metric values of a commit and a label indicating whether the commit induces any defect. When the change metric values of a new commit are given to a classification model, the model returns either the probability that the commit induces a defect, the so-called defect-proneness, or a binary classification of whether the commit is buggy or clean. New commits that need to be inspected for quality assurance can then be prioritized by their buggy probability so that suspicious commits are checked first. Since we use cost effectiveness as our performance measure, as explained in the next sub-section, we use cost-aware defect prediction models that consider the cost of investigating a commit: whereas a general defect prediction model returns the defect-proneness, a cost-aware defect prediction model returns the defect-proneness divided by the number of lines of code [16].
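   The cost-aware use of a learner described above can be sketched as follows with scikit-learn. The random forest settings and the use of the commit's changed lines as the LOC denominator are assumptions; the section only states that the defect-proneness is divided by the number of lines of code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_commits_cost_aware(X_train, y_train, X_test, loc_test):
    """Train one of the studied learners (random forest here), then rank the
    new commits by defect-proneness divided by their size in LOC so that
    small but suspicious commits are inspected first."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    proneness = model.predict_proba(X_test)[:, 1]                 # P(buggy)
    score = proneness / np.maximum(np.asarray(loc_test, dtype=float), 1.0)
    return np.argsort(-score)                                     # riskiest first
```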
F. Performance Measures

   Many previous cross-project defect prediction studies [7] have adopted precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC_ROC) as performance measures, which are widely used in prediction problems. These performance measures indicate how many testing instances are predicted correctly by a classification model. However, they do not consider the effort required for QA testers to inspect the instances predicted as defective, which makes them less practical for QA testers [17].

   Instead, we use the area under the cost-effectiveness curve (AUC_CE) [18] as the performance measure in this experiment. This measure considers the effort to investigate the source code, which enables a more practical evaluation.

   As developers examine the commits ordered by defect-proneness one by one, the total number of LOC investigated and the total number of defects found increase. The cost-effectiveness curve, which is a monotonic function, plots the cumulative percentage of LOC investigated on the horizontal axis and the cumulative percentage of defects found on the vertical axis. Thus, a higher AUC_CE value can be achieved if defect-inducing commits that changed only a small amount of source code are investigated earlier.




                                                                     TABLE I
                                                    STATISTICS ABOUT THE SIZE OF THE FIVE PROJECTS



                      # of                   Avg. # of Contributed                                     # of Commits
 Project      Sub-projects   Defect Ratio    Sub-projects per Dev.         Sum       Mean       Std.   Min.     25%      50%       75%      Max.
 Android                45           0.12                     2.68       248860    5530.22   7751.91   53.0 568.00     3070.0   6732.00   39449.0
 Appium                 32           0.21                     1.79        18034     563.56   1108.50   51.0 127.50      253.0    541.00    6326.0
 Cordova                38           0.16                     2.60        23551     619.76    808.88   52.0 197.25      330.5    618.75    3494.0
 Maven                  69           0.19                     6.50        43703     633.38   1288.80   51.0 187.00      296.0    614.00   10344.0
 Tizen                  48           0.21                     2.09         9857     205.35    518.24   52.0    77.25    116.0    153.00    3677.0



                               TABLE II
                   DESCRIPTIONS OF CHANGE METRICS

    Name      Description
    NS        total # of changed subsystems
    ND        total # of changed directories
    NF        total # of changed files
    Entropy   distribution of modified code across files
    LA        total # of added code lines
    LD        total # of deleted code lines
    LT        average # of code lines before the change
    FIX       whether the change log contains a fix-related keyword
    NDEV      average # of developers who touched a file so far
    AGE       average # of days passed since the last modification
    NUC       total # of unique changes
    EXP       # of commits an author made in a sub-project
    SEXP      # of commits an author made in a subsystem
    REXP      # of commits an author made in a sub-project, weighted by commit time

   Although AUC_CE values are always between 0 and 1, the maximum and minimum achievable AUC_CE values can differ across model learners and sub-projects, so it is difficult to understand the overall performance of defect prediction models from the raw values. Thus, we normalize the AUC_CE value of a prediction model by dividing it by the AUC_CE value of the corresponding within-sub-project model. This percentage of the within-sub-project model's AUC_CE (%WSP_AUCCE) shows the cost effectiveness a JIT prediction model achieves compared to the within-sub-project model.
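   A minimal sketch of AUC_CE and its normalization, following the definitions above, is given below; commits are traversed in decreasing order of the model's score, and slices without any buggy commit are assumed to have been discarded beforehand.

```python
import numpy as np

def auc_ce(loc, is_buggy, score):
    """Area under the cost-effectiveness curve: x is the cumulative fraction
    of LOC inspected, y the cumulative fraction of defects found, with
    commits inspected in decreasing order of score."""
    order = np.argsort(-np.asarray(score, dtype=float))
    loc = np.asarray(loc, dtype=float)[order]
    bug = np.asarray(is_buggy, dtype=float)[order]
    x = np.concatenate(([0.0], np.cumsum(loc) / loc.sum()))
    y = np.concatenate(([0.0], np.cumsum(bug) / bug.sum()))
    return float(np.trapz(y, x))

def wsp_auc_ce(auc_ce_of_model, auc_ce_of_within_sub_project_model):
    """%WSP_AUCCE: the AUC_CE of a model relative to the within-sub-project
    model built for the same target sub-project."""
    return auc_ce_of_model / auc_ce_of_within_sub_project_model
```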
                          III. EXPERIMENTS

A. RQ1: Is the cross-sub-project model trained with sub-projects from the same project more cost effective than the one trained with sub-projects from other projects?

   1) Design: We evaluate the median %WSP_AUCCE of the two cross-sub-project models across all sub-projects. The first model, the within-project model, is built with commits from sub-projects that belong to the same project as the target sub-project. The other, the cross-project model, is constructed with commits from the sub-projects of the other four projects. The goal of both cross-sub-project models—the within-project model and the cross-project model—is to predict defect-inducing commits within the target sub-project. Fig. 1 shows which training data are selected for the two cross-sub-project models and the within-sub-project model: a rectangle represents the commit data of a sub-project, a rounded rectangle represents a project, and the sub-projects within a dashed line are selected as training data for each JIT defect prediction model.

   In the process of building a JIT defect prediction model, especially when applying the under-sampling method, there is randomness that yields non-deterministic experimental results. Thus, we repeated the experiments 30 times to minimize the effect of randomness. In addition, we statistically tested whether the cost effectiveness of the two models is significantly different. We used the Wilcoxon signed-rank test [19] because the %WSP_AUCCE values are paired between the two models and their distribution is not normal. We also calculated the effect sizes of the Wilcoxon signed-rank tests to quantify the %WSP_AUCCE difference between the two models.
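   The statistical comparison can be sketched as follows with SciPy. The Wilcoxon signed-rank test follows the description above; the matched-pairs rank-biserial correlation used as the effect size is an assumption, since the exact effect-size statistic is not spelled out here.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

def compare_paired_models(wsp_within, wsp_cross):
    """Paired, non-parametric comparison of %WSP_AUCCE values obtained by the
    within-project and cross-project models on the same target sub-projects."""
    within = np.asarray(wsp_within, dtype=float)
    cross = np.asarray(wsp_cross, dtype=float)
    _, p_value = wilcoxon(within, cross)
    # Effect size: matched-pairs rank-biserial correlation (an assumption).
    diff = within - cross
    diff = diff[diff != 0]
    ranks = rankdata(np.abs(diff))
    r_plus, r_minus = ranks[diff > 0].sum(), ranks[diff < 0].sum()
    effect_size = (r_plus - r_minus) / (r_plus + r_minus)
    return p_value, effect_size
```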
   2) Result: With 41,760 experimental results (3 model learners × 2 cross-sub-project models × 232 sub-projects × 30 repetitions), Table III shows the median %WSP_AUCCE value of the two models for each project and model learner. The Diff. row reports the difference in the median %WSP_AUCCE between the two models, and the number in parentheses is the effect size. Stars next to the effect size indicate a statistically significant difference between the performance measures of the two models: three, two, and one star represent statistical significance at the 99% (α = 0.01), 95% (α = 0.05), and 90% (α = 0.1) confidence levels, respectively.

   As shown in Table III, the within-project model outperformed the cross-project model in terms of the median %WSP_AUCCE value, except when the logistic regression and naive Bayes learners are used in the Android project and when logistic regression is used in the Cordova project. Although the cross-project models are built with more training data than the within-project models, their cost effectiveness is generally lower than that of the within-project models. This is evidence that more training data does not necessarily mean better performance. These results show that, when applying cross-sub-project JIT defect prediction, an organization may not need to collect change data from other projects, because prediction models built with those data may not perform better than those built with change data from sub-projects within the project.




Fig. 1. Training data of within- and cross-project models.

                                    TABLE III
      MEDIAN %WSP_AUCCE VALUE OF WITHIN- AND CROSS-PROJECT MODELS ACROSS FIVE
                        PROJECTS AND THREE MODEL LEARNERS

                                  Model Learner
  Project   Model     LR                NB                RF
  Android   Cross     1.00              1.14              0.94
            Within    0.99              1.00              1.01
            Diff.     -0.01 (-0.17**)   -0.14 (-1.00**)   +0.06 (+0.54**)
  Appium    Cross     1.02              0.95              1.01
            Within    1.05              1.03              1.02
            Diff.     +0.02 (+0.77**)   +0.09 (+0.89**)   +0.01 (+0.08*)
  Cordova   Cross     1.07              0.88              1.02
            Within    1.06              1.02              1.06
            Diff.     -0.01 (+0.02)     +0.15 (+0.99**)   +0.04 (+0.5***)
  Maven     Cross     1.00              0.77              0.96
            Within    1.11              1.05              1.11
            Diff.     +0.11 (+0.98**)   +0.27 (+1.00**)   +0.14 (+0.96**)
  Tizen     Cross     0.80              0.77              0.88
            Within    1.10              1.03              1.08
            Diff.     +0.30 (+0.99**)   +0.26 (+0.98**)   +0.20 (+0.96**)
  All       Cross     0.99              0.90              0.96
            Within    1.06              1.02              1.05
            Diff.     +0.07 (+0.65**)   +0.13 (+0.65**)   +0.09 (+0.76**)

   It is possible that filtering the training data may increase the cost effectiveness of the cross-sub-project JIT prediction model. In this paper, however, we focus on the feasibility of using sub-project data from within the project; further data utilization is left for future work.

   Another finding from the results is that the within-project models appear to have higher cost effectiveness than the within-sub-project models. Across all projects except Android and for all three model learners, the within-project models achieved a median %WSP_AUCCE value higher than 1.0, meaning that more than half of the within-project models achieved a higher AUC_CE than the within-sub-project models. This result differs from that of Kamei et al. [11]: in their work, which was conducted at the project level, almost no cross-project JIT prediction model achieved a higher AUC_ROC than the within-project model. A possible reason for the difference is the number of commits per sub-project. Table I shows that, except in the Android project, where the mean number of commits per sub-project is up to 27 times higher than in the other projects, the median commit count (50% column) of the sub-projects in each project is near 500, which is much smaller than the number of commits per project (Sum column). We conducted additional experiments to confirm whether cross-sub-project models can be beneficial to sub-projects with a small number of commits, as motivated in Section I. The Spearman correlation coefficient between the number of commits in a target sub-project and the %WSP_AUCCE value of its within-project model is -0.379, indicating a negative relationship. This means that the fewer commits a sub-project has, the greater the improvement in AUC_CE that its cross-sub-project models can achieve.

B. RQ2: Which sub-project ranking method performs best in filtered-within-project JIT defect prediction?

   1) Design: We saw in Section III-A2 that a smaller amount of training data can lead to better prediction performance. Thus, instead of using all the sub-project data within a project to build a cross-sub-project JIT defect prediction model, we can filter out sub-projects that are less helpful for defect prediction to improve the cost effectiveness of the model. Fig. 2 shows how the training data of such models are selected. First, we calculate a score for each sub-project by using a ranking method. Then, we rank the sub-projects by their scores and choose the top N sub-projects as training data for the filtered-within-project model. We use two sub-project similarity-based ranking methods and two baseline ranking methods for this research question. The first is the developer similarity, which is our proposed method. The developer similarity is the number of developers who made any commit in both sub-projects; sub-projects that share more contributing developers with the target sub-project achieve higher similarity. As can be seen from the average number of sub-projects contributed to per developer in Table I, people tend to contribute to various sub-projects within a project. The second is the domain-agnostic similarity used in the work by Kamei et al. [11]. The domain-agnostic similarity between the target sub-project and another sub-project is calculated as follows: 1) calculate the Spearman correlation between the label values, which are 1 if a commit introduced any defect and 0 otherwise, and the values of each metric in the other sub-project; 2) select the three metrics that achieved the highest Spearman correlation values; 3) calculate the Spearman correlation between each unordered pair of the selected metrics for each sub-project, which generates a three-dimensional vector (one dimension per pair) for each sub-project; and 4) calculate the domain-agnostic similarity as the Euclidean distance between the two vectors, where a smaller distance represents greater similarity.
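   The two similarity measures can be sketched as follows. The developer similarity follows the definition above; the domain-agnostic similarity follows the four steps above, with pandas and SciPy used for the Spearman correlations. The column names and DataFrame layout are assumptions.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import spearmanr

def developer_similarity(devs_target: set, devs_other: set) -> int:
    """Proposed measure: number of developers who committed to both the
    target sub-project and the candidate sub-project (higher is more similar)."""
    return len(devs_target & devs_other)

def domain_agnostic_similarity(target: pd.DataFrame, other: pd.DataFrame,
                               metric_cols, label_col="buggy") -> float:
    """Domain-agnostic similarity of Kamei et al., restated for sub-projects;
    a smaller returned distance means greater similarity."""
    # 1) Spearman correlation between each metric and the buggy label (0/1),
    #    computed on the candidate sub-project.
    label_corr = {}
    for m in metric_cols:
        rho, _ = spearmanr(other[m], other[label_col])
        label_corr[m] = rho
    # 2) The three metrics with the highest correlation to the label.
    top3 = sorted(label_corr, key=label_corr.get, reverse=True)[:3]
    # 3) Pairwise Spearman correlations of the selected metrics give a
    #    three-dimensional vector per sub-project.
    def vec(df):
        return np.array([spearmanr(df[a], df[b])[0] for a, b in combinations(top3, 2)])
    # 4) Euclidean distance between the two vectors.
    return float(np.linalg.norm(vec(target) - vec(other)))
```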
                                                                        the two vectors. A smaller distance represents greater similar-
B. RQ2: Which sub-project ranking method performs best in               ity.
filtered-within-project JIT defect prediction?                             Two baseline ranking methods we used are random and
   1) Design: We see from Section III-A2 that a small amount            size rank. Random rank ranks the sub-projects randomly as a
of training data could lead to better prediction performance.           dummy baseline, and size rank ranks the sub-projects by the
Thus, instead of using all the sub-project data within a project        number of commits. The more commits a sub-project has, the
to build a cross-sub-project JIT defect-prediction model, we            higher the rank it achieves. Size ranking is comparable with
can filter out sub-projects that are less helpful in defect             our proposed ranking method because our method is correlated
prediction to improve cost effectiveness of the JIT defect-             with the size of the sub-project.
prediction model. Fig. 2 shows how the training data of such               We build cross-sub-project JIT defect prediction models for
models are selected. First, we calculate a score for each               a target sub-project with commits from 1, 3, 5, 10, and 20
sub-project by using a ranking method. Then, we rank sub-               highest ranked sub-projects for each ranking method. Similar




   2) Result: As Table IV shows, when only one sub-project is selected to build a filtered-within-project JIT defect prediction model, the developer similarity-based ranking outperforms the domain-agnostic similarity-based and random rankings for all three learners and outperforms the size-based ranking for two learners. This result shows that a similarity measure designed for sub-project-level JIT defect prediction can be preferred over a similarity measure originally designed for cross-project JIT defect prediction when picking a single sub-project to build a cross-sub-project JIT defect prediction model.

   As a prediction model is built with more sub-projects, the median cost effectiveness generally increases and the differences in the median %WSP_AUCCE value between the four ranking methods become smaller. When we conducted experiments with more than 20 sub-projects selected for each model, the cost effectiveness of the various models became almost the same, so we do not include those results in the table. This may be because, as more sub-projects are selected to build the filtered-within-project JIT defect prediction models, the models are trained with increasingly similar sets of training data.

   We additionally refined our developer similarity-based ranking method in two ways: by not counting developers who barely contributed to the sub-project, and by normalizing the similarity value by the total number of developers who contributed to the other sub-project. Since filtering developers did not succeed in selecting similar sub-projects that improve the performance measures, we do not include a table for it. Table V compares the performance measures with and without normalization by the number of developers in the other sub-project. The normalization improved the performance measures greatly when the naive Bayes learner is used and 3 to 5 sub-projects are selected; the improved performance measures even exceed the median %WSP_AUCCE value of the within-project models, which can be found in the "All" rows of Table III.
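   A sketch of the normalized variant evaluated in Table V is shown below; the direction of the normalization, by the developer count of the candidate sub-project, follows the description above.

```python
def normalized_developer_similarity(devs_target: set, devs_other: set) -> float:
    """Developer similarity normalized by the total number of developers who
    contributed to the candidate (other) sub-project."""
    if not devs_other:
        return 0.0
    return len(devs_target & devs_other) / len(devs_other)
```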
   For this research question, we see that the developer similarity-based cross-sub-project JIT defect prediction model is preferable to those based on the other ranking methods. However, we notice that the median values of the performance measure achieved with the filtered-within-project models (Table IV) are smaller than those achieved with the within-project models (Table III). Still, the additional experiments provide evidence that there is room for improving the cost effectiveness of filtered-within-project models beyond that of within-project models.

Fig. 2. Training data of filtered-within-project models.

                          IV. RELATED WORK

A. JIT Defect Prediction

   Most studies on defect prediction have focused on predicting the defectiveness of software modules—files, packages, or functions—by utilizing project history data [20]. Recently, some research has been performed on just-in-time defect prediction, which predicts software changes that may introduce defects.

   Kamei et al. conducted experiments applying just-in-time defect prediction to six large open-source projects and five large commercial projects [4]. To build the defect prediction models, they used 14 change metrics in five different dimensions—diffusion, size, purpose, history, and experience—and used the logistic regression learner. These change metrics are widely used in other just-in-time defect prediction studies [21], [22]. Kim et al. proposed a change-level defect prediction method and tested it on 12 open-source software projects [5]. They used a set of text-based metrics extracted from the source code, log messages, and file names. In addition, they used metadata such as the change author and commit time, and they considered changes in the complexity of the source code caused by commits. There are many other just-in-time defect prediction studies, such as applying deep learning [21] or unsupervised models [22] to JIT defect prediction. However, applying just-in-time defect prediction to projects that consist of sub-projects has not yet been discussed.

B. Cross-Project Defect Prediction

   Zimmermann et al. defined 30 project characteristics and showed the influence of the similarity of each characteristic between the target and predictor projects on module-level cross-project defect prediction [23]. They considered the project domain, programming language, company, quality assurance tools, and other aspects as project characteristics and concluded that the characteristics of the project that transfers a defect prediction model can influence precision, recall, and accuracy in cross-project defect prediction. However, we notice that this finding is barely applicable to cross-sub-project defect prediction, as sub-projects within a project share many common characteristics; for instance, sub-projects within the same project are usually developed by the same company using the same programming language and development tools. He et al. [8] investigated the feasibility of cross-project defect prediction at the module level on 10 open-source projects. They concluded that, in the best cases, a model trained with data from other projects can achieve higher accuracy than one trained with data from the target project. In addition, they reported that selecting the training dataset while considering the distributional characteristics of the datasets can lead to better cross-project results.




                                                            TABLE IV
      MEDIAN %WSP_AUCCE VALUE OF VARIOUS FILTERED-WITHIN-PROJECT MODELS ACROSS FIVE PROJECTS AND THREE MODEL LEARNERS


            Model      Ranking                                           # of Sub-projects Selected
            Learner    Method            1                      3                      5                       10
            LR         Developer         1.02                   1.04                   1.05                    1.06
                       Domain-agnostic   1.01(-0.02, -0.26**)   1.04(-0.01, -0.13**)   1.05(-0.00, -0.05**)    1.05(-0.01, -0.05**)
                       Random            0.99(-0.04, -0.38**)   1.03(-0.02, -0.30**)   1.04(-0.01, -0.22**)    1.05(-0.01, -0.18**)
                       Size              1.04(+0.02, +0.30**)   1.05(+0.01, -0.13**)   1.05(+0.00, -0.15**)    1.06(-0.00, -0.20**)
            NB         Developer         1.03                   1.03                   1.03                    1.03
                       Domain-agnostic   1.01(-0.02, -0.25**)   1.04(+0.01, +0.19**)   1.04(+0.01, +0.24**)    1.04(+0.01, +0.24**)
                       Random            1.01(-0.02, -0.26**)   1.03(+0.00, -0.01)     1.03(+0.00, +0.04**)    1.03(+0.00, +0.13**)
                       Size              1.00(-0.03, -0.33**)   1.01(-0.02, -0.47**)   1.01(-0.02, -0.46**)    1.01(-0.02, -0.30**)
            RF         Developer         0.98                   1.01                   1.02                    1.03
                       Domain-agnostic   0.96(-0.02, -0.14**)   1.01(-0.00, -0.09**)   1.02(-0.00, -0.09**)    1.03(+0.00, -0.01)
                       Random            0.94(-0.04, -0.29**)   0.99(-0.02, -0.30**)   1.01(-0.02, -0.29**)    1.02(-0.01, -0.20**)
                       Size              0.96(-0.02, -0.18**)   0.98(-0.03, -0.44**)   0.99(-0.03, -0.45**)    1.02(-0.02, -0.32**)


                                                         TABLE V
  MEDIAN %WSP_AUCCE VALUE OF NORMALIZED DEVELOPER SIMILARITY-BASED FILTERED-WITHIN-PROJECT MODELS ACROSS FIVE PROJECTS AND
                                                  THREE MODEL LEARNERS


             Model     Developer                                          # of Sub-projects Selected
             Learner   Similarity        1                      3                       5                      10
             LR        Not Normalized    1.02                   1.04                    1.05                   1.06
                       Normalized        0.95(-0.08, -0.62**)   1.02(-0.03, -0.46**)    1.03(-0.02, -0.38**)   1.05(-0.01, -0.32**)
             NB        Not Normalized    1.03                   1.03                    1.03                   1.03
                       Normalized        0.99(-0.04, -0.30**)   1.06(+0.03, +0.22**) 1.06(+0.03, +0.24**)      1.04(+0.02, +0.25**)
             RF        Not Normalized    0.98                   1.01                    1.02                   1.03
                       Normalized        0.92(-0.06, -0.43**)   0.98(-0.03, -0.35**)    1.00(-0.03, -0.34**)   1.02(-0.01, -0.22**)



However, such distributional characteristics cannot be extracted for projects in which no historical data exist, which hinders the application of distributional characteristics in cross-project defect prediction. In addition, neither Zimmermann et al. nor He et al. considered JIT defect prediction or the cost effectiveness of the defect prediction model.

   Kamei et al. [11] conducted cross-project defect prediction experiments in a change-level manner. They observed that predicting defects for changes in a target project with a model built from another project's data has lower accuracy than a within-project model. However, when the prediction is done with a model built from a larger pool of training data from multiple projects, or by combining the prediction results of multiple cross-project models, its performance is indistinguishable from that of a within-project model. They also applied domain-aware and domain-agnostic similarity measures between two projects to select a similar project. For the domain-agnostic type, they calculated the Spearman correlations between the metric values within a dataset and used the correlation values to find a similar project. For the domain-aware type, they used a subset of the project characteristics proposed by Zimmermann et al. [23] and calculated the Euclidean distance to find a similar project. When these similarity measures are used to pick one project whose JIT defect prediction model is transferred to predict defects in the target project, they concluded that both measures successfully contributed to picking a better-than-average cross-project model. However, when they built a cross-project JIT defect prediction model with training data from multiple similar projects, it barely improved accuracy over a model trained with data from all other projects without filtering out irrelevant projects. This work showed that cross-project defect prediction is feasible for JIT defect prediction, but sub-project-level JIT defect prediction was not discussed, and the cost of investigating commits to find defects was not considered in the evaluation.

                       V. THREATS TO VALIDITY

A. Construct Validity

   For our experiments, we implemented Python scripts to extract the change metric data from the source code repositories and to build and test the JIT defect prediction models. However, the scripts might have defects that affect the experiments and results. To reduce this threat, we used open-source frameworks and libraries that are commonly used in other studies, such as scikit-learn [24]. In addition, we double-checked our source code and manually inspected the extracted change measures for validation.

B. Dataset Quality

   We used CodeRepoAnalyzer by Rosen et al. [25] to extract change metrics from git repositories. While using this tool, we noticed that it has some bugs; for instance, some extracted metric values were marked as negative, which is incorrect by definition. Although we handled the bugs we found in this tool, there could be other bugs that were not found and that could have affected the extracted values.




   Although the SZZ algorithm is widely used in JIT defect prediction research [4], [12], the accuracy of keyword-based labeling methods for bug-inducing commits is known to be limited [26]. We may improve the accuracy of the automatic labeling by utilizing bug-repository data [13].

                          VI. CONCLUSION

   In this paper, we investigated the feasibility of transferring JIT defect prediction models built with data from other sub-projects to predict bug-inducing commits in a target sub-project. We conducted experiments with five projects, which comprise 232 sub-projects in total, and three different model learners. Based on our two research questions, we conclude that 1) a cross-sub-project model has better cost effectiveness than a within-sub-project model in general; 2) a cross-sub-project JIT defect prediction model built with data from sub-projects within the same project as the target has higher cost effectiveness than one built with data from all available sub-projects; 3) the developer similarity-based ranking method is preferable for filtering out sub-projects that are irrelevant to the target sub-project; and 4) although a developer similarity-based filtered-within-project model has lower cost effectiveness than a within-project model in general, we further improved the performance of the filtered-within-project model so that it outperforms the within-project model in the best cases. Our contributions include 1) proposing defect prediction at the sub-project level that, using the new developer similarity-based filtered-within-project models, can potentially achieve better cost effectiveness than traditional within-project models; and 2) an initial evaluation of the cost effectiveness of various sub-project-level JIT defect prediction models across 232 sub-projects. In future work, we plan to investigate more refined ways of applying filtered-within-project models, such as filtering developers by considering their contributions over various project resources [27] before calculating the developer similarity.

                           REFERENCES

 [1] L. Guo, Y. Ma, B. Cukic, and H. Singh, "Robust prediction of fault-proneness by random forests," in Software Reliability Engineering, 2004. ISSRE 2004. 15th International Symposium On. IEEE, 2004, pp. 417–428.
 [2] J. C. Munson and T. M. Khoshgoftaar, "The detection of fault-prone programs," IEEE Transactions on Software Engineering, vol. 18, no. 5, pp. 423–433, May 1992.
 [3] R. Moser, W. Pedrycz, and G. Succi, "A Comparative Analysis of the Efficiency of Change Metrics and Static Code Attributes for Defect Prediction," in Proceedings of the 30th International Conference on Software Engineering, ser. ICSE '08. New York, NY, USA: ACM, 2008, pp. 181–190.
 [4] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi, "A large-scale empirical study of just-in-time quality assurance," IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 757–773, 2013.
 [5] S. Kim, E. J. Whitehead, Jr., and Y. Zhang, "Classifying Software Changes: Clean or Buggy?" IEEE Transactions on Software Engineering, vol. 34, no. 2, pp. 181–196, Mar. 2008.
 [6] A. Mockus and D. M. Weiss, "Predicting risk of software changes," Bell Labs Technical Journal, vol. 5, no. 2, pp. 169–180, Apr. 2000.
 [7] S. Herbold, "A systematic mapping study on cross-project defect prediction," arXiv:1705.06429 [cs], May 2017.
 [8] Z. He, F. Shu, Y. Yang, M. Li, and Q. Wang, "An investigation on the feasibility of cross-project defect prediction," Automated Software Engineering, vol. 19, no. 2, pp. 167–199, Jun. 2012.
 [9] E. Shihab, Y. Kamei, and P. Bhattacharya, "Mining Challenge 2012: The Android Platform," in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, ser. MSR '12. Piscataway, NJ, USA: IEEE Press, 2012, pp. 112–115.
[10] M. A. Babyak, "What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models," Psychosomatic Medicine, vol. 66, no. 3, pp. 411–421, May–Jun. 2004.
[11] Y. Kamei, T. Fukushima, S. McIntosh, K. Yamashita, N. Ubayashi, and A. E. Hassan, "Studying just-in-time defect prediction using cross-project models," Empirical Software Engineering, vol. 21, no. 5, pp. 2072–2106, Oct. 2016.
[12] T. Jiang, L. Tan, and S. Kim, "Personalized Defect Prediction," in Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 279–289.
[13] J. Śliwerski, T. Zimmermann, and A. Zeller, "When Do Changes Induce Fixes?" in Proceedings of the 2005 International Workshop on Mining Software Repositories, ser. MSR '05. New York, NY, USA: ACM, 2005, pp. 1–5.
[14] X. Yang, D. Lo, X. Xia, and J. Sun, "TLEL: A two-layer ensemble learning approach for just-in-time defect prediction," Information and Software Technology, vol. 87, pp. 206–220, Jul. 2017.
[15] B. Turhan, T. Menzies, A. B. Bener, and J. D. Stefano, "On the relative value of cross-company and within-company data for defect prediction," Empirical Software Engineering, vol. 14, no. 5, pp. 540–578, Oct. 2009.
[16] T. Mende and R. Koschke, "Effort-Aware Defect Prediction Models," in 2010 14th European Conference on Software Maintenance and Reengineering, Mar. 2010, pp. 107–116.
[17] Y. Kamei and E. Shihab, "Defect Prediction: Accomplishments and Future Challenges," in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 5, Mar. 2016, pp. 33–45.
[18] T. Mende and R. Koschke, "Revisiting the Evaluation of Defect Prediction Models," in Proceedings of the 5th International Conference on Predictor Models in Software Engineering, ser. PROMISE '09. New York, NY, USA: ACM, 2009, pp. 7:1–7:10.
[19] F. Wilcoxon, "Individual Comparisons by Ranking Methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
[20] J. Nam, "Survey on software defect prediction," Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Tech. Rep., 2014.
[21] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun, "Deep Learning for Just-in-Time Defect Prediction," in 2015 IEEE International Conference on Software Quality, Reliability and Security, Aug. 2015, pp. 17–26.
[22] W. Fu and T. Menzies, "Revisiting Unsupervised Learning for Defect Prediction," arXiv:1703.00132 [cs], Feb. 2017.
[23] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, "Cross-project Defect Prediction: A Large Scale Experiment on Data vs. Domain vs. Process," in Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ser. ESEC/FSE '09. New York, NY, USA: ACM, 2009, pp. 91–100.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, and D. Cournapeau, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[25] C. Rosen, B. Grawi, and E. Shihab, "Commit Guru: Analytics and Risk Prediction of Software Commits," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2015. New York, NY, USA: ACM, 2015, pp. 966–969.
[26] T. Hall, D. Bowes, G. Liebchen, and P. Wernick, "Evaluating Three Approaches to Extracting Fault Data from Software Change Repositories," in Product-Focused Software Process Improvement, ser. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, Jun. 2010, pp. 107–115.
[27] G. Gousios, E. Kalliamvakou, and D. Spinellis, "Measuring Developer Contribution from Software Repository Data," in Proceedings of the 2008 International Working Conference on Mining Software Repositories, ser. MSR '08. New York, NY, USA: ACM, 2008, pp. 129–132.




Copyright © 2018 for this paper by its authors.