      The Relation between Software Maintainability and
         Issue Resolution Time: A Replication Study

                         Joren Wijnmaalen                                         Cuiting Chen
                      University of Amsterdam                             Software Improvement Group
                   Amsterdam, The Netherlands                             Amsterdam, The Netherlands
                   j.wijnmaalen@protonmail.com                                   c.chen@sig.eu

                          Dennis Bijlsma                                      Ana-Maria Oprescu
                   Software Improvement Group                              University of Amsterdam
                   Amsterdam, The Netherlands                             Amsterdam, The Netherlands
                         d.bijlsma@sig.eu                                     a.m.oprescu@uva.nl



                        Abstract

    Higher software maintainability comes with certain benefits. For example, software can be
    updated more easily to embrace new features or to fix bugs. Previous research has shown that
    there is a positive correlation between the maintainability score measured by the SIG
    maintainability model and shorter issue resolution time. This study, however, dates back to
    2010. Eight years later, the software industry has evolved at a fast pace, as has the SIG
    maintainability model. We would like to rerun the experiment to test if the previously found
    relations are still valid.

    When remeasuring the maintainability of the systems with the new version of the SIG
    maintainability model (2018), we find that the majority of the systems score lower
    maintainability ratings. The overall maintainability correlation with defect resolution time
    decreased significantly, while the original system properties correlate with defect
    resolution time similarly to the original study.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
In: Anne Etien (eds.): Proceedings of the 12th Seminar on Advanced Techniques and Tools for
Software Evolution, Bolzano, Italy, July 8-10 2019, published at http://ceur-ws.org


1    Introduction

The definition for software quality has been standardized by the International Organization for
Standardization (ISO) since 2001 in their document ISO 9126 [ISO11b]. Since then, the definition
has undergone a variety of changes as it has been revised into the ISO 25010 in 2011 [ISO11a].
The standard decomposes software quality into a set of characteristics. Software maintainability
is one of such characteristics.
   Research has shown the importance of high software maintainability. Bakota et al. found an
exponential relationship between maintainability and cost [BHL+12]. Bijlsma and Luijten showed a
strong positive correlation between software maintainability and issue resolution time [BFLV12].
Maintenance activities largely involve solving issues that arise during development or when the
product is in use. A more maintainable code base decreases the amount of time needed to resolve
such issues.
   However, the study by Bijlsma and Luijten dates back to 2012. To assess the maintainability of
systems, they made use of the maintainability model developed by the Software Improvement Group
(SIG), dating back to 2010. This model refers to ISO 9126 for its definition of software quality,
more specifically, maintainability. Over the years the SIG maintainability model has been
evolving (a new model was announced in 2018 [sig]), implementing a variety of smaller changes
along the new software quality definition as documented in ISO 25010. Furthermore, the software
industry has been evolving at a fast pace. The oldest systems Bijlsma and Luijten assessed for
their empirical results date back to the beginning of
the 2000s. A lot has changed in the software industry since then. Both the landscape of
technologies has changed, as well as the processes around software. For example, DevOps has
emerged since the mid 2010s, introducing concepts such as continuous integration and delivery.
These concepts potentially change the way issues are resolved, as integration is largely
automated instead of being a manual action.
   Around the broader question "What is the relation between software maintainability and issue
resolution time?", we propose the following research question:

    • RQ1.1 Does the previously found strong correlation between maintainability and issue
      resolution time still hold given the latest (2018) SIG maintainability model?

2    Background

2.1    The SIG Maintainability Model

The ISO 25010 standard defines software quality through a range of quality characteristics. Each
of these characteristics is further subdivided into a set of sub-characteristics. Software
maintainability is one of such characteristics and is further subdivided into the following
sub-characteristics: analyzability, modifiability, testability, modularity and reusability. The
standard, however, does not specify how to directly measure the various quality characteristics
and sub-characteristics. Instead, the Software Improvement Group (SIG) provides a pragmatic model
to directly assess maintainability through static analysis of source code [HKV07]. The SIG
maintainability model lists a set of source code metrics, also called software product
properties. The following software product properties are measured: volume, duplication, unit
size, unit complexity, unit interfacing, module coupling, component balance and component
independence. These product properties are then mapped to the sub-characteristics as defined in
the ISO 25010 standard. These mappings, i.e. which product properties influence which
sub-characteristics, are based on expert opinion. Table 1 illustrates these mappings.

Table 1: Relationship between software product properties and the ISO 25010 maintainability
sub-characteristics (each sub-characteristic is listed with the product properties that influence
it). Data taken from SIG/TÜViT Evaluation Criteria Trusted Product Maintainability [Vis18]

    Analyzability:  Volume, Duplication, Unit Size, Component Balance
    Modifiability:  Duplication, Unit Complexity, Module Coupling
    Testability:    Volume, Unit Complexity, Component Independence
    Modularity:     Module Coupling, Component Balance, Component Independence
    Reusability:    Unit Size, Unit Interfacing

   To calculate the maintainability rating of a system, the model first measures the product
properties. These raw measures are converted to a star-based rating based on a benchmark internal
to SIG (1 to 5 stars, where 3 stars is the market average). Note that the stars do not divide the
distribution of systems into even buckets; instead 5% of systems are assigned one star, 30% two,
30% three, 30% four and 5% five stars. Secondly, the product property ratings are aggregated into
the maintainability sub-characteristic ratings based on the relations as defined in Table 1.
Finally, the sub-characteristic ratings are all aggregated into a single final maintainability
rating.
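To make the two-level aggregation concrete, the listing below sketches how the Table 1 mapping
could be turned into sub-characteristic and maintainability ratings. Plain averaging and the
example ratings are illustrative assumptions only; the actual SIG model derives property ratings
from benchmark-calibrated thresholds and applies its own aggregation rules.

    # A minimal sketch of the two-level aggregation described above. The
    # property-to-sub-characteristic mapping follows Table 1; plain averaging and
    # the example ratings are illustrative assumptions, not SIG's actual
    # benchmark-calibrated aggregation.
    from statistics import mean

    # Table 1: which product properties influence which sub-characteristics.
    MAPPING = {
        "analyzability": ["volume", "duplication", "unit_size", "component_balance"],
        "modifiability": ["duplication", "unit_complexity", "module_coupling"],
        "testability": ["volume", "unit_complexity", "component_independence"],
        "modularity": ["module_coupling", "component_balance", "component_independence"],
        "reusability": ["unit_size", "unit_interfacing"],
    }

    def maintainability(property_ratings):
        """Aggregate star ratings (0.5-5.5) of the product properties into
        sub-characteristic ratings and an overall maintainability rating."""
        sub = {name: mean(property_ratings[p] for p in props)
               for name, props in MAPPING.items()}
        sub["maintainability"] = mean(sub.values())
        return sub

    # Example: a hypothetical system scoring 3 stars everywhere except duplication.
    ratings = {p: 3.0 for props in MAPPING.values() for p in props}
    ratings["duplication"] = 1.5
    print(maintainability(ratings))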
Evolution of the SIG Maintainability Model

Both Bijlsma and Luijten assessed the maintainability characteristic of software quality as
described by the ISO 9126 standard using the SIG maintainability model. Since the ISO 9126
standard has been revised into the ISO 25010 standard, the SIG maintainability model has evolved
accordingly, which is part of the motivation for this replication study. In order to reason about
the results of this replication study, the differences between the 'modern' SIG maintainability
model (hereinafter referred to as the new model) and the model used by Bijlsma and Luijten
(hereinafter referred to as the old model) need to be highlighted.
   Compared to ISO 9126, ISO 25010 adds the sub-characteristic modularity to maintainability.
Modularity is defined as "The degree to which a system or computer program is composed of
discrete components such that a change to one component has minimal impact on other components."
[ISO11a]. In order to account for this new sub-characteristic, two new system properties were
introduced in the new model: component balance and component independence. Apart from accounting
for the new sub-characteristic, these properties were expected to stimulate discussions about the
architecture of systems and to incorporate a common viewpoint in the assessment of implemented
architectures, as mentioned by Bouwers et al. in their evaluation of the SIG maintainability
model metrics [BvDV13].
   Introduction of these properties also raises questions on the definition of a component.
Visser defines the term component as follows in his technical report: "A component is a
subdivision of a system in which source code modules are grouped together based on a common
trait. Often components consist of modules grouped together based on a shared technical or
functional aspect" [Vis18]. In practice, this definition still proves too vague. It introduces
the need for an external evaluator to point out the core components of any specific system, based
on their perception of how functionality is grouped and its granularity.

2.2    Issue Resolution Time

Both Luijten and Bijlsma look at issue resolution time in their studies. Bijlsma defines issue
resolution time as "the total time an issue is in open state. [...] Resolution time is not simply
the time between the issue being reported and the issue being resolved." [BFLV12] Instead,
Bijlsma illustrates the life cycle of an issue using Figure 1. Bijlsma measured the highlighted
period of time in the figure for his study. Even though it would seem better to start measuring
when the status of an issue is set to assigned, this was realistically not a possibility for the
data Bijlsma obtained. Many projects Bijlsma analyzed were inconsistent in using the assigned
property in their Issue Tracking Systems (ITS), making it impossible to accurately determine when
a developer started working on an issue.
   Next to the issue resolution time life cycle, there is also the notion of issue types. Various
issue tracking systems use different terms to denote the variety in issues. Bijlsma defined the
following types: defect, enhancement, patch and task. A defect, according to Bijlsma, is a
"problem in the system" [BFLV12]. An enhancement can be "the addition of a new feature, or an
improvement of an existing feature". Tasks and patches are "usually one time activities" and
unify various other issue types with a range of urgencies. The tools Bijlsma and Luijten used in
their experiment normalized all issues obtained from the various ITSs towards these four types
only. Luijten originally focused on issues of type defect, where Bijlsma expanded with issues of
type enhancement.

Figure 1: The issue resolution time (in green) as measured by Bijlsma and Luijten [BFLV12]

2.3    Issue Resolution Quality Ratings

Given the definition for the issue resolution time metric, both Luijten and Bijlsma collected
measurements from various projects. In order to compare these resolution times on a project
level, the resolution times per issue need to be aggregated. Intuitively, statistical properties
such as the mean come to mind. However, as Luijten points out, the resolution times collected are
not normally distributed [Lui10]. Therefore, Luijten created various risk categories, such that
the issue resolution distribution is divided into buckets, choosing thresholds such that the
buckets are filled equally. Table 2 illustrates the risk categories with their thresholds as
defined by Luijten for issues of type defect. Similar thresholds are defined for issues of type
enhancement.
   Based on these categories Luijten continues to define quality ratings, to further align with
the rating system the SIG maintainability model implements. Table 3 shows the mapping between
risk categories and quality ratings. The thresholds are chosen such that 5% of the systems will
receive a 5-star rating, 30% four stars, 30% three, 30% two, and 5% one star (the same
distribution as the SIG maintainability model uses, Section 2.1). For measurement purposes, these
star ratings are interpolated on the interval [0.5, 5.5], as is standard in the SIG
maintainability model as well.

Table 2: Luijten's risk categories for issues of type defect [Lui10]. Lower risk means faster
issue resolution times.

           Category      Threshold (days)
           Low                  0 - 23.6
           Moderate         23.6 - 68.2
           High             68.2 - 198
           Very High              198+

Table 3: Luijten's rating thresholds for issues of type defect [Lui10]. A higher rating means
faster issue resolution times.

           Rating     Moderate     High     Very High
           *****            7%       0%            0%
           ****            25%      25%            2%
           ***             43%      39%           13%
           **              43%      42%           35%
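The listing below sketches how individual defect resolution times could be turned into a
project-level rating under this scheme: it bins the times with the Table 2 thresholds and then
looks up the highest rating whose Table 3 limits are not exceeded. Reading the Table 3 cells as
the maximum allowed share of issues per risk category is our interpretation, and the
interpolation to the [0.5, 5.5] scale is omitted; Luijten's exact procedure may differ.

    # Sketch: from defect resolution times (in days) to a star rating, using the
    # thresholds of Tables 2 and 3. Treating the Table 3 cells as the maximum
    # allowed share of issues per risk category is an assumption made for
    # illustration; the interpolation to [0.5, 5.5] is omitted for brevity.

    RISK_THRESHOLDS = [(23.6, "low"), (68.2, "moderate"), (198, "high")]  # else "very high"

    # star rating -> maximum allowed share of issues in (moderate, high, very high)
    RATING_LIMITS = {
        5: (0.07, 0.00, 0.00),
        4: (0.25, 0.25, 0.02),
        3: (0.43, 0.39, 0.13),
        2: (0.43, 0.42, 0.35),
    }

    def risk_category(days):
        for limit, category in RISK_THRESHOLDS:
            if days <= limit:
                return category
        return "very high"

    def defect_resolution_rating(resolution_times):
        """Return the highest star rating whose limits are not exceeded (else 1)."""
        categories = [risk_category(t) for t in resolution_times]
        profile = tuple(categories.count(c) / len(categories)
                        for c in ("moderate", "high", "very high"))
        for stars in (5, 4, 3, 2):
            if all(share <= limit for share, limit in zip(profile, RATING_LIMITS[stars])):
                return stars
        return 1

    # Example: six defects, most resolved within a few weeks -> two stars.
    print(defect_resolution_rating([2, 5, 14, 30, 45, 220]))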
3    Method

This replication study aims to discover if the relationship between maintainability and issue
resolution times still holds when the new SIG maintainability model is used to assess
maintainability. Therefore, the same systems and issue tracking data are used as in Bijlsma's
experiment. Bijlsma assessed 10 open source systems, looking at multiple snapshots situated at
various points in time of the systems' lifespans. Given the definition of issue resolution time,
and the concept of issue resolution time quality ratings, as described in Section 2.3, ratings
are calculated per snapshot. For each snapshot, Bijlsma and Luijten consider all issues that are
closed and/or resolved between that snapshot and the next as relevant for that snapshot [Lui10].
These ratings, as calculated by Bijlsma, are re-used directly in this experiment, as they were
archived and ready for use.
   Table 4 shows the systems assessed by Bijlsma. Since the original snapshots Bijlsma used in
his study were not archived, the snapshots had to be re-obtained. For every snapshot Bijlsma
listed a version and a date. Using this data we were able to retrieve all snapshots using the
following two methods:

  • Official System Archives Some systems maintain an official archive. Snapshots matching date
    and version number are directly retrieved from these archives. For a small number of
    snapshots the date Bijlsma listed deviates a couple of days from the date coupled with the
    version number in the archive. These snapshots were still retrieved, as it was assumed that
    the deviation in dates was caused by human error.

  • Version Control Systems (VCS) If a system's organization stopped hosting, or does not have
    an archive containing older versions, the snapshot was retrieved by traversing the system's
    respective VCS. The majority of the systems assessed by Bijlsma use Subversion as their main
    VCS. Subversion, by default, contains the root folders '/trunk', '/branches' and '/tags'.
    Where trunk is the directory where the main development takes place and branches contain the
    features developed in parallel to the main development, the tags directory is specifically
    interesting as it contains read-only copies of the source code at a specified point in time.

    Note that Subversion repositories default to the trunk, branches and tags directories but are
    not limited to this structure. For example, Webkit adds the '/releases' directory to the
    repository root, containing all major releases (where tags are of a finer granularity). In
    this case, both the tags and releases are navigated to find the right snapshot.

   In a small number of cases the matching snapshots were not listed in the tags, releases or
other archiving Subversion root directories. In that case, the Subversion trunk directory was
checked out at the revision closest to the date listed by Bijlsma (a sketch of this fallback is
shown below).
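A minimal sketch of this fallback retrieval, assuming the svn command-line client is available;
the repository URL, date and target directory are hypothetical placeholders, not values from
Bijlsma's snapshot list.

    # Sketch: export a Subversion snapshot at the revision closest to a given date.
    # The repository URL, date and target directory are placeholders; svn's
    # "-r {DATE}" syntax resolves to the repository state at that date.
    import subprocess

    def export_snapshot(trunk_url, date, target_dir):
        """Export the trunk as it was on the given date (YYYY-MM-DD) into target_dir."""
        subprocess.run(
            ["svn", "export", "-r", "{%s}" % date, trunk_url, target_dir],
            check=True,
        )

    # Hypothetical example (URL and date are not taken from Bijlsma's actual list):
    # export_snapshot("https://svn.example.org/project/trunk", "2009-06-01", "project-2009-06-01")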
   For every snapshot, new maintainability ratings are calculated. These ratings are obtained
using the SIG software analysis toolkit (SAT). The SAT implements the latest (2018) version of
the SIG maintainability model. It provides the final maintainability rating, along with its
sub-characteristics as described by the ISO 25010 standard.

3.1    Data Acquisition

In order to replicate Bijlsma's original study, the same data is needed. We assume, since the
snapshots were retrieved from the official system archives or version control systems, that the
contents of the retrieved snapshots are the same. The only other metric provided by Bijlsma to
verify this assumption is the size of the snapshot in LOC. Figure 2 compares the snapshot sizes
as found by Bijlsma against the snapshot sizes found by us. It can be immediately seen that the
blue and red graphs do not align. This behaviour is expected, however, as these numbers are
retrieved after the SAT analysis.
   The SAT requires a scoping definition per system in order to function. Scoping is the process
of determining which parts and files of the system should be included in the calculation. For
example, source files in a '/libs/' folder should not be included in the calculation as they are
external dependencies and are not maintained by the development team directly. Ideally, the
scoping per system should be exactly the same as Bijlsma's original scoping when rerunning the
SAT. However, scoping files were not documented in Bijlsma's study, so we had to do our own
scoping. Luckily, Bijlsma provided feedback in person to check that the scoping files were
roughly the same.
   Given the scoping differences, a slight deviation in the lines of code the SAT considers for
the calculation can be explained. We do expect, however, that the newly acquired snapshots follow
the same trend line as the old snapshots. For the majority of the systems listed in Figure 2 this
is the case (e.g. abiword, ant, argouml, checkstyle), but some other systems stand out. Webkit,
for example, only has a third of the original size (in KLOC). The cause for this large difference
remains unknown to us, as running the SAT with all junk files included does not come near the
originally reported size numbers. As such, the newly measured Webkit maintainability numbers
cannot be compared against the old, and this will impact the correlation results.

Table 4: Systems assessed by Bijlsma.

                                                      LOC (latest)
        System               Main Language      Replication      Original    Snapshots
        Abiword                        C++          427,885       415,647            3
        Ant                            Java         101,647       122,130           20
        ArgoUML                        Java         155,559       171,402           20
        Checkstyle                     Java          38,685        38,654           22
        Hibernate-core                 Java         114,953       145,482            5
        JEdit                          Java         109,403       100,159            4
        Spring-framework               Java          98,141       118,833           21
        Subversion                        C         192,620       218,611            5
        Tomcat                         Java         158,881       163,589           19
        Webkit                         C++          425,929     1,255,543            1
        Total                                                                      120

Figure 2: Comparing system sizes per snapshot (KLOC). Each point represents a snapshot of the
given system, ordered by increasing date. Each snapshot is represented by two points: size in
KLOC as found by Bijlsma (red) and by us (blue).

3.2    Method Differences

To summarize, some elements of the original study's method have been kept exactly the same while
other elements have been changed.

Differences

Since the original snapshots were not archived, the snapshots had to be reacquired. This results
in small data inconsistencies for most systems and in large inconsistencies for one system in
particular (Webkit). Additionally, as is the purpose of this study, the new SIG maintainability
model is used to measure maintainability, as opposed to the original SIG maintainability model
dating from 2010.

Equalities

Given the concept of issue resolution time and issue resolution time quality ratings, these
ratings have been re-used directly for all snapshots and systems. The respective issue tracking
systems were not mined again. Furthermore, the correlations are calculated in the same manner.
4    Results

4.1    Comparing Maintainability

Maintainability ratings are directly compared between the old and the new model to gain a better
understanding of how the SIG maintainability model evolved. It also serves as a validation step
to see if the new maintainability ratings are reasonable and within expectation compared to the
old ones. Figure 3 gives a high-level overview of how the maintainability ratings per system
(distributed over all snapshots) compare between the old and the new SIG maintainability model.
We observe that for the majority of the systems (ant through tomcat) the new maintainability
ratings are lower than under the old model. This behaviour is expected because the SAT uses
benchmark data to determine the thresholds of the rating buckets, and this benchmark has been
rising over the years (see Section 5 for further elaboration). Note that, even though plotted,
maintainability cannot be compared for webkit as the reacquired snapshot has over 500 KLOC less
than the original.
   The maintainability rating calculated by the SIG maintainability model is the result of a
double aggregation. Figure 3 adds another aggregation on top of this, combining multiple
maintainability ratings per system into a single boxplot. The figure provides a high-level
overview, but in order to discover the other factors that cause the deviation in maintainability
ratings we need to zoom in on the low-level metrics. Figure 4 illustrates all unit complexity
ratings of all systems ordered by snapshot date. The figure shows how in general the new ratings
follow the same trend as the old ratings, but at a slightly lower rating altogether. Specifically
the systems ant, jedit, tomcat and springframework show this behaviour well. The lower rating can
again be explained by the rising benchmark thresholds. Webkit consistently rates higher for all
system metrics, but these ratings are insignificant due to the large deviation in reacquired
snapshot sizes. ArgoUML consistently shows lower ratings for the original system properties
(without the modularity system properties), but shows a higher rating in overall maintainability
(Figure 3).

4.2    Comparing Correlations

Bijlsma and Luijten classified four types of issues: defect, enhancement, patch and task, and
investigated issues of type defect and enhancement. Tables 5 and 6 illustrate the new
correlations found for these two types of issues. Every correlation is tested for significance,
given the hypotheses H0: ρ = 0 and HA: ρ > 0. For the null hypothesis to be rejected, a
significance threshold of 5% is used.
   Given the new defect correlations, the correlations of the original system properties are
comparable, except for module coupling, which shows a significant drop from 0.55 to 0.36. The
other surprising result is the large drop of maintainability from 0.64 to 0.33. The negative
correlation of modularity is surprising, as it goes against our intuition. Intuitively, modular
systems should be easier to modify than systems with huge, monolithic components. Further, unit
interfacing has vastly decreased in significance, its p-value going from 0.042 to 0.640.
   Table 6 shows the same comparison of correlations as Table 5, but for enhancement resolution
speed. The difference in maintainability correlations is a lot smaller compared to the difference
found in the defect correlations. The modularity correlations also stand out, since both the
coefficients of modularity and component balance cannot be assumed given their p-values being
larger than 0.05. The decrease in significance is specifically interesting compared to the defect
correlations in Table 5.

5    Discussion

5.1    Comparing Maintainability

One explanation for the lower maintainability is the calibration of the benchmark thresholds used
to determine the ratings. Over the years new systems that are measured are added to the SAT
benchmark. The observation is that the distribution of quality ratings in the benchmark shifts,
over time, towards a higher average. In order to compensate for this phenomenon, SIG calibrates
the thresholds for all ratings (both system properties and characteristics) yearly. This means
that the thresholds (for most of the characteristics) have become stricter. This is also
documented in the 'Guidance for Producers' documents, which SIG releases yearly. For example, the
2018 document mentions for unit complexity that "To be eligible for certification at the level of
4 stars, for each programming language used the percentage of lines of code residing in units
with McCabe complexity number higher than 5 should not exceed 21.1%" [Vis18], while the 2017
document states the same but with a threshold of 24.3% [Vis17]. Remeasuring the same systems with
the stricter benchmark thresholds results in overall lower maintainability scores.
   This expected behaviour of lower maintainability ratings is consistent for eight out of ten
systems. The systems Abiword and Webkit stand out as they both score higher compared to the
original rating.
   Webkit can be considered an outlier. The system is represented by a single snapshot that
consistently scores higher for all system properties and aggregated ratings. This may be the
result of the re-acquisition of the snapshot, as the newly obtained snapshot has roughly 500 KLOC
less than documented by Bijlsma.
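To illustrate why a rising benchmark makes ratings stricter, the sketch below derives rating
thresholds from the percentiles of a benchmark metric using the 5/30/30/30/5 distribution from
Section 2.1. The benchmark values are synthetic and the procedure is a simplification of SIG's
actual yearly calibration; it only shows the mechanism by which an unchanged system can lose a
star when the population improves.

    # Sketch: percentile-based threshold calibration. As the benchmark population
    # improves (here: a lower percentage of code in complex units), the threshold
    # for each star rating drops, so an unchanged system can lose a star. The
    # benchmark values are synthetic; SIG's actual calibration is more involved.
    import random

    def calibrate(benchmark):
        """Map star ratings to the maximum metric value allowed for that rating,
        using the 5/30/30/30/5 distribution from Section 2.1 (lower is better)."""
        ordered = sorted(benchmark)
        cumulative = {5: 0.05, 4: 0.35, 3: 0.65, 2: 0.95}  # share of systems rated at least this
        return {stars: round(ordered[int(share * len(ordered)) - 1], 1)
                for stars, share in cumulative.items()}

    random.seed(1)
    benchmark_2017 = [random.uniform(5, 60) for _ in range(400)]   # % of code in complex units
    benchmark_2018 = [max(0.0, v - 3) for v in benchmark_2017]     # the population improved

    print(calibrate(benchmark_2017))  # thresholds derived from the older benchmark
    print(calibrate(benchmark_2018))  # the same thresholds shift downward (stricter)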
Figure 3: Snapshot maintainability distribution per system. Each system contains two boxplots, the maintain-
ability ratings as obtained by Bijlsma (red) and the maintainability ratings obtained by the 2018 version of the
SIG maintainability model (blue), distributed over all the snapshots of the system.




Figure 4: Unit complexity ratings per system. Each point represents a snapshot of the given system, ordered
by increasing date. Each snapshot is represented by two points, unit complexity of the old model (red) and the
new model (blue).




Table 5: Defect resolution time correlations. The Table on the left shows the correlation statistics found by
Bijlsma. The Table on the right shows the correlation statistics as obtained by the replication study.

                       Old Correlations                                    New Correlations

             Defect resolution vs.    ρ          p-value     Defect resolution vs.            ρ         p-value
             Volume                  0.33         0.001      Volume                        0.39          0.000
             Duplication             0.34         0.001      Duplication                   0.38          0.000
             Unit size               0.53         0.000      Unit size                     0.53          0.000
             Unit complexity         0.54         0.000      Unit complexity               0.50          0.000
             Unit interfacing        0.19         0.042      Unit interfacing              0.05          0.640
             Module coupling         0.55         0.000      Module coupling               0.36          0.000
             Analysability           0.57         0.000      Analyzability                 0.33          0.002
             Changeability           0.68         0.000      Modifiability                 0.59          0.000
             Stability               0.46         0.000
             Testability             0.56         0.000      Testability                   0.49          0.000
             Maintainability         0.64         0.000      Maintainability               0.33          0.001
                                                             Modularity                    -0.30         0.004
                                                             Reusability                   0.46          0.000
                                                             Component balance             -0.34         0.001
                                                             Component independence         0.16         0.201



Table 6: Enhancement resolution time correlations. The Table on the left shows the correlation statistics found
by Bijlsma. The Table on the right shows the correlation statistics as obtained by the replication study.

                      Old Correlations                                         New Correlations

        Enhancement resolution vs.          ρ      p-value       Enhancement resolution vs.         ρ       p-value
        Volume                            0.61      0.000        Volume                            0.58      0.000
        Duplication                       0.02      0.448        Duplication                       0.09      0.499
        Unit size                         0.44      0.000        Unit size                         0.45      0.000
        Unit complexity                   0.48      0.000        Unit complexity                   0.47      0.000
        Unit interfacing                  0.10      0.213        Unit interfacing                 -0.20      0.132
        Module coupling                   0.69      0.000        Module coupling                   0.67      0.000
        Analysability                     0.44      0.000        Analyzability                     0.22      0.096
        Changeability                     0.46      0.000        Modifiability                     0.52      0.000
        Stability                         0.50      0.000
        Testability                       0.47      0.000        Testability                       0.68      0.000
        Maintainability                   0.53      0.000        Maintainability                   0.47      0.000
                                                                 Modularity                       -0.09      0.513
                                                                 Reusability                       0.37      0.004
                                                                 Component balance                -0.29      0.023
                                                                 Component independence           0.34       0.039
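The ρ and p-values in Tables 5 and 6 come from the one-sided significance test described in
Section 4.2. A minimal sketch of such a test is shown below; Spearman's rank correlation is
assumed here for illustration (the text does not name the estimator), and the rating arrays are
placeholders rather than study data.

    # Sketch: the one-sided correlation test from Section 4.2 (H0: rho = 0,
    # HA: rho > 0) at the 5% level. Spearman's rank correlation is assumed for
    # illustration, and the rating arrays are placeholders, not study data.
    # Requires SciPy >= 1.7 for the "alternative" argument.
    from scipy.stats import spearmanr

    maintainability_ratings = [2.1, 2.8, 3.0, 3.4, 3.9, 4.2, 4.5]    # one value per snapshot
    defect_resolution_ratings = [1.9, 2.2, 3.1, 2.9, 4.0, 3.8, 4.6]  # one value per snapshot

    rho, p_value = spearmanr(maintainability_ratings, defect_resolution_ratings,
                             alternative="greater")
    print(f"rho = {rho:.2f}, p = {p_value:.3f}, significant = {p_value < 0.05}")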

   Abiword, however, does follow the expectation of lower ratings for the system metrics. Its
higher overall maintainability score may be explained by the new properties introduced in the new
model (component balance and component independence), specifically because the component
independence scores for the Abiword snapshots read 5.23, 5.23 and 2.50, ordered by date.

5.2    Comparing Correlations

Since the correlations of the original system properties are similar to those in the original
study, it seems that the added maintainability sub-characteristic modularity, with its system
properties component balance and component independence, is the biggest factor causing the
maintainability correlation to drop from 0.64 to 0.33. The negative correlation for modularity
and component balance is surprising, as it goes against our intuition. Overall one would assume a
modular program would help defect and enhancement
issue resolution time rather than hinder it. However, perhaps the results make an argument about
the way modularity is currently assessed. The performance of component balance, for example, has
been debated before [BvDV13] (specifically, the discussion around the optimal number of
components and the performance on smaller systems).

5.3    Threats to Validity

One of the main threats to validity is the variety in SAT scoping. In order to get accurate
replication results, ideally, the scoping per system should be exactly the same as Bijlsma's
original scoping when rerunning the SAT. As a consequence of the scoping differences, the results
obtained may deviate slightly. However, given that the SIG maintainability model uses two-level
aggregation to compute the final maintainability score, small deviations in results should not
affect the final maintainability score by a large margin.
   An additional difference in scoping is the component depth property, which was introduced when
evolving according to the new ISO 25010 standard (as described in Section 2.1). This property
needs to be set to indicate where the highest-level components reside in the directory structure
of a system. This is needed in order to calculate the modularity system properties. The ambiguity
of the component definition requires an external validator to check for correctness. In our case,
given the age of the systems, no external validator was approached to check if we defined the
right highest-level components. The component depth property was set in accordance with our own
interpretation of the system.

6    Conclusion

In order to answer the research question What is the relation between software maintainability
and issue resolution time?, in this paper we provide answers to the sub-question "Does the
previously found strong correlation between maintainability and issue resolution time still hold
given the latest (2018) SIG maintainability model?". The experiment to find correlations between
maintainability (as assessed by the SIG maintainability model) and issue resolution time, as
originally defined and executed by Bijlsma and Luijten in 2012 [BFLV12], has been replicated. The
experiment was run on the same, reacquired (with small deviations), snapshots of systems as in
the original study, with the new (2018) version of the SIG maintainability model.
   Many similar correlations are observed between the 2010 and 2018 maintainability ratings
versus the resolution time of defects and enhancements. However, two new metrics in the 2018
model stand out: (1) component balance does not correlate as expected, and (2) component
independence correlates only in cases where enhancements are considered.
   Our next steps are to investigate the cause of the observed differences and further validate
the underlying data. Additionally we would like to extend the data set to modern software
systems.

7    Future Work

The system property component balance and its associated quality characteristic modularity can be
considered a reason why the overall defect maintainability correlation is much lower than in the
original study. Future work can expand in this direction, researching the effect of modularity on
issue resolution time. Specifically, does the modularity coefficient look any different when the
enhancement results are significant?
   Next to expanding in the direction of modularity, more questions need to be answered in order
to fully show the relation between maintainability and issue resolution time. Does the previously
found relation still hold when tested against modern systems? Furthermore, Bijlsma analyzed
mainly Java systems. How does this extend towards other languages? In this paper we tested
against maintainability as assessed by the SIG maintainability model. However, in order to make
the concept of maintainability more generalizable, do the correlations still hold when tested
against other maintainability implementations (e.g. the maintainability index as proposed by Oman
et al. [CALO94])?

References

[BFLV12]  Dennis Bijlsma, Miguel Alexandre Ferreira, Bart Luijten, and Joost Visser. Faster issue
          resolution with higher technical quality of software. Software Quality Journal,
          20(2):265–285, 2012.

[BHL+12]  Tibor Bakota, Péter Hegedűs, Gergely Ladányi, Péter Körtvélyesi, Rudolf Ferenc, and
          Tibor Gyimóthy. A cost model based on software maintainability. In Software Maintenance
          (ICSM), 2012 28th IEEE International Conference on, pages 316–325. IEEE, 2012.

[BvDV13]  Eric Bouwers, Arie van Deursen, and Joost Visser. Evaluating usefulness of software
          metrics: an industrial experience report. In 2013 35th International Conference on
          Software Engineering (ICSE), pages 921–930. IEEE, 2013.
[CALO94] Don Coleman, Dan Ash, Bruce Lowther,
         and Paul Oman. Using metrics to evalu-
         ate software system maintainability. Com-
         puter, 27(8):44–49, 1994.
[HKV07]    Ilja Heitlager, Tobias Kuipers, and Joost
           Visser. A practical model for measuring
           maintainability. In 6th International Conference on the
           Quality of Information and Communications Technology
           (QUATIC 2007), pages 30–39. IEEE, 2007.

[ISO11a]   ISO/IEC 25010:2011, Systems and soft-
           ware engineering – Systems and soft-
           ware Quality Requirements and Evaluation
           (SQuaRE) – System and software quality
           models. Standard, International Organi-
           zation for Standardization, Geneva, CH,
           March 2011.
[ISO11b]   ISO/IEC 9126-1:2001, Software engineering
           – Product quality – Part 1: Quality model.
           Standard, International Organization for
           Standardization, Geneva, CH, 2001.
[Lui10]    Bart Luijten. Faster defect resolution with
           higher technical quality of software. 2010.
[sig]      Quality model 2018 announcement.
           www.softwareimprovementgroup.com/news-knowledge/sig-quality-model-2018-now-available/.
           Accessed: 2018-12-20.
[Vis17]    Joost Visser. SIG/TÜViT Evaluation Criteria
           Trusted Product Maintainability: Guidance
           for Producers. Software Improvement
           Group, Tech. Rep., page 7, 2017.

[Vis18]    Joost Visser. SIG/TÜViT Evaluation Criteria
           Trusted Product Maintainability: Guidance
           for Producers. Software Improvement
           Group, Tech. Rep., page 7, 2018.



