=Paper=
{{Paper
|id=Vol-2510/sattose2019_paper_11
|storemode=property
|title=The Relation between Software Maintainability and Issue Resolution Time: A Replication Study
|pdfUrl=https://ceur-ws.org/Vol-2510/sattose2019_paper_11.pdf
|volume=Vol-2510
|authors=Joren Wijnmaalen
|dblpUrl=https://dblp.org/rec/conf/sattose/Wijnmaalen19
}}
==The Relation between Software Maintainability and Issue Resolution Time: A Replication Study==
Joren Wijnmaalen (University of Amsterdam, Amsterdam, The Netherlands), j.wijnmaalen@protonmail.com
Cuiting Chen (Software Improvement Group, Amsterdam, The Netherlands), c.chen@sig.eu
Dennis Bijlsma (Software Improvement Group, Amsterdam, The Netherlands), d.bijlsma@sig.eu
Ana-Maria Oprescu (University of Amsterdam, Amsterdam, The Netherlands), a.m.oprescu@uva.nl
Abstract

Higher software maintainability comes with certain benefits. For example, software can be updated more easily to embrace new features or to fix bugs. Previous research has shown that there is a positive correlation between the maintainability score measured by the SIG maintainability model and shorter issue resolution times. That study, however, dates back to 2010. Eight years later, the software industry has evolved at a fast pace, as has the SIG maintainability model. We rerun the experiment to test whether the previously found relations are still valid. When remeasuring the maintainability of the systems with the new version of the SIG maintainability model (2018), we find that the majority of the systems score lower maintainability ratings. The correlation of overall maintainability with defect resolution time decreased significantly, while the original system properties correlate with defect resolution time similarly to the original study.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Anne Etien (ed.): Proceedings of the 12th Seminar on Advanced Techniques and Tools for Software Evolution, Bolzano, Italy, July 8-10 2019, published at http://ceur-ws.org

1 Introduction

The definition of software quality has been standardized by the International Organization for Standardization (ISO) since 2001 in ISO 9126 [ISO11b]. Since then, the definition has undergone a variety of changes as it was revised into ISO 25010 in 2011 [ISO11a]. The standard decomposes software quality into a set of characteristics; software maintainability is one of these characteristics.

Research has shown the importance of high software maintainability. Bakota et al. found an exponential relationship between maintainability and cost [BHL+12]. Bijlsma and Luijten showed a strong positive correlation between software maintainability and issue resolution time [BFLV12]. Maintenance activities largely involve solving issues that arise during development or when the product is in use. A more maintainable code base decreases the amount of time needed to resolve such issues.

However, the study by Bijlsma and Luijten dates back to 2012. To assess the maintainability of systems, they made use of the maintainability model developed by the Software Improvement Group (SIG), dating back to 2010. That model refers to ISO 9126 for its definition of software quality, and of maintainability in particular. Over the years the SIG maintainability model has been evolving (a new model was announced in 2018 [sig]), implementing a variety of smaller changes along with the new software quality definition documented in ISO 25010. Furthermore, the software industry has been evolving at a fast pace. The oldest systems Bijlsma and Luijten assessed for their empirical results date back to the beginning of the 2000s.
A lot has changed in the software industry since then: both the landscape of technologies and the processes around software have changed. For example, DevOps has emerged since the mid-2010s, introducing concepts such as continuous integration and continuous delivery. These concepts potentially change the way issues are resolved, as integration is largely automated instead of being a manual action.

Around the broader question "What is the relation between software maintainability and issue resolution time?", we propose the following research question:

• RQ1.1 Does the previously found strong correlation between maintainability and issue resolution time still hold given the latest (2018) SIG maintainability model?

2 Background

2.1 The SIG Maintainability Model

The ISO 25010 standard defines software quality through a range of quality characteristics. Each of these characteristics is further subdivided into a set of sub-characteristics. Software maintainability is one of these characteristics and is further subdivided into the following sub-characteristics: analyzability, modifiability, testability, modularity and reusability. The standard, however, does not describe how to directly measure the various quality characteristics and sub-characteristics. Instead, the Software Improvement Group (SIG) provides a pragmatic model to assess maintainability directly through static analysis of source code [HKV07]. The SIG maintainability model lists a set of source code metrics, also called software product properties. The following software product properties are measured: volume, duplication, unit size, unit complexity, unit interfacing, module coupling, component balance and component independence. These product properties are then mapped to the sub-characteristics as defined in the ISO 25010 standard. These mappings, i.e. which product properties influence which sub-characteristics, are based on expert opinion. Table 1 illustrates these mappings.

Table 1: Relationship between software product properties and the ISO 25010 maintainability sub-characteristics (product properties: volume, duplication, unit size, unit complexity, unit interfacing, module coupling, component balance, component independence; sub-characteristics: analyzability, modifiability, testability, modularity, reusability). Data taken from SIG/TÜViT Evaluation Criteria Trusted Product Maintainability [Vis18].

To calculate the maintainability rating of a system, the model first measures the product properties. These raw measures are converted to a star-based rating using a benchmark internal to SIG (1 to 5 stars, where 3 stars is the market average). Note that the stars do not divide the distribution of systems into even buckets; instead 5% of systems are assigned one star, 30% two, 30% three, 30% four and 5% five stars. Secondly, the product property ratings are aggregated into the maintainability sub-characteristic ratings based on the relations defined in Table 1. Finally, the sub-characteristic ratings are aggregated into a single final maintainability rating.
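To make this two-level aggregation concrete, the sketch below derives star ratings from benchmark percentiles and averages them upwards. It is an illustration only: the percentile cut-offs, the property-to-sub-characteristic subset and the unweighted means are assumptions, since the actual thresholds and aggregation weights are internal to SIG.

```python
# Illustrative sketch only: the real benchmark thresholds, property
# orientations and aggregation weights are internal to SIG; the values
# below are placeholders.
from statistics import mean, quantiles

def calibrate_cut_points(benchmark_values, higher_is_better=True):
    """Derive four cut points so that 5/30/30/30/5 percent of the benchmark
    systems end up in the 1..5 star buckets."""
    pts = quantiles(benchmark_values, n=100)          # 99 percentile points
    cuts = [pts[4], pts[34], pts[64], pts[94]]        # 5th, 35th, 65th, 95th
    return cuts if higher_is_better else list(reversed(cuts))

def star_rating(value, cuts, higher_is_better=True):
    """Map a raw product property measurement to a 1..5 star rating."""
    better = (lambda v, c: v >= c) if higher_is_better else (lambda v, c: v <= c)
    return 1 + sum(better(value, c) for c in cuts)

# Two-level aggregation: property ratings -> sub-characteristic ratings
# -> one overall maintainability rating (plain unweighted means here).
MAPPING = {  # hypothetical subset of the Table 1 mapping
    "analyzability": ["volume", "duplication", "unit_size"],
    "modifiability": ["duplication", "unit_complexity", "module_coupling"],
}

def maintainability(property_ratings):
    sub = {s: mean(property_ratings[p] for p in props)
           for s, props in MAPPING.items()}
    return mean(sub.values())
```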
Evolution of the SIG Maintainability Model

Both Bijlsma and Luijten assessed the maintainability characteristic of software quality as described by the ISO 9126 standard, using the SIG maintainability model. Since the ISO 9126 standard has been revised into the ISO 25010 standard, the SIG maintainability model has evolved accordingly, which is part of the motivation for this replication study. In order to reason about the results of this replication study, the differences between the 'modern' SIG maintainability model (hereinafter referred to as the new model) and the model used by Bijlsma and Luijten (hereinafter referred to as the old model) need to be highlighted.

Compared to ISO 9126, ISO 25010 adds the sub-characteristic modularity to maintainability. Modularity is defined as "The degree to which a system or computer program is composed of discrete components such that a change to one component has minimal impact on other components." [ISO11a]. In order to account for this new sub-characteristic, two new system properties were introduced in the new model: component balance and component independence. Apart from accounting for the new sub-characteristic, these properties were expected to stimulate discussions about the architecture of systems and to incorporate a common viewpoint in the assessment of implemented architectures, as mentioned by Bouwers et al. in their evaluation of the SIG maintainability model metrics [BvDV13].

The introduction of these properties also raises questions about the definition of a component. Visser defines the term component as follows in his technical report: "A component is a subdivision of a system in which source code modules are grouped together based on a common trait. Often components consist of modules grouped together based on a shared technical or functional aspect" [Vis18]. In practice, this definition is still rather vague. It introduces the need for an external evaluator to point out the core components of any specific system, based on their perception of how functionality is grouped and its granularity.
2.2 Issue Resolution Time

Both Luijten and Bijlsma look at issue resolution time in their studies. Bijlsma defines issue resolution time as "the total time an issue is in open state. [...] Resolution time is not simply the time between the issue being reported and the issue being resolved." [BFLV12] Instead, Bijlsma illustrates the life cycle of an issue using Figure 1; he measured the highlighted period of time in that figure for his study. Even though it would seem better to start measuring when the status of an issue is set to assigned, this was realistically not possible for the data Bijlsma obtained. Many projects Bijlsma analyzed were inconsistent in using the assigned property in their Issue Tracking Systems (ITS), making it impossible to accurately determine when a developer started working on an issue.

Figure 1: The issue resolution time (in green) as measured by Bijlsma and Luijten [BFLV12].

Next to the issue resolution time life cycle, there is also the notion of issue types. Various issue tracking systems use different terms to denote the variety in issues. Bijlsma defined the following types: defect, enhancement, patch and task. A defect, according to Bijlsma, is a "problem in the system" [BFLV12]. An enhancement can be "the addition of a new feature, or an improvement of an existing feature". Tasks and patches are "usually one time activities" and unify various other issue types with a range of urgencies. The tools Bijlsma and Luijten used in their experiment normalized all issues obtained from the various ITSs to these four types only. Luijten originally focused on issues of type defect, whereas Bijlsma expanded the scope with issues of type enhancement.
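The "total time in open state" measurement can be sketched as follows. The event format and the set of states counted as open are assumptions chosen for illustration; they do not reproduce Bijlsma's tooling.

```python
# Sketch of the "total time in open state" measurement described above.
# The event format and status names are assumptions, not Bijlsma's tooling.
from datetime import datetime

OPEN_STATES = {"new", "open", "reopened", "assigned"}  # assumed set of open states

def resolution_time_days(events):
    """events: chronologically ordered (timestamp, status) pairs.
    Sums every interval the issue spent in an open state."""
    total = 0.0
    for (t0, s0), (t1, _s1) in zip(events, events[1:]):
        if s0 in OPEN_STATES:
            total += (t1 - t0).total_seconds() / 86400.0
    return total

events = [
    (datetime(2009, 1, 1), "open"),
    (datetime(2009, 1, 20), "resolved"),
    (datetime(2009, 2, 1), "reopened"),
    (datetime(2009, 2, 5), "closed"),
]
print(resolution_time_days(events))  # 23.0 days spent in an open state
```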
tionship between maintainability and issue resolution
3
times still hold using the new SIG maintainability In a small amount of cases the matching snap-
model to assess maintainability. Therefore, the same shots were not listed in the tags, releases or other
systems and issue tracking data will be used as in Bi- archiving Subversion root directories. In that
jlsma’s experiment. Bijlsma assessed 10 open source case, the Subversion trunk directory was checked
systems looking at multiple snapshots situated in var- out at the revision closest to the date listed by
ious points of time of the systems lifespan. Given the Bijlsma.
definition for issue resolution time, and the concept of
issue resolution time quality ratings, as described in For every snapshot new maintainability ratings are
Section 2.3, ratings are calculated per snapshot. For calculated. These ratings are obtained using the SIG
each snapshot, Bijlsma and Luijten consider all issues software analysis toolkit (SAT). The SAT implements
that are closed and/or resolved between that snap- the latest (2018) version of the SIG maintainability
shot and the next as relevant for that snapshot [Lui10]. model. It provides the final maintainability rating,
These ratings are directly re-used in this experiment along with its sub-characteristics as described by the
as calculated by Bijlsma, as they were archived and ISO 25010 standard.
directly ready for use.
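A sketch of how Tables 2 and 3 can be applied is given below: each defect's resolution time is binned into a risk category, and the resulting risk profile is compared against per-rating caps. Reading the Table 3 percentages as upper bounds per category is our interpretation, and the interpolation onto the [0.5, 5.5] scale is omitted.

```python
# Sketch of Luijten-style defect resolution ratings using the thresholds
# from Tables 2 and 3. Treating the Table 3 percentages as per-category
# upper bounds is an assumption; interpolation to [0.5, 5.5] is left out.
RISK_CATEGORIES = [("Low", 23.6), ("Moderate", 68.2), ("High", 198.0),
                   ("Very High", float("inf"))]

# stars -> maximum allowed share of defects in (Moderate, High, Very High)
RATING_CAPS = [(5, (0.07, 0.00, 0.00)),
               (4, (0.25, 0.25, 0.02)),
               (3, (0.43, 0.39, 0.13)),
               (2, (0.43, 0.42, 0.35))]

def risk_category(resolution_days):
    for name, upper_bound in RISK_CATEGORIES:
        if resolution_days <= upper_bound:
            return name

def defect_rating(resolution_days_per_issue):
    n = len(resolution_days_per_issue)
    categories = [risk_category(d) for d in resolution_days_per_issue]
    profile = [categories.count(c) / n for c in ("Moderate", "High", "Very High")]
    for stars, caps in RATING_CAPS:
        if all(share <= cap for share, cap in zip(profile, caps)):
            return stars
    return 1  # anything worse falls back to one star

# Example: ten defects, one of which took 250 days to resolve
print(defect_rating([5, 12, 20, 3, 8, 15, 22, 1, 9, 250]))  # -> 3
```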
3 Method

This replication study aims to discover whether the relationship between maintainability and issue resolution times still holds when the new SIG maintainability model is used to assess maintainability. Therefore, the same systems and issue tracking data are used as in Bijlsma's experiment. Bijlsma assessed 10 open source systems, looking at multiple snapshots situated at various points in time of the systems' lifespan. Given the definition of issue resolution time, and the concept of issue resolution time quality ratings as described in Section 2.3, ratings are calculated per snapshot. For each snapshot, Bijlsma and Luijten consider all issues that are closed and/or resolved between that snapshot and the next as relevant for that snapshot [Lui10]. These ratings are directly re-used in this experiment as calculated by Bijlsma, as they were archived and directly ready for use.

Table 4 shows the systems assessed by Bijlsma. Since the original snapshots Bijlsma used in his study were not archived, the snapshots had to be re-obtained. For every snapshot Bijlsma listed a version and a date. Using this data we were able to retrieve all snapshots by using the following two methods for retrieval:

• Official System Archives: Some systems maintain an official archive. Snapshots matching date and version number are directly retrieved from these archives. For a small number of snapshots the date Bijlsma listed deviates a couple of days from the date coupled with the version number in the archive. These snapshots were still retrieved, as it was assumed that the deviation in dates was caused by human error.

• Version Control Systems (VCS): If a system's organization stopped hosting, or does not have an archive containing older versions, the snapshot was retrieved by traversing the system's respective VCS. The majority of the systems assessed by Bijlsma use Subversion as their main VCS. Subversion, by default, contains the root folders '/trunk', '/branches' and '/tags'. Where trunk is the directory in which the main development takes place and branches contain the features developed in parallel to main development, the tags directory is specifically interesting, as it contains read-only copies of the source code at a specified point in time. Note that Subversion repositories default to the trunk, branches and tags directories but are not limited to this structure. For example, Webkit adds a '/releases' directory to the repository root, containing all major releases (where tags are of a finer granularity). In this case, both the tags and the releases are navigated to find the right snapshot.

In a small number of cases the matching snapshots were not listed in the tags, releases or other archiving Subversion root directories. In that case, the Subversion trunk directory was checked out at the revision closest to the date listed by Bijlsma.
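The retrieval fallback can be sketched as below. The repository URL, tag naming scheme and helper function are hypothetical; Subversion's '{DATE}' revision specifier is standard syntax for checking out the last revision on or before a given date.

```python
# Sketch of the retrieval fallback described above. Repository URL and tag
# naming are hypothetical; Subversion's '{DATE}' revision specifier is used
# to check out trunk at the last revision on or before the listed date.
import subprocess

def checkout_snapshot(repo_url, version, date, dest):
    for candidate in (f"{repo_url}/tags/{version}", f"{repo_url}/releases/{version}"):
        # 'svn ls' exits non-zero when the path does not exist in the repository
        if subprocess.run(["svn", "ls", candidate], capture_output=True).returncode == 0:
            subprocess.run(["svn", "checkout", candidate, dest], check=True)
            return candidate
    # Fallback: trunk at the revision closest to (not after) the documented date
    subprocess.run(["svn", "checkout", "-r", "{" + date + "}",
                    f"{repo_url}/trunk", dest], check=True)
    return f"{repo_url}/trunk@{{{date}}}"

# e.g. checkout_snapshot("https://svn.example.org/repo", "1.7.0", "2006-12-13", "snapshot-1.7.0")
```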
For every snapshot, new maintainability ratings are calculated. These ratings are obtained using the SIG software analysis toolkit (SAT). The SAT implements the latest (2018) version of the SIG maintainability model. It provides the final maintainability rating, along with its sub-characteristics as described by the ISO 25010 standard.

3.1 Data Acquisition

In order to replicate Bijlsma's original study, the same data is needed. We assume, since the snapshots were retrieved from the official system archives or version control systems, that the contents of the retrieved snapshots are the same. The only other metric provided by Bijlsma to verify this assumption is the size of the snapshot in LOC. Figure 2 compares the snapshot sizes as found by Bijlsma against the snapshot sizes found by us. It can be immediately seen that the blue and red graphs do not align. This behaviour is expected, however, as these numbers are retrieved after the SAT analysis.

The SAT requires a scoping definition per system in order to function. Scoping is the process of determining which parts and files of the system should be included in the calculation. For example, source files in a '/libs/' folder should not be included in the calculation, as they are external dependencies and are not maintained by the development team directly. Ideally, the scoping per system should be exactly the same as Bijlsma's original scoping when rerunning the SAT. However, scoping files were not documented in Bijlsma's study, so we had to do our own scoping. Luckily, Bijlsma provided feedback in person to check that the scoping files were roughly the same.
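A minimal sketch of what such a scoping filter amounts to is shown below. The SAT's actual scoping format is not documented in this paper, so the exclusion patterns and file extensions are illustrative assumptions.

```python
# Minimal scoping sketch: exclude external dependencies and generated code
# before counting lines. The exclusion patterns and extensions are
# illustrative; they do not reproduce the SAT's actual scoping files.
from pathlib import Path

EXCLUDED_DIRS = {"libs", "lib", "vendor", "third_party", "generated"}  # assumed
SOURCE_EXTENSIONS = {".java", ".c", ".cpp", ".h"}                      # assumed

def in_scope(path: Path) -> bool:
    return (path.suffix in SOURCE_EXTENSIONS
            and not any(part in EXCLUDED_DIRS for part in path.parts))

def scoped_loc(snapshot_root: str) -> int:
    """Count non-blank lines in all in-scope source files of a snapshot."""
    root = Path(snapshot_root)
    return sum(sum(1 for line in p.open(errors="ignore") if line.strip())
               for p in root.rglob("*") if p.is_file() and in_scope(p))
```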
Given the scoping differences, a slight deviation in the lines of code the SAT considers for the calculation can be explained. We do expect, however, that the newly acquired snapshots follow the same trend line as the old snapshots. For the majority of the systems listed in Figure 2 this is the case (e.g. abiword, ant, argouml, checkstyle), but some other systems stand out. Webkit, for example, only has a third of the original size (in KLOC). The cause of this large difference remains unknown to us, as running the SAT with all junk files included does not come near the originally reported size numbers. As such, the newly measured Webkit maintainability numbers cannot be compared against the old ones, and this will impact the correlation results.

Table 4: Systems assessed by Bijlsma.

System           | Main Language | LOC (Latest), Replication | LOC (Latest), Original | Snapshots
Abiword          | C++           | 427,885                   | 415,647                | 3
Ant              | Java          | 101,647                   | 122,130                | 20
ArgoUML          | Java          | 155,559                   | 171,402                | 20
Checkstyle       | Java          | 38,685                    | 38,654                 | 22
Hibernate-core   | Java          | 114,953                   | 145,482                | 5
JEdit            | Java          | 109,403                   | 100,159                | 4
Spring-framework | Java          | 98,141                    | 118,833                | 21
Subversion       | C             | 192,620                   | 218,611                | 5
Tomcat           | Java          | 158,881                   | 163,589                | 19
Webkit           | C++           | 425,929                   | 1,255,543              | 1
Total            |               |                           |                        | 120

Figure 2: Comparing system sizes per snapshot (KLOC). Each point represents a snapshot of the given system, ordered by increasing date. Each snapshot is represented by two points: size in KLOC as found by Bijlsma (red) and by us (blue).
3.2 Method Differences

To summarize, some elements of the original study's method have been kept exactly the same, while other elements have been changed.

Differences. Since the original snapshots were not archived, the snapshots had to be reacquired. This results in small data inconsistencies for most systems and in large inconsistencies for one system in particular (Webkit). Additionally, as is the purpose of this study, the new SIG maintainability model is used to measure maintainability, as opposed to the original SIG maintainability model dating from 2010.

Equalities. Given the concept of issue resolution time and issue resolution time quality ratings, these ratings have been re-used directly for all snapshots and systems. The respective issue tracking systems were not mined again. Furthermore, the correlations are calculated in the same manner.

4 Results

4.1 Comparing Maintainability

Maintainability ratings are directly compared between the old and the new model to gain a better understanding of how the SIG maintainability model evolved. This also serves as a validation step to see whether the new maintainability ratings are reasonable and within expectation compared to the old ones. Figure 3 gives a high-level overview of how the maintainability ratings per system (distributed over all snapshots) compare between the old and the new SIG maintainability model. We observe that for the majority of the systems (ant through tomcat) the new maintainability ratings are lower compared to the old model. This behaviour is expected because the SAT uses benchmark data to determine the thresholds of the rating buckets, and this benchmark has been rising over the years (see Section 5 for further elaboration). Note that, even though plotted, maintainability cannot be compared for Webkit, as the reacquired snapshot has over 500 KLOC less than the original.

The maintainability rating calculated by the SIG maintainability model is composed via a double aggregation. Figure 3 adds another aggregation on top of this, combining multiple maintainability ratings per system into a single boxplot. The figure provides a high-level overview, but in order to discover the other factors that cause the deviation in maintainability ratings we need to zoom in on the low-level metrics. Figure 4 illustrates all unit complexity ratings of all systems, ordered by snapshot date. The figure shows how, in general, the new ratings follow the same trend as the old ratings, but at a slightly lower level altogether. Specifically the systems ant, jedit, tomcat and springframework show this behaviour well. The lower rating can again be explained by the rising benchmark thresholds. Webkit consistently rates higher for all system metrics, but these ratings are insignificant due to the large deviation in reacquired snapshot size. ArgoUML consistently shows lower ratings for the original system properties (without the modularity system properties), but shows a higher rating in overall maintainability (Figure 3).

Figure 3: Snapshot maintainability distribution per system. Each system contains two boxplots: the maintainability ratings as obtained by Bijlsma (red) and the maintainability ratings obtained by the 2018 version of the SIG maintainability model (blue), distributed over all the snapshots of the system.

Figure 4: Unit complexity ratings per system. Each point represents a snapshot of the given system, ordered by increasing date. Each snapshot is represented by two points: unit complexity of the old model (red) and the new model (blue).

4.2 Comparing Correlations

Bijlsma and Luijten classified four types of issues: defect, enhancement, patch and task. Bijlsma and Luijten investigated issues of type defect and enhancement. Tables 5 and 6 show the new correlations found for these two types of issues. Every correlation is tested for significance, given the hypotheses H0: ρ = 0 and HA: ρ > 0. For the null hypothesis to be rejected, a significance threshold of 5% is used.
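A minimal sketch of this test is given below, assuming Spearman's rank correlation (consistent with the ρ notation used here and in the original studies) and SciPy's one-sided alternative; the exact tooling used for the calculations is not documented in this paper.

```python
# Minimal sketch of the significance test: Spearman's rho with
# H0: rho = 0 against HA: rho > 0 at the 5% level. Assumes rank
# correlation; SciPy >= 1.7 is required for the one-sided `alternative`.
from scipy.stats import spearmanr

def positive_correlation(property_ratings, resolution_ratings, alpha=0.05):
    rho, p_value = spearmanr(property_ratings, resolution_ratings,
                             alternative="greater")
    return rho, p_value, p_value < alpha
```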
Given the new defect correlations, the correlations of the original system properties are comparable, except for module coupling, which shows a significant drop from 0.55 to 0.36. The other surprising result is the large drop of the maintainability correlation from 0.64 to 0.33. The negative correlation of modularity is surprising, as it goes against our intuition: intuitively, modular systems should be easier to modify than systems with huge, monolithic components. Further, unit interfacing has vastly decreased in significance, with its p-value rising from 0.042 to 0.640.

Table 6 shows the same comparison of correlations as Table 5, but for enhancement resolution speed. The difference in maintainability correlations is a lot smaller compared to the difference found in the defect correlations. The modularity correlations also stand out, since the coefficients of both modularity and component balance cannot be assumed, given that their p-values are larger than 0.05. The decrease in significance is specifically interesting compared to the defect correlations in Table 5.

Table 5: Defect resolution time correlations. The upper part shows the correlation statistics found by Bijlsma (old model); the lower part shows the correlation statistics obtained by the replication study (new model).

Old correlations (defect resolution vs.) | ρ    | p-value
Volume                                   | 0.33 | 0.001
Duplication                              | 0.34 | 0.001
Unit size                                | 0.53 | 0.000
Unit complexity                          | 0.54 | 0.000
Unit interfacing                         | 0.19 | 0.042
Module coupling                          | 0.55 | 0.000
Analysability                            | 0.57 | 0.000
Changeability                            | 0.68 | 0.000
Stability                                | 0.46 | 0.000
Testability                              | 0.56 | 0.000
Maintainability                          | 0.64 | 0.000

New correlations (defect resolution vs.) | ρ     | p-value
Volume                                   | 0.39  | 0.000
Duplication                              | 0.38  | 0.000
Unit size                                | 0.53  | 0.000
Unit complexity                          | 0.50  | 0.000
Unit interfacing                         | 0.05  | 0.640
Module coupling                          | 0.36  | 0.000
Analyzability                            | 0.33  | 0.002
Modifiability                            | 0.59  | 0.000
Testability                              | 0.49  | 0.000
Maintainability                          | 0.33  | 0.001
Modularity                               | -0.30 | 0.004
Reusability                              | 0.46  | 0.000
Component balance                        | -0.34 | 0.001
Component independence                   | 0.16  | 0.201

Table 6: Enhancement resolution time correlations. The upper part shows the correlation statistics found by Bijlsma (old model); the lower part shows the correlation statistics obtained by the replication study (new model).

Old correlations (enhancement resolution vs.) | ρ    | p-value
Volume                                        | 0.61 | 0.000
Duplication                                   | 0.02 | 0.448
Unit size                                     | 0.44 | 0.000
Unit complexity                               | 0.48 | 0.000
Unit interfacing                              | 0.10 | 0.213
Module coupling                               | 0.69 | 0.000
Analysability                                 | 0.44 | 0.000
Changeability                                 | 0.46 | 0.000
Stability                                     | 0.50 | 0.000
Testability                                   | 0.47 | 0.000
Maintainability                               | 0.53 | 0.000

New correlations (enhancement resolution vs.) | ρ     | p-value
Volume                                        | 0.58  | 0.000
Duplication                                   | 0.09  | 0.499
Unit size                                     | 0.45  | 0.000
Unit complexity                               | 0.47  | 0.000
Unit interfacing                              | -0.20 | 0.132
Module coupling                               | 0.67  | 0.000
Analyzability                                 | 0.22  | 0.096
Modifiability                                 | 0.52  | 0.000
Testability                                   | 0.68  | 0.000
Maintainability                               | 0.47  | 0.000
Modularity                                    | -0.09 | 0.513
Reusability                                   | 0.37  | 0.004
Component balance                             | -0.29 | 0.023
Component independence                        | 0.34  | 0.039

5 Discussion

5.1 Comparing Maintainability

One explanation for the lower maintainability is the calibration of the benchmark thresholds used to determine the ratings. Over the years, newly measured systems are added to the SAT benchmark. The observation is that the distribution of quality ratings in the benchmark shifts, over time, towards a higher average. In order to compensate for this phenomenon, SIG calibrates the thresholds for all ratings (both system properties and characteristics) yearly. This means that the thresholds (for most of the characteristics) have become stricter. This is also documented in the 'Guidance for Producers' documents, which SIG releases yearly. For example, the 2018 document mentions for unit complexity that "To be eligible for certification at the level of 4 stars, for each programming language used the percentage of lines of code residing in units with McCabe complexity number higher than 5 should not exceed 21.1%" [Vis18], while the 2017 document states the same but with a threshold of 24.3% [Vis17]. Remeasuring the same systems with the stricter benchmark thresholds results in overall lower maintainability scores.
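As a worked example of this recalibration, consider a hypothetical code base in which 23% of the lines of code reside in units with a McCabe complexity above 5: it meets the quoted 2017 criterion for 4-star eligibility, but not the 2018 one.

```python
# Worked example of the stricter unit complexity calibration quoted above:
# the same (hypothetical) code base passes the 2017 4-star criterion but
# fails the 2018 one.
share_complex_loc = 0.23   # 23% of LOC in units with McCabe complexity > 5

print(share_complex_loc <= 0.243)  # True:  eligible for 4 stars under the 2017 thresholds
print(share_complex_loc <= 0.211)  # False: no longer eligible under the 2018 thresholds
```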
This expected behaviour of lower maintainability ratings is consistent for eight out of ten systems. The systems Abiword and Webkit stand out, as they both score higher compared to the original rating.

Webkit can be considered an outlier. The system is represented by a single snapshot that consistently scores higher for all system properties and aggregated ratings. This may be the result of the re-acquisition of the snapshot, as the newly obtained snapshot has roughly 500 KLOC less than documented by Bijlsma.

Abiword, however, does follow the expectation of lower ratings for the system metrics. The overall higher maintainability score may be explained by the new properties introduced in the new model (component balance and component independence), specifically because the component independence scores for the Abiword snapshots read 5.23, 5.23 and 2.50, ordered by date.

5.2 Comparing Correlations

Since the correlations of the original system properties are similar, it seems that the added maintainability sub-characteristic modularity, with its system properties component balance and component independence, is the biggest factor causing the maintainability correlation to drop from 0.64 to 0.33. The negative correlation for modularity and component balance is surprising, as it goes against our intuition: overall, one would assume that a modular program would help defect and enhancement issue resolution time rather than the opposite. However, perhaps the results make an argument about the way modularity is currently assessed. The performance of component balance, for example, has been debated before [BvDV13] (specifically, the discussion around the optimal number of components and the performance on smaller systems).

5.3 Threats to Validity

One of the main threats to validity is the variation in SAT scoping. In order to get accurate replication results, ideally, the scoping per system should be exactly the same as Bijlsma's original scoping when rerunning the SAT. As a consequence of the scoping differences, the results obtained may deviate slightly. However, given that the SIG maintainability model uses a two-level aggregation to compute the final maintainability score, small deviations in the results should not affect the final maintainability score by a large margin.

An additional difference in scoping is the component depth property, which was introduced when the model evolved according to the new ISO 25010 standard (as described in Section 2.1). This property needs to be set to indicate where the highest-level components in the directory structure of a system reside. This is needed in order to calculate the modularity system properties. The ambiguity of the component definition requires an external validator to check for correctness. In our case, given the age of the systems, no external validator was approached to check whether we defined the right highest-level components. The component depth property was set in accordance with our own interpretation of the system.
6 Conclusion

In order to answer the research question What is the relation between software maintainability and issue resolution time?, in this paper we provide answers to the sub-question "Does the previously found strong correlation between maintainability and issue resolution time still hold given the latest (2018) SIG maintainability model?". The experiment to find correlations between maintainability (as assessed by the SIG maintainability model) and issue resolution time, as originally defined and executed by Bijlsma and Luijten in 2012 [BFLV12], has been replicated. The experiment was run on the same, reacquired (with small deviations), snapshots of systems as in the original study, with the new (2018) version of the SIG maintainability model.

Many similar correlations are observed between the 2010 and 2018 maintainability ratings versus the resolution time of defects and enhancements. However, two new metrics in the 2018 model stand out: (1) component balance does not correlate as expected, and (2) component independence correlates only when enhancements are considered.

Our next steps are to investigate the cause of the observed differences and to further validate the underlying data. Additionally, we would like to extend the data set to modern software systems.

7 Future Work

The system property component balance and its associated quality characteristic modularity can be considered a reason why the overall defect maintainability correlation is much lower than in the original study. Future work can expand in this direction, researching the effect of modularity on issue resolution time. Specifically, does the modularity coefficient look any different when the enhancement results are significant? Next to expanding in the direction of modularity, more questions need to be answered in order to fully show the relation between maintainability and issue resolution time. Does the previously found relation still hold when tested against modern systems? Furthermore, Bijlsma analyzed mainly Java systems; how does this extend towards other languages? In this paper we tested against maintainability as assessed by the SIG maintainability model. However, in order to make the concept of maintainability more generalizable, do the correlations still hold when tested against other maintainability implementations (e.g. the maintainability index as proposed by Oman et al. [CALO94])?
References

[BFLV12] Dennis Bijlsma, Miguel Alexandre Ferreira, Bart Luijten, and Joost Visser. Faster issue resolution with higher technical quality of software. Software Quality Journal, 20(2):265-285, 2012.

[BHL+12] Tibor Bakota, Peter Hegedus, Gergely Ladányi, Peter Kortvelyesi, Rudolf Ferenc, and Tibor Gyimóthy. A cost model based on software maintainability. In 2012 28th IEEE International Conference on Software Maintenance (ICSM), pages 316-325. IEEE, 2012.

[BvDV13] Eric Bouwers, Arie van Deursen, and Joost Visser. Evaluating usefulness of software metrics: an industrial experience report. In 2013 35th International Conference on Software Engineering (ICSE), pages 921-930. IEEE, 2013.

[CALO94] Don Coleman, Dan Ash, Bruce Lowther, and Paul Oman. Using metrics to evaluate software system maintainability. Computer, 27(8):44-49, 1994.

[HKV07] Ilja Heitlager, Tobias Kuipers, and Joost Visser. A practical model for measuring maintainability. In Proceedings of the 6th International Conference on the Quality of Information and Communications Technology (QUATIC 2007), pages 30-39. IEEE, 2007.

[ISO11a] ISO/IEC 25010:2011, Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - System and software quality models. Standard, International Organization for Standardization, Geneva, CH, March 2011.

[ISO11b] ISO/IEC 9126-1:2001, Software engineering - Product quality - Part 1: Quality model. Standard, International Organization for Standardization, Geneva, CH, 2001.

[Lui10] Bart Luijten. Faster defect resolution with higher technical quality of software. 2010.

[sig] Quality model 2018 announcement. www.softwareimprovementgroup.com/news-knowledge/sig-quality-model-2018-now-available/. Accessed: 2018-12-20.

[Vis17] Joost Visser. SIG/TÜViT Evaluation Criteria Trusted Product Maintainability: Guidance for producers. Software Improvement Group, Tech. Rep., page 7, 2017.

[Vis18] Joost Visser. SIG/TÜViT Evaluation Criteria Trusted Product Maintainability: Guidance for producers. Software Improvement Group, Tech. Rep., page 7, 2018.