<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Learning and SonarQube KPIs to Predict Increasing Bug Resolution Times</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tom Gustafsson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lappeenranta-Lahti University of Technology LUT</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The technical debt metaphor is a widely discussed topic in research, but there is no common model for managing technical debt [3]. Companies invest a lot of money in maintenance, and in commercial software system maintenance it is typical to have penalties for missing SLA deadlines on bug resolution times. Taking on technical debt can be understandable when the goal is to reach the market quickly with new features, but to manage it effectively it is necessary to understand the risks and impacts of its interest. In this research, machine learning was used to evaluate whether SonarQube technical debt KPIs can be used to predict bug resolution times. The fact that the data was collected only from open source projects was a limitation, but the results were encouraging: an accuracy of approximately 90% was reached. As the number of lines of code was also seen to be a valid indicator of bug resolution times, it was concluded that it would be best to repeat this study in the environment of a commercial company that maintains many projects of similar size.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>SonarQube</kwd>
        <kwd>Technical debt</kwd>
        <kwd>SLA</kwd>
        <kwd>Software product maintenance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The technical debt metaphor is attractive to practitioners because it communicates to
both technical and nontechnical audiences that if quality problems are not addressed,
things may get worse. In their research, Ernst et al. concluded that even though
technical debt is widely known and accepted, making it visible and measurable remains a
big gap in practice. Tools were installed, but the complexity of configuring them or
interpreting their results meant that they went unused. Only a very small minority of business
managers were actively managing technical debt [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>This paper investigates whether technical debt KPIs could be used
together with machine learning to manage technical debt in a software
maintenance project in a controlled manner.</p>
      <p>
        A dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], collected from various open source projects, was used together
with machine learning techniques to see whether it is possible to estimate, based
on technical debt KPIs, whether bug-fixing SLA thresholds are kept.
      </p>
      <p>
        The limitations of this paper include the fact that strict SLA policies are more common
in commercial software products under maintenance than in open source projects.
Also, the open source software products included in the dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] were collected
at various stages of their lifecycles.
      </p>
      <p>However, the results indicated that machine learning methods could estimate
quite accurately when bug-fixing times are likely to grow beyond the SLA limits.
With ROC-AUC as the measure of estimation quality, the estimates were found to be
excellent. When the simple lines-of-code variable was removed from the set of
variables, good and close to excellent AUC values were still found.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Software maintenance has evolved dramatically in the last four decades in order to
cope with continuously changing development models. Maintenance is also an
increasingly popular research topic, with a growing number of new models and
approaches being proposed. Interestingly, the number of models proposed is
increasing rather than consolidating. This fact highlights that more research effort is needed
to identify reusable and tunable models that can be applied in different contexts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In their work, Lenarduzzi et al. presented the results of a systematic literature
review, highlighting the evolution of the metrics and models adopted in the last forty
years. Key findings included an increase in the size of the data analyzed.
One reason is the availability of an enormous open source code base that can
easily be used to build maintenance models. Despite the open source ideology, it is
nearly impossible to replicate the studies, because almost all papers are based on
private data sets and/or use custom tools that are not available to other researchers to
support replication [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Technical Debt Dataset</title>
        <p>
          Researchers and industry are adopting various tools for static code analysis to
measure technical debt and evaluate the quality of their code [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. SonarQube is one of the
most commonly used tools to support software maintenance [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. When Lenarduzzi et
al. compared it to other commonly used tools, it stood out as an exception
because of its increasing trend in popularity [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          Empirical studies on software projects are expensive because it takes a lot of time
to analyze the projects. Also, the results are difficult to compare, as studies commonly
consider different projects. In their work, Lenarduzzi et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] proposed the
“Technical Debt Dataset”, a set of measurement data from 33 Java projects from the Apache
Software Foundation. They analyzed all commits from separately defined time frames
with SonarQube to collect technical debt information and with Ptidej to detect code
smells. The dataset also includes all available commit information from the git logs
and fault information reported in the issue trackers (Jira). That information was used
together with the SZZ algorithm to identify the fault-inducing and fault-fixing commits. In the
resulting dataset, one can find information about more than 78K commits from the
selected 33 projects, approximately 1.9M SonarQube issues, 38K code smells, and
28K faults. The analysis took more than 200 days. In their paper, the researchers also
describe the data retrieval pipeline together with the tools used for the analysis. The
dataset is available in CSV format as well as in SQLite database format to facilitate
queries on the data. The aim of the Technical Debt Dataset is to open diverse
opportunities for technical debt research, enabling researchers to compare results on common
projects [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Reasoning</title>
      <p>
        At the level of IT management in industry, organizations are interested in the
economic consequences of technical debt and the risks that it may pose. Technical debt
can decrease the efficiency of running software systems and create difficulties in
extending them. The cost overhead in fixing issues or adding new functionality caused by
technical debt is considered the interest on technical debt. From an economic perspective,
technical debt is defined as the cost of repairing quality issues in software systems to
achieve an ideal quality level. An ideal quality level is the highest achievable level of
quality defined in a quality model adopted by an organization. The amount of debt is
the gap between the current and the ideal level. Interest is defined as the extra
maintenance cost spent because the ideal quality level has not been achieved. Maintenance activities
include adding new functionality and fixing bugs. Maintenance differs from technical quality
repair in that the former involves visible changes whose impacts are
immediately visible. Extra effort spent on new functionality or on fixing bugs is an
example of interest on technical debt. Interest on technical debt is not the same as
maintenance costs, because systems without technical issues will still require some
maintenance effort [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        From a software development and maintenance cost point of view, the earlier bugs are
found, the cheaper they are to fix. On the other hand, companies want to be fast to market. But
faster time-to-market and quick user feedback also imply less time for testing and
bug fixing in early stages [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Code of bad quality is more expensive to maintain, but refactoring is also risky.
It requires changes to working code that can introduce subtle bugs. Refactoring, if
done wrongly, can set you back days or weeks. Refactoring becomes riskier when
practiced informally or ad hoc [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        If there is a way to apply machine learning techniques to identify problems earlier,
there is good motivation to include it in the CI/CD pipeline, using the
SonarQube metrics and technical debt information to forecast problems. This could help
organizations balance between taking time to refactor and being fast to market. Using
the open dataset collected from various open source projects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], this paper
focuses on SonarQube technical debt metrics and their relationship to actual bug
resolution times reported in the issue tracking tool JIRA.
      </p>
      <p>If the technical debt KPIs can predict future bugs, they can be used as a good
indicator for refactoring purposes. RQ: Can SonarQube technical debt KPIs be used
to estimate increasing bug resolution times?</p>
      <sec id="sec-3-1">
        <title>Study Design</title>
        <p>
          In this study, the dataset [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] collected from various open source projects was used.
Machine learning methods were used to find out whether technical debt KPIs can be
used to estimate longer bug resolution times. In commercial maintenance processes,
bug resolution times usually have SLAs, which set limits on how quickly bugs
should be fixed. For this reason, bug resolution time was used as the
dependent (target) variable in this study. Several technical debt KPIs were used as
independent (predictor) variables, and Python scripts were run against the data to obtain
the results, together with prediction KPIs that indicate whether the prediction was accurate.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Data collection</title>
        <p>
          Technical debt, by definition, affects the time needed to add features or fix
bugs. In the dataset used [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], the faults table includes the JIRA issues with “bug” as their
type. The available timestamps are the creation time and the
resolution time in JIRA. With those timestamps, the elapsed time from the moment a bug
was reported in JIRA to the moment it was closed in JIRA was
calculated. Each bug was given a resolution time with this method. A resolution time of 30 days
was used as the threshold for an SLA violation, that is, a resolution time that is too long. This research only
investigates bugs reported in JIRA, as those best represent the problems in
commercial software in the sense that the more severe the bug is, the quicker it gets
fixed. In their research on technical debt diffuseness, Saarimäki et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] interestingly
pointed out that for technical debt items, the more severe the finding is, the longer it
remained unresolved.
        </p>
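        <p>The labelling step described above can be sketched as follows. This is only a minimal illustration, not the scripts used in the study: the timestamp format, the record keys, the sample issues, and the choice to treat a resolution time strictly greater than 30 days as a violation are all assumptions.</p>
        <preformat>
```python
from datetime import datetime

SLA_LIMIT_DAYS = 30  # threshold named in the text

# Assumed timestamp format; the real JIRA export may use a different one.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def resolution_days(created: str, resolved: str) -> float:
    """Elapsed days between the JIRA creation and resolution timestamps."""
    start = datetime.strptime(created, TIMESTAMP_FORMAT)
    end = datetime.strptime(resolved, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 86400.0


def violates_sla(created: str, resolved: str, limit_days: int = SLA_LIMIT_DAYS) -> bool:
    """True when the bug took longer than the limit (assumed strict comparison)."""
    return resolution_days(created, resolved) > limit_days


# Hypothetical fault records, invented for illustration only.
faults = [
    {"key": "DEMO-1", "created": "2019-01-01T09:00:00", "resolved": "2019-01-03T09:00:00"},
    {"key": "DEMO-2", "created": "2019-01-01T09:00:00", "resolved": "2019-03-15T09:00:00"},
]

labels = {f["key"]: violates_sla(f["created"], f["resolved"]) for f in faults}
print(labels)
```
        </preformat>
        <p>Keeping the threshold as a parameter makes it straightforward to repeat the labelling with a different SLA limit.</p>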
        <p>
          The dataset [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] includes a table named “sonar_measures”, which contains the
technical debt KPIs. Monthly figures were used for the technical debt KPIs, because
there is no single number that represents the debt for the period of bug fixing. The monthly
figures included the analysis month, max sqaleindex, max sqaledebtratio, max
sqalerating, max lines of code, max securityremediationeffort, max
reliabilityremediationeffort, and the security and reliability remediation effort per 1000 lines of code.
        </p>
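        <p>One possible way to build such monthly figures is sketched below. The column names (project, date, sqale_index, lines, reliability_effort) and the sample rows are hypothetical and chosen only for illustration; the actual schema of the sonar_measures table may differ.</p>
        <preformat>
```python
from collections import defaultdict

# Hypothetical column names; the real sonar_measures schema may differ.
FIELDS = ("sqale_index", "lines", "reliability_effort")


def monthly_max(rows, fields=FIELDS):
    """Return {(project, 'YYYY-MM'): {field: maximum value seen in that month}}."""
    result = defaultdict(dict)
    for row in rows:
        key = (row["project"], row["date"][:7])  # 'YYYY-MM-DD' -> 'YYYY-MM'
        for field in fields:
            current = result[key].get(field, row[field])
            result[key][field] = max(current, row[field])
    return dict(result)


def per_kloc(effort, lines):
    """Remediation effort per 1000 lines of code (0.0 when there is no code)."""
    return 1000.0 * effort / lines if lines else 0.0


# Invented analysis rows, one per analyzed commit.
rows = [
    {"project": "demo", "date": "2019-01-05", "sqale_index": 100, "lines": 2000, "reliability_effort": 40},
    {"project": "demo", "date": "2019-01-20", "sqale_index": 120, "lines": 1900, "reliability_effort": 50},
    {"project": "demo", "date": "2019-02-02", "sqale_index": 90, "lines": 2100, "reliability_effort": 21},
]

monthly = monthly_max(rows)
for (project, month), kpis in sorted(monthly.items()):
    kpis["reliability_effort_per_kloc"] = per_kloc(kpis["reliability_effort"], kpis["lines"])
    print(project, month, kpis)
```
        </preformat>
        <p>Note that in this sketch the monthly maxima of different KPIs may come from different analyses within the same month, so the per-1000-lines figure is an approximation.</p>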
        <p>Since the number of lines of code is an easy KPI for predicting longer resolution times, it was
decided to execute the Python scripts first with lines of code included as one KPI, and then
to collect another set of results with the same methodology but without the number of
lines of code, in order to verify whether the results differ when the most evident single
predictor of bug resolution times is removed.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Limitations</title>
      <p>
        This study uses a large dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as the input for collecting variables for the machine
learning algorithms. However, the projects in the dataset are all open source
projects with different ways of working and different policies on how quickly bugs are fixed.
Also, the selected open source projects vary a lot in terms of
project size and the phase of the product in its lifecycle.
      </p>
      <p>The research also relies on monthly aggregates of technical debt measures, while it
would be best to analyze the debt KPIs from the exact moment when a bug is found. As
the data comes from various phases of the products' lifecycles, one can find very short
bug resolution times. Comparing commit dates with bug opening and closing times
suggests that some bugs were probably found during development, fixed at the same
time, and reported afterwards.</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>In the first set of executions, the Python scripts were run against the data three times. The
dataset was balanced so that it contained as many bugs with very long resolution times
as bugs with resolution times of less than 30 days. A random forest
classifier was used. The three executions gave the following results:</p>
      <p>
        The area under the curve (AUC) shows that the random forest classifier gave predictions
that were between 90 and 92% accurate. This suggests that technical debt
KPIs do have a strong impact on bug resolution times. However, one of the
KPIs used from the dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was the number of lines of code. It is reasonable to expect that the
bigger the project grows, the longer it takes to implement changes. Therefore, the
second set of executions used exactly the same parameters, except that the number of
lines of code was removed from the list of predictor variables.
      </p>
      <p>
        When comparing these results against the values with the number of lines of code
included, the AUC still indicates quite good accuracy, from 86 to 88 percent.
The MCC also shows that the classifier can be used for estimation. However,
lines of code do boost the accuracy of the estimation, which was also a logical guess at
the beginning of the study.
      </p>
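      <p>For readers unfamiliar with the two metrics, the sketch below shows how AUC and MCC can be computed for a binary classifier. It is an illustration under stated assumptions, not the evaluation code of this study: the labels and scores are invented toy values, the 0.5 decision threshold is an assumption, and the AUC is computed with the pairwise (Mann-Whitney) formulation.</p>
      <preformat>
```python
import math


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the confusion matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data: 1 = SLA violation, 0 = resolved in time; scores are invented probabilities.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

print("AUC:", round(roc_auc(y_true, scores), 3))
print("MCC:", round(mcc(y_true, y_pred), 3))
```
      </preformat>
      <p>AUC is computed from the ranking of the predicted scores, whereas MCC depends on the chosen decision threshold, which is one reason to report both.</p>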
      <p>One can conclude from the results that, with the help of machine learning
classifiers, one can build a model to predict SLA violations. The research question was:</p>
      <sec id="sec-5-1">
        <title>Can SonarQube technical debt KPIs be used to estimate increasing bug resolution times?</title>
        <p>The research shows that technical debt KPIs can be used for this purpose;
however, technical debt KPIs by themselves did not work as well as when used together
with the simple number of lines of code KPI. The results are encouraging, taking into
account that all bugs were treated as equal and that the sizes of the projects did not affect the
results too much.</p>
        <p>It is obvious that the size of the project affects bug resolution times. Also,
it is normal that the severity of a bug affects the SLAs of commercial products.
Therefore, in future research the projects themselves should be classified by size, and
bugs separately by severity. Machine learning models could then be trained
against a set that includes data from similarly sized projects. This way, it
could be possible to build logic into the CI/CD pipeline to predict SLA violations. Further
research is needed to replicate this study in an environment that includes several
similarly sized commercial software products under maintenance. As a result, it could
be possible to build a model that gives good reasons for refactoring.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        Technical debt, by definition, affects the time needed to add features or fix
bugs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Sometimes it is good to have technical debt, because it is important to focus
resources on new features and to be quick to market [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The interest on technical debt
can be seen in bug resolution times, and in commercial software systems it is typical
to have SLAs which need to be kept in order to avoid penalties. There is no industry-standard
way of fixing technical debt issues, and fixing them ad hoc is also risky [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
One way of managing the costs of technical debt could be to utilize machine learning
models in the CI/CD pipeline and to use the predictions the algorithms give as triggers
for paying back technical debt.
      </p>
      <p>
        A dataset consisting of 33 projects, approximately 1.9M SonarQube issues, 38K
code smells, and 28K faults [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] was used to form the training data for a random forest
classifier, which was used to study whether it is possible for a model to learn to predict SLA
violations. Various technical debt KPIs were used together with the number of lines of
code and bug resolution times to train the model.
      </p>
      <p>The results showed that the model can predict quite accurately whether bug resolution
times exceed or stay within the 30-day threshold. With all the selected KPIs included, the model
predicted correct results with 90 to 92 percent accuracy. Lines of code was an obvious,
easy single KPI for predicting resolution times, since it is assumed that the bigger the
project is, the more time it takes to make changes. When the number of
lines of code was removed, accuracy remained high, at 86 to 88 percent.
The research concludes that machine learning models could be used to predict the bug
resolution times of software products.</p>
      <p>
        Limitations were also seen. Even though the dataset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is large, it consists only of open
source projects, and open source projects may not have SLA handling as strict as
commercial products do. Also, the projects may have different ways of working
from one another, and therefore the best possible environment for this kind
of study would be a set of data from one company, consisting of various software
products of similar size.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fowler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <source>Refactoring: Improving the Design of Existing Code</source>
          ,
          <publisher-name>Addison-Wesley</publisher-name>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.V.</given-names>
            <surname>Mäntylä</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khomh</surname>
          </string-name>
          , et al.
          <source>Empir Software Eng</source>
          (
          <year>2015</year>
          )
          <volume>20</volume>
          :
          <fpage>1384</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bellomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ozkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nord</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Gorton</surname>
          </string-name>
          . “
          <article-title>Measure it? Manage it? Ignore it? software practitioners and technical debt</article-title>
          .” In
          <source>Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015)</source>
          . ACM, New York, NY, USA,
          <year>2015</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>N.</given-names>
            <surname>Saarimäki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lenarduzzi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          , “
          <article-title>On the diffuseness of code technical debt in open source projects of the apache ecosystem</article-title>
          ,”
          <source>International Conference on Technical Debt (TechDebt 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>98</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>V.</given-names>
            <surname>Lenarduzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sillitti</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          ,
          “
          <article-title>Analyzing Forty Years of Software Maintenance Models</article-title>
          ,”
          <source>2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)</source>
          , Buenos Aires,
          <year>2017</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>V.</given-names>
            <surname>Lenarduzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sillitti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          ,
          “
          <article-title>A survey on code analysis tools for software maintenance prediction</article-title>
          ,” in
          <source>Software Engineering for Defence Applications - SEDA</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>V.</given-names>
            <surname>Lenarduzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saarimäki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          .
          <article-title>The Technical Debt Dataset</article-title>
          .
          <source>Proceedings of the 15th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE'19)</source>
          , September 18,
          <year>2019</year>
          , Recife, Brazil. https://doi.org/10.1145/3345629.3345630 (
          <issue>3</issue>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>