Refining Software Quality Prediction with LOD

Davide Ceolin¹, Till Döhmen¹, and Joost Visser²

¹ VU University Amsterdam, de Boelelaan 1081, 1081 HV Amsterdam, The Netherlands, d.ceolin@vu.nl
² Software Improvement Group, Rembrandt Toren, 15th floor, Amstelplein 1, 1096 HA Amsterdam, The Netherlands

Abstract. The complexity of software systems is growing, and computing several software quality metrics is challenging. Being able to use already estimated quality metrics to predict their evolution is therefore a crucial task. In this paper, we outline our idea to use Linked Open Data to enrich the information available for such predictions. We report our experience so far and outline the preliminary results obtained.

1 Introduction

Software size and complexity are growing, so being able to estimate and predict software quality is crucial to monitor the software development process and promptly steer it. A quality metric provides a value summarizing one relevant aspect of the software, which can be consulted to identify issues or risks in the development process or in the software itself. Accordingly, several different quality dimensions have been defined, as described, for instance, by Kan [8].

Estimating software quality is thus a crucial but challenging task, for several reasons, including the complexity of the software to be measured and the fact that these measures are often hard to quantify: some depend on run-time software behavior, some on static software properties. Estimating the values of these measures is possible, as demonstrated, for instance, by Alves and Visser [2] and Bouwers et al. [4]. Given the complexity of this task, we propose to use such estimates to predict the temporal evolution of these values.

Preliminary analyses on a dataset from the Software Improvement Group (http://www.sig.eu) show encouraging results on the use of these estimates as a starting point for predicting the evolution of software quality ratings over time. (For confidentiality reasons, we could not make the dataset publicly available.) We hypothesize that, by using Linked Open Data (LOD), we can improve and refine the accuracy of our predictions. In particular, by enriching the information available about the analyzed projects, we can categorize them (e.g., by industry sector or programming language), thus increasing the possibility to group together projects showing similar quality evolution over time. We present here some preliminary, encouraging results obtained in this direction, and we discuss a series of open issues that we need to address in order to extend this research.

The rest of this paper is structured as follows: Section 2 introduces related work. Section 3 describes the enrichment of software project data. Section 4 provides preliminary results, which are discussed in Section 5.

2 Related Work

Software quality prediction is an important problem that has been tackled from different points of view. As Al-Jamimi and Ahmed [1] describe in their review, several relevant approaches to this problem make use of machine learning. We have also employed machine learning techniques (in particular, Markov chains [12]) to predict software quality based on the starting rating of a project [5]. The results are promising, and we aim to perfect them with additional features, properly selected from external sources such as LOD.
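As an illustration of this kind of prediction, the sketch below estimates a first-order Markov chain from rating histories and uses it to predict future ratings. This is a minimal sketch, not the implementation of [5]: the five-level rating scale, the add-one smoothing, and the example histories are all assumptions made for the example.

```python
import numpy as np

RATINGS = 5  # assumed five-level (star) quality rating scale

def estimate_transitions(sequences):
    """Estimate a row-stochastic transition matrix from rating histories."""
    counts = np.ones((RATINGS, RATINGS))  # add-one smoothing for unseen moves
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a - 1, b - 1] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def predict(matrix, current_rating, steps=1):
    """Probability distribution over ratings after `steps` future snapshots."""
    state = np.zeros(RATINGS)
    state[current_rating - 1] = 1.0
    return state @ np.linalg.matrix_power(matrix, steps)

# Hypothetical rating histories of three systems:
histories = [[4, 4, 3, 3], [2, 2, 3, 3], [5, 4, 4, 4]]
P = estimate_transitions(histories)
print(predict(P, current_rating=5, steps=2))
```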
The future quality rating of a system shows a strong correlation with its current rating, because the rating usually changes very slowly over time. Moreover, we discovered a second trend: high-quality systems tend to deteriorate and low-quality systems tend to improve, both converging towards the medium quality level. This could be explained as a case of regression towards the mean [6], i.e., it could be due to noise in the extreme quality ratings that disappears as more accurate estimates are provided. However, this possible explanation still needs to be evaluated and, in any case, would account only for the second trend. These two trends, for very high- and very low-quality systems, yield a high uncertainty in the prediction. Using LOD, we expect to obtain more tailored predictions (e.g., by identifying software quality trends associated with the programming language adopted) and thus to reduce prediction uncertainty.

Misirli and Bener [11] propose the use of Bayesian networks to make software quality predictions. As the number of potentially useful features grows as a consequence of LOD enrichment, we will consider this approach in the future. Jing et al. [7] use a dictionary-learning approach, which is more specialized but also more limited compared to our use of LOD.

3 Enriching Software Quality Prediction with LOD

Our hypothesis is that, by enriching the information about the projects we analyze with LOD, we can obtain features that are useful for improving software quality prediction. For instance, software quality could vary across industrial sectors, or the programming language used could affect quality evolution. Our focus is on a dataset provided by the Software Improvement Group, which consists mainly of projects of Dutch companies and of a few additional European customers. We enriched the dataset using mainly DBpedia [3]. In the enrichment process, we encountered the following issues:

Missing information. DBpedia contains a description of only 209 companies located in the Netherlands. Additional companies have been identified in the Dutch DBpedia (http://nl.dbpedia.org), which contains the description of 3,883 companies but does not provide information about their location.

Disambiguation. Some companies have homonyms. To disambiguate resources and identify the right URI for a given company, we expect to employ heuristics based on the company website, its location, and its industry sector.

Consistency of literals vs. URIs. Some classifications are available in an inconsistent manner. For instance, industry can appear both as http://dbpedia.org/ontology/industry and http://dbpedia.org/property/industry. In some cases, the value of one of these two properties is reported only as a literal, which hinders ontological reasoning.
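To make the enrichment step concrete, the sketch below queries DBpedia for the industry of a company, assuming the Python SPARQLWrapper library and the public DBpedia endpoint; the example company URI is illustrative and not drawn from our (confidential) dataset. It queries both the ontology and the property variant of industry, reflecting the consistency issue above.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"

# Query both the ontology and the property namespace for "industry",
# since DBpedia uses them inconsistently (see the issues above).
QUERY = """
SELECT DISTINCT ?industry WHERE {{
  {{ <{uri}> <http://dbpedia.org/ontology/industry> ?industry }}
  UNION
  {{ <{uri}> <http://dbpedia.org/property/industry> ?industry }}
}}
"""

def industries(company_uri):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY.format(uri=company_uri))
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    # Values may be resource URIs or plain literals, which is exactly
    # the consistency issue that limits ontological reasoning.
    return [b["industry"]["value"] for b in bindings]

# Illustrative company resource, not taken from our dataset:
print(industries("http://dbpedia.org/resource/Philips"))
```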
4 Preliminary Results

We performed a preliminary analysis on a dataset consisting of 1019 maintainability snapshots of 112 companies. These snapshots already carried a first industry classification provided by SIG; in total, 14 industrial sectors are present. We computed the semantic similarity between each possible pair of industrial categories using the Wikipedia distance [10] and the Wu & Palmer distance [17]. On these data, we performed a series of preliminary analyses (a sketch of the pairwise procedure of items 1 and 2 follows the list):

1. We ran a Wilcoxon signed-rank test [16] at the 95% confidence level to check whether the observations are significantly different when grouped per industrial sector. The results show a weak positive Spearman [15] correlation with both the Wikipedia (0.07) and the Wu & Palmer (0.14) distances.
2. We repeated the same procedure using the Kolmogorov-Smirnov test [9, 14]. This resulted in a slightly higher correlation: 0.16 for the Wikipedia distance and 0.24 for the Wu & Palmer distance.
3. We computed a contrast analysis [13] of the linear combinations of the observations, again grouped per industrial sector. The resulting contrast estimators showed a weak correlation with the Wikipedia distance (0.15) and with the Wu & Palmer distance (0.12).
4. We grouped a small set of observations aligned with DBpedia by the industrial sector of the companies involved (telecommunication and financial services). According to a Wilcoxon signed-rank test at the 90% confidence level, the two groups are significantly different; according to the Kolmogorov-Smirnov test, they are not.
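The sketch below illustrates the pairwise procedure of items 1 and 2 under simplifying assumptions: the sector groups and rating values are invented, Wu & Palmer similarity is computed over WordNet via NLTK as a stand-in for our similarity computation, and the two-sample Kolmogorov-Smirnov test stands in for the signed-rank test because it does not require paired samples.

```python
from itertools import combinations

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")
from scipy.stats import ks_2samp, spearmanr

# Hypothetical maintainability snapshots grouped per industrial sector:
ratings = {
    "telecommunication": [3.2, 3.5, 3.1, 2.9, 3.4],
    "bank":              [3.8, 4.0, 3.7, 3.9, 3.6],
    "insurance":         [3.0, 3.3, 3.1, 3.2, 2.8],
}

def wup_similarity(a, b):
    """Wu & Palmer similarity between the first WordNet senses of two terms."""
    return wn.synsets(a)[0].wup_similarity(wn.synsets(b)[0])

similarities, statistics = [], []
for s1, s2 in combinations(ratings, 2):
    similarities.append(wup_similarity(s1, s2))
    # Two-sample Kolmogorov-Smirnov test between the two sector groups:
    statistics.append(ks_2samp(ratings[s1], ratings[s2]).statistic)

rho, p = spearmanr(similarities, statistics)
print(f"Spearman correlation: {rho:.2f} (p = {p:.2f})")
```

The Spearman correlation between such pairwise test statistics and the semantic distances is the quantity reported in items 1 and 2.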
5 Discussion and Future Work

We presented early-stage work on the use of LOD to refine the precision and accuracy of software quality prediction. We performed a series of exploratory, preliminary studies, which show a low correlation between the maintainability of the analyzed projects and their industry sector. These results provide a basis for further exploration because: (1) the existence of a weak correlation is confirmed by several tests, so it is possible that a subset of the analyzed data presents a higher correlation; (2) the different methods for computing semantic similarity and the different statistical significance tests provided significantly different results, indicating the need to explore different computational techniques; (3) as shown by the last item of Section 4, the industrial sector seems to be a discriminant for software quality, although this aspect needs to be evaluated on larger datasets; and (4) our analyses focused on a limited set of enrichment features, but several others are available. We therefore plan to extend this research to identify the most robust methods for performing these predictions, and to extend our analyses with additional LOD features and sources.

Acknowledgements

This work is funded by Amsterdam Data Science.

References

1. H. Al-Jamimi and M. Ahmed. Machine learning-based software quality prediction models: State of the art. In ICISA, pages 1–4, 2013.
2. T. L. Alves and J. Visser. Static estimation of test coverage. In SCAM, pages 55–64. IEEE Computer Society, 2009.
3. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In ISWC, volume 4825, pages 722–735. Springer, 2007.
4. E. Bouwers, J. P. Correia, A. van Deursen, and J. Visser. Quantifying the analyzability of software architectures. In WICSA, pages 83–92. IEEE, 2011.
5. T. Döhmen, D. Ceolin, and J. Visser. Towards building a software quality prediction model. Technical report, Software Improvement Group, 2015.
6. F. Galton. Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15:246–263, 1886.
7. X.-Y. Jing, S. Ying, Z.-W. Zhang, S.-S. Wu, and J. Liu. Dictionary learning based software defect prediction. In ICSE, pages 414–423, 2014.
8. S. Kan. Metrics and Models in Software Quality Engineering. Pearson, 2002.
9. A. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4:1–11, 1933.
10. D. Milne and I. H. Witten. An open-source toolkit for mining Wikipedia. Artif. Intell., 194:222–239, 2013.
11. A. T. Misirli and A. B. Bener. A mapping study on Bayesian networks for software quality prediction. In RAISE, pages 7–11, 2014.
12. J. R. Norris. Markov Chains. Cambridge University Press, 1998.
13. R. Rosenthal and R. L. Rosnow. Contrast Analysis: Focused Comparisons in the Analysis of Variance. Cambridge University Press, 1985.
14. N. Smirnov. Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2):279–281, 1948.
15. C. Spearman. The proof and measurement of association between two things. Amer. J. Psychol., 15:72–101, 1904.
16. F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1:80–83, 1945.
17. Z. Wu and M. Palmer. Verb semantics and lexical selection. In ACL. ACL, 1994.