
Predicting Software Defectiveness through Network Analysis

Faculty of Informatics, University of Lugano, Via Giuseppe Buffi 13, 6900 Lugano, Switzerland

2013

531 540

We used a complex network approach to study the evolution of a large software system, Eclipse, with the aim of statistically characterizing software defectiveness over time. We studied the software networks associated with several releases of the system, focusing specifically on their community structure, modularity and clustering coefficient. We found that the maximum average defect density is related, directly or indirectly, to two different metrics: the number of detected communities inside a software network and the clustering coefficient. Both the maximum average defect density and the maximum clustering coefficient follow a power law in the number of communities, which leads to a linear correlation between the two maxima. These results can be useful for making predictions about the evolution of software systems, especially with respect to their defectiveness.

Copyright © 2015 by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE 2015 Seminar on Advanced Techniques and Tools for Software Evolution, University of Mons, Belgium, 6-8 July 2015, published at http://ceur-ws.org

1 Introduction

Modern software systems are large and complex products, built according to a modular structure, where modules (like classes in object-oriented systems) are connected with each other to enable software reuse, encapsulation, information hiding, maintainability and so on. Software modularization is acknowledged as a good programming practice [Par72, BC99, SM96], and emphasis is put on the prescription that designing software with low coupling and high cohesion increases its quality [CK94]. In this work we present a study on the relationships between the quality of software systems and their modular structure. To perform this study we used an approach based on the concept of complex networks.

Since software systems are inherently complex, a natural way to model them is through their associated networks [Mye03, ŠB11, WKD07, ŠŽBB15, ZN08]. In a software network, nodes represent software modules (e.g. classes) and edges represent connections between software modules (e.g. inheritance or collaboration relationships). We investigated the software modular structure, and its impact on software defectiveness, by studying specific network properties: community structure, modularity and clustering coefficient.

A community inside a network is a subnetwork formed by nodes that are more densely connected to each other than to nodes outside the community [GN01]. Modularity is a function that measures how pronounced a community structure is, namely how well the nodes are arranged in communities [NG04]. The clustering coefficient is a measure of connectedness among the nodes of a network [New03].

We studied several releases of a large software system, Eclipse, performing a longitudinal analysis of the relationship between community structure, clustering coefficient and software defectiveness. Our aim is to determine whether the studied metrics can be used to better understand the evolution of software defectiveness over time and to predict the defectiveness of future releases. The results shown in this paper are part of more extensive research on different Java projects, which is currently being consolidated. Our eventual aim is to show that the results presented in this work also hold for other widely used Java projects.

This paper is organized as follows. In Section 2 we review some recent literature on software network analysis, community structure and defect prediction. In Section 3 we introduce some background concepts taken from the research on complex networks, whereas in Section 4 we thoroughly report the adopted metrics and the methodology. In Section 5 we present some of our results and discuss them in Section 6. In Section 7 we illustrate the threats to validity and in Section 8 we draw some conclusions and outline the future work.

2 Related Work

Since software systems are often large and complex, one of the best candidates to represent them is the complex network [FMS01, Mye03, TCM+11]. Many software networks, such as class diagrams [RS02, VS03], collaboration graphs [Mye03] and package dependency networks [CL04], have already been shown to exhibit the typical properties of complex networks [BAJ00], such as fractal and self-similar features [TMT12], scale-free [CS05] and small-world properties, and consequently power-law distributions for the node degree, for the bugs [CMM+11] and for refactored classes [MTM+12].

Modelling a software system as a complex network has been shown to have many applications to the study of failures and defects. For example, Wen et al. [WKD07] reported that scale-free software networks could be more robust with respect to random failures. Other methods have been applied to understand the relationship between number of defects and LOC [Zha09], while Ostrand et al. used a negative binomial regression model to show that the majority of bugs is contained in only a fraction of the files (20%) [OWB05].

So far many other methods have been tried for bug prediction [DLR10, HBB+12], especially using dependency graphs [NAH10, ZN08], but only recently have many researchers focused their attention on the community structure as defined in social network analysis, namely the division into subgroups of nodes among which there is a higher density of connections than with nodes outside the community [NG04]. Being more connected, elements belonging to the same community might represent functional units or software modules, leading to practical applications of community detection in the software engineering field. Community detection is usually performed with methods like hierarchical clustering and partitional clustering [For10].

Newman et al. proposed some algorithms for community detection [NG04, New04, CNM04], which are now extensively used in the literature, along with the definition of modularity, a quality function which measures the strength of a network partition into communities [NG04]. In this work we use one of the algorithms proposed by Newman et al. to understand whether such a division can be related to software modularity as defined in software engineering and, eventually, whether the community metrics may be useful to predict bugs in future releases. The issue of community structure and its application to software engineering has recently been addressed in a similar fashion by Šubelj and Bajec. The authors applied some community detection algorithms to several Java systems to show that their evident community structure does not correspond to the package structure devised by the designer [ŠB11].

In this work we consider the concept of modularity used in software engineering, which is often associated with high values of cohesion and low values of coupling metrics [MMCG99, MM07, ASTC11]. In the literature there are previous attempts to use software network theory to characterize a modularity function and relate it to good programming practice [MMCG99, MM07, ASTC11]. We show that a modularity function based on pure network topology can be used to assess the goodness of a division into clusters, and that it is related to the software engineering concept of the separation of components. Our work uses a methodology based on community structure, which is very lightweight from the computational point of view. We also introduce, for the first time, some concepts from social network analysis that allowed us to draw the same conclusions as the authors of the aforementioned papers, while also obtaining information on the predictability of software defectiveness.

3 Modularity and Community Structure

The concept of community derives from social networks. Nodes belonging to the same community are densely connected among each other, while they are poorly connected with nodes which are not in the same community. Inside a network, a community structure is the specific way in which the nodes are arranged in communities [For10]. Since there can be more than one community structure, we need a quantitative measure to evaluate the best division. The first and most used measure is the modularity [NG04]. Although there are some caveats to take into account while using it [GMC10, FB07], modularity is considered the standard measure for the quality of a community structure.

The original definition is based on the fact that a random graph does not possess a community structure, hence providing a null model for comparison with the community structure of real networks [NG04]. Consider a complex network of n nodes and m edges. In order to represent it we can use the following definitions:

• Adjacency matrix:

$$A_{vw} = \begin{cases} 1 & \text{if nodes } v \text{ and } w \text{ are connected,} \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

where v and w are two nodes belonging to communities $c_v$ and $c_w$;

• Number of edges:

$$m = \frac{1}{2} \sum_{vw} A_{vw}, \tag{2}$$

• Node degree, namely the number of its connected edges:

$$k_v = \sum_{w} A_{vw}. \tag{3}$$

If we postulate that nodes are grouped in communities, then we can compute the fraction of edges that fall within communities rather than across communities. In order to have a significant community structure this fraction has to be large. Given two communities $c_v$ and $c_w$, this fraction can be written as follows:

$$\frac{\sum_{vw} A_{vw}\,\delta(c_v, c_w)}{\sum_{vw} A_{vw}} = \frac{1}{2m} \sum_{vw} A_{vw}\,\delta(c_v, c_w), \tag{4}$$

where $\delta(c_v, c_w)$ is the Kronecker delta. In order to obtain a reliable measure we need to compare the previous value to a null model. The most used null model is a graph with the same community structure but random connections among its nodes. In the random case, the expected value of the fraction of edges attached to nodes in community $c_v$ and to nodes in community $c_w$ is given by:

$$\frac{1}{2m} \sum_{vw} \frac{k_v k_w}{2m}\,\delta(c_v, c_w). \tag{5}$$

Subtracting (5) from (4), we get the modularity as defined in Newman [New06]:

$$Q = \frac{1}{2m} \sum_{vw} \left( A_{vw} - \frac{k_v k_w}{2m} \right) \delta(c_v, c_w). \tag{6}$$
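To make the modularity of Eq. (6) concrete, the following is a minimal sketch in Python for an undirected, unweighted graph without self-loops. The function name `modularity` and the input conventions (an edge list plus a node-to-community dictionary) are illustrative assumptions, not part of the original study. The sum over node pairs is regrouped per community, which is equivalent to Eq. (6).

```python
from collections import defaultdict


def modularity(edges, community):
    """Return Q of Eq. (6) for an undirected, unweighted graph.

    edges: iterable of (v, w) pairs, one per undirected edge (no self-loops).
    community: dict mapping each node to its community label.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")

    degree = defaultdict(int)      # k_v
    internal = defaultdict(int)    # edges with both ends in community c
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
        if community[v] == community[w]:
            internal[community[v]] += 1

    degree_sum = defaultdict(int)  # sum of k_v over the nodes of community c
    for v, k in degree.items():
        degree_sum[community[v]] += k

    # Eq. (4) regrouped per community: sum_c internal_c / m
    # Eq. (5) regrouped per community: sum_c (degree_sum_c / 2m)^2
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single edge, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, community)
```

For this toy graph the two-triangle split gives Q = 5/14, while placing all nodes in a single community gives Q = 0, which illustrates why Q rewards partitions with more internal edges than the null model expects.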

A good community structure corresponds to values of Q as close as possible to 1. However, in real networks, modularity values that reveal a good community structure fall in a range from 0.3 to 0.7 [NG04]. Lower values are associated with a weak community structure, whereas strong community structures, although rare in practice, may have modularity values higher than 0.7 and approaching 1.

4 Experimental Setting

In this work we analyze the structure and evolution of the Eclipse IDE, a popular software system written in Java, using its associated software networks. We first retrieved the network associated with each subproject into which the system is structured, by parsing the source code retrieved from the corresponding Source Control Manager (SCM) and looking for relationships such as collaboration and inheritance. In this way we obtained class-level networks, where nodes are classes and edges are the mentioned relationships among classes (i.e. inheritance, collaboration, etc.). Afterwards we annotated each class with the corresponding number of bugs, retrieved using the procedure described in the following subsection.

4.1 Retrieving Defectiveness Data

We considered the number of defects (bugs) as the main indicator of software quality. We collected data about the bugs of a software system by mining its associated Bug Tracking System (BTS). Bugzilla is the BTS adopted by Eclipse, where defects are tagged with a unique ID number. An entry in the BTS is referred to by the generic term 'Issue', and there is usually no information about the classes associated with defects. Usually all the changes performed on the source code are reported in the SCM. To obtain a correct mapping between Issues and the related Java classes, we analyzed the SCM log messages to identify commits associated with maintenance operations where Issues are fixed.
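The commit-to-Issue mapping described in this subsection can be sketched roughly as follows. This is a simplified illustration, not the tooling used in the study: the function name `map_bugs_to_files`, the shape of the `commits` input and the `closed_bug_ids` set are all hypothetical, and the release-based filtering is reduced to a plain membership check against the set of closed bug IDs.

```python
import re

# Any run of digits is a candidate Issue-ID; most candidates are noise
# (dates, release numbers, copyright years) and must be filtered out.
CANDIDATE_ID = re.compile(r"\d+")


def map_bugs_to_files(commits, closed_bug_ids):
    """Count, for every file, the distinct closed bugs fixed by commits touching it.

    commits: iterable of (message, files) pairs taken from the SCM log.
    closed_bug_ids: set of integer IDs that the BTS classifies as closed bugs.
    """
    bugs_per_file = {}
    for message, files in commits:
        candidates = {int(token) for token in CANDIDATE_ID.findall(message)}
        fixed = candidates & closed_bug_ids   # keep only numbers that are real bug IDs
        for path in files:
            bugs_per_file.setdefault(path, set()).update(fixed)
    return {path: len(bugs) for path, bugs in bugs_per_file.items() if bugs}


# Illustrative input only; these messages and IDs are made up.
commits = [
    ("Fix for bug 12345 (copyright 2004)", ["A.java"]),
    ("Release 3.1 cleanup", ["A.java", "B.java"]),
    ("12345 and 777 fixed", ["B.java"]),
]
bug_counts = map_bugs_to_files(commits, closed_bug_ids={12345, 777})
```

Using sets per file means a bug mentioned in several commits on the same file is counted once, which is one possible design choice; the original procedure may count differently.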

We analyzed the text of commit messages, looking for Issue-IDs. Every positive integer (including dates, release numbers, copyright updates, etc.) is a potential Issue-ID in the BTS. In order to avoid wrong mappings between a file and the corresponding Issue, we filtered out any number which did not refer to bug fixes. This operation was performed by associating Issue-IDs with files belonging to the same release, and analyzing the commit logs to perform the mapping between Issues and classes. We associated with each release the Issues that are Bugs and that were classified as "closed" in the BTS. Bugs labeled as "closed" are very rarely re-opened, so they can be considered permanently associated with a release. The maintenance operations in Bugzilla are associated with files, called Compilation Units (CUs), which may contain one or more classes. Thus, when a file contained more than one class, we decided to assign all the defects to the biggest class of that Compilation Unit. At the end of this process we obtained a network where each node is annotated with the number of bugs of the corresponding class.

4.2 Metrics Analyzed

We computed the following metrics:

• System Size: the number of classes of the software system.
• Average Bug Number (ABN): or bug density, namely the number of defects found in a system divided by the number of classes.
• Modularity: a measure of the strength of the obtained community structure, as defined in Section 3.
• Number of Communities (NOC): the number of disjoint communities in which the network is partitioned.
• Clustering Coefficient (CC): the average probability that if vertex i is connected to vertex j and vertex j to vertex k, then vertex i will also be connected to vertex k. It can be defined as follows:

$$C_i = \frac{3 \times \text{number of triangles in the network}}{\text{number of connected triples of nodes}}, \tag{7}$$

where a triangle is a set of three nodes all connected with each other, and a triple centered around node i is a set composed of two nodes connected to node i and the node i itself.

The clustering coefficient for the whole graph is the average of the $C_i$'s:

$$C = \frac{1}{n} \sum_{i=1}^{n} C_i, \tag{8}$$

where n is the number of nodes in the network [New03].
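As an illustration of Eq. (8), the sketch below averages per-node clustering coefficients over an undirected graph. It assumes the usual local definition, in which $C_i$ is the number of triangles through node i divided by the number of triples centered on node i; this reading, the function name `average_clustering`, and the choice to set $C_i = 0$ for nodes with fewer than two neighbours are assumptions and may differ from the exact computation used in the study. Nodes that appear in no edge are not counted.

```python
from itertools import combinations


def average_clustering(edges):
    """Average of local clustering coefficients C_i over all nodes (Eq. 8).

    C_i is taken here as (triangles through i) / (triples centered on i);
    nodes with fewer than two neighbours contribute C_i = 0.
    """
    neighbours = {}
    for v, w in edges:
        if v == w:
            continue  # ignore self-loops
        neighbours.setdefault(v, set()).add(w)
        neighbours.setdefault(w, set()).add(v)

    total = 0.0
    for node, adjacent in neighbours.items():
        k = len(adjacent)
        if k < 2:
            continue
        # Count pairs of neighbours that are themselves connected.
        triangles = sum(
            1 for a, b in combinations(adjacent, 2) if b in neighbours[a]
        )
        total += triangles / (k * (k - 1) / 2)
    return total / len(neighbours) if neighbours else 0.0


# A triangle (0, 1, 2) with one pendant node attached to node 2.
cc = average_clustering([(0, 1), (1, 2), (0, 2), (2, 3)])
```

In this small example nodes 0 and 1 have $C_i = 1$, node 2 has $C_i = 1/3$ and the pendant node has $C_i = 0$, so the average is 7/12.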

[Table 1: main features of the analyzed releases. Columns: Release, Size, N. of sub-projects, N. of defects.]

We analyzed 5 releases of Eclipse, whose main features are presented in Table 1.

Each release is structured in almost independent sub-projects. The total number of sub-projects analyzed amounts to 375, with more than 60000 nodes (classes) and more than 350000 defects.

We detected the modularity and its associated community structure for each subproject of each release using the Clauset-Moore-Newman (CMN) community detection algorithm devised by Clauset et al. [CNM04]. The latter is an agglomerative clustering algorithm that performs a greedy optimization of the modularity. The community structure retrieved corresponds to the maximum value of the modularity. Moreover, we retrieved the number of communities in which the networks are structured, the corresponding maximum value of the modularity, and the nodes associated to each community. The CMN algorithm implementation used is that provided by the R package igraph [CN06].
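The idea of greedy modularity optimization can be sketched in a few lines of pure Python. The code below is a naive illustration under stated assumptions: it is not the igraph implementation used in the study and does not use the efficient data structures of [CNM04]. It starts from singleton communities, repeatedly merges the pair of connected communities with the largest modularity gain, and keeps the partition with the highest modularity encountered. The function name `greedy_modularity_communities` is hypothetical.

```python
def greedy_modularity_communities(edges):
    """Naive greedy agglomeration on an undirected, unweighted graph.

    Returns (partition, modularity) for the best partition seen while merging.
    """
    edges = [(v, w) for v, w in edges if v != w]
    m = len(edges)
    if m == 0:
        raise ValueError("the graph has no edges")

    nodes = sorted({v for edge in edges for v in edge})
    members = {i: {v} for i, v in enumerate(nodes)}  # community id -> nodes
    label = {v: i for i, v in enumerate(nodes)}      # node -> community id
    a = {i: 0.0 for i in members}                    # fraction of edge ends per community
    e = {}                                           # (i, j), i < j -> edges between / 2m
    for v, w in edges:
        i, j = sorted((label[v], label[w]))
        e[(i, j)] = e.get((i, j), 0.0) + 1 / (2 * m)
        a[i] += 1 / (2 * m)
        a[j] += 1 / (2 * m)

    q = -sum(x * x for x in a.values())              # modularity of all singletons
    best_q, best = q, [set(s) for s in members.values()]
    while e:
        # Gain of merging i and j: 2 * (e_ij - a_i * a_j).
        (i, j), gain = max(
            ((pair, 2 * (f - a[pair[0]] * a[pair[1]])) for pair, f in e.items()),
            key=lambda item: item[1],
        )
        # Merge community j into community i and rewire the links of j.
        members[i] |= members.pop(j)
        a[i] += a.pop(j)
        merged = {}
        for (x, y), f in e.items():
            if (x, y) == (i, j):
                continue
            x, y = (i if x == j else x), (i if y == j else y)
            key = (min(x, y), max(x, y))
            merged[key] = merged.get(key, 0.0) + f
        e = merged
        q += gain
        if q > best_q:
            best_q, best = q, [set(s) for s in members.values()]
    return best, best_q


# Two triangles joined by a single edge.
partition, q = greedy_modularity_communities(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
)
```

On the toy graph the procedure recovers the two triangles as communities; NOC is then simply `len(partition)`. For networks of realistic size a library implementation is preferable, since this sketch rescans all community pairs at every step.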

We then performed a correlation analysis among the network metrics and the software metrics (size and defectiveness) for each release on its own and also for the entire dataset, in order to have relevant statistics. Finally, in order to investigate the system evolution, we studied the relationship between network metrics and software defectiveness by cumulating the first and the second releases in a single set, then adding the third release to this first set to obtain a second set, and so on. Specifically, we evaluated whether, with a starting dataset of N releases, the best fitting curve for the cumulated N − 1 releases could also be a good fit for the N-th release. To measure the forecast accuracy we adopted a χ² test. This way we were able to make predictions about the next release starting from those cumulated in the previous set.

5 Results

We performed different analyses among the network metrics and the software metrics for each release and the entire dataset. First and foremost, we noted a saturation effect of the number of defects and the clustering coefficient as the size of the analyzed systems increases. Our results show a general tendency for certain metrics to converge to a narrow range of values when the number of classes increases.

Figures 1, 2 and 3 show the relationship between system size (number of classes) and, respectively, modularity, average bug number (ABN) and clustering coefficient (CC).

All the metrics display more or less the same behavior. For relatively small systems, where the number of classes is roughly below 100, the metrics assume values in a wide range. Specifically, the defect density (or ABN) ranges from 0 up to 25, while the clustering coefficient and the modularity, whose maximum possible value is 1, range from 0 to 0.6-0.7. For system sizes roughly between 100 and 500 classes, the ranges become narrower: the ABN lies between 2 and 12, the clustering coefficient lies between 0.05 and 0.2, and the modularity between 0.3 and 0.6. Finally, for fairly large systems, where the number of classes is above 500, the metrics stabilize, showing small oscillations and eventually converging asymptotically to precise values.

Another interesting result is the monotonic increase of the NOC metric with system size, reported in Figure 4. After an initial nonlinear behavior the curve aligns along a straight line. Additionally, our results show a significant correlation between NOC and both ABN and CC. Figure 5 displays a non-linear decay of the maximum values of ABN and CC versus NOC. It is worth pointing out that other network metrics, such as the mean degree or the average path length, do not show a similar trend.

In particular, Figures 5a and 5b show respectively the distributions of the maximum values of ABN and CC with respect to NOC for all the sub-projects for each release. Each point corresponds to the maximum value of the corresponding metric computed on all the projects with the same number of communities.

[Figure 5: maximum ABN (a) and maximum CC (b) versus NOC.]
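The construction of the points plotted in Figure 5 (the maximum of each metric over all sub-projects sharing the same NOC) can be sketched as follows. The function name `max_by_noc` and the tuple layout are hypothetical, and the numbers in the usage example are made up for illustration only.

```python
def max_by_noc(subprojects):
    """Return {NOC: (max ABN, max CC)} over sub-projects sharing the same NOC.

    subprojects: iterable of (noc, abn, cc) tuples, one per sub-project.
    """
    envelope = {}
    for noc, abn, cc in subprojects:
        best_abn, best_cc = envelope.get(noc, (abn, cc))
        envelope[noc] = (max(best_abn, abn), max(best_cc, cc))
    return dict(sorted(envelope.items()))


# Illustrative values only, not taken from the Eclipse dataset.
points = max_by_noc([(3, 5.0, 0.4), (3, 8.0, 0.2), (5, 2.0, 0.1)])
```

Note that the two maxima for a given NOC may come from different sub-projects, as in the example above where the largest ABN and the largest CC for NOC = 3 belong to different entries.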

As these Figures illustrate, these values seem to follow a power-law like trend.

The distributions of the maximum values, when plotted on a log-log scale, are well fitted by a straight line, suggesting two power-law-like relationships for the maximum values of both ABN and CC versus the number of communities, provided that the systems have the same number of communities. We applied a power-law best fitting algorithm in order to check this hypothesis, finding acceptable fits for the maximum values of CC and for the maximum values of ABN versus the number of communities. The power law parameters are reported in Tables 2 and 3. Table 4 shows the best fitting results, reporting the degrees of freedom and the normalized χ² for the relationship between these two metrics.

6 Discussion

We analyzed a large software system, Eclipse, using complex network theory with the aim of achieving a better understanding of software properties by means of the associated software network. The application of the CMN algorithm confirms that the analyzed software networks present a meaningful community structure [ŠŽBB15, CMOT13]. Furthermore, the results show the existence of meaningful relationships between software quality, represented by the average bug number (ABN), and community metrics, in particular the number of communities (NOC) and the clustering coefficient (CC).

The presence of a strong community structure in a software system reflects a strong organization of classes in groups, where the number of dependencies among classes belonging to the same community (internal dependencies) is higher than the number of dependencies among classes belonging to different communities (external dependencies).

From a software engineering perspective this goal might be achieved by adopting good programming practices, where class responsibilities are well defined, classes are strongly interconnected in groups, and coupling among groups is kept low. Within this perspective the network modularity can be seen as a proxy for software modularity.

Figure 1 shows that, with the exception of subprojects with fewer than 500 classes, the modularity does not increase with size, converging to values that range from 0.6 to 0.7. As reported in Section 3, these values indicate that the community structure is significant and well defined. At the same time Figure 4 shows that there is a linear relationship between the number of communities and the number of classes.

Such a relationship is not trivial: the modularity and the number of communities are theoretically independent of the size [GMC10] and, in general, the number of communities does not increase with network size. Moreover, there may be large networks divided into a small number of communities, depending on the network's topology. As a consequence our findings suggest that, in the examined case, it is possible to partition the software networks into a set of communities, where the number of communities is correlated with system size.

Figures 2 and 3 report, respectively, the relationship of ABN and CC with system size. Both metrics have a similar trend, with values converging to a range between 4 and 12 for ABN and between 0.2 and 0.6 for CC. This means that when the system size increases the number of defects stabilizes, and the same happens to the clustering coefficient. We already mentioned the significant increase of the number of communities with system size. Since the increase of NOC is not trivial, this led us to assume that there might be a relationship between the topology of software networks, which determines the number of communities, and the other metrics.

Figures 5a and 5b show the distributions of the maximum values of CC and ABN for the projects with the same number of communities. As previously mentioned, this relationship seems to follow a power-law trend. The power law relating the NOC and the maximum values of ABN indicates that the community metrics, specifically the number of communities, can be exploited in order to evaluate the evolution of the defectiveness of a software system. In other words, once the relationship between NOC and the maximum values for ABN is known, one can evaluate approximately the maximum ABN in a future release of the same system, by computing the number of communities for that release.
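The prediction argument above can be written compactly. The symbols $a$ and $\alpha$ below are generic fit parameters introduced here for illustration; they are assumptions and do not correspond to specific values reported in the paper's tables.

```latex
% Power-law envelope for the maximum ABN at a given number of communities
\mathrm{ABN}_{\max}(\mathrm{NOC}) \approx a \,\mathrm{NOC}^{\alpha}, \qquad \alpha < 0 .

% On a log-log scale the same relationship is a straight line,
% which is why a linear fit of the logarithms recovers a and alpha:
\log \mathrm{ABN}_{\max} \approx \log a + \alpha \,\log \mathrm{NOC} .

% For a future release whose network splits into n^* communities,
% the expected upper bound on the average bug number is then
\mathrm{ABN}_{\max}(n^{*}) \approx a \,(n^{*})^{\alpha} .
```

The sign condition $\alpha < 0$ reflects the non-linear decay of the maxima with NOC reported in Section 5; the estimate is only as reliable as the fit obtained on the earlier releases.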

This way, we might assume that systems with the same number of communities should have a number of defects per class lower than a given value. The same argument applies to the clustering coefficient of systems having the same number of communities. The relationship between CC and NOC is again a power law. This implies that if the NOC of an initial release (or of a set of releases) is known, one can in principle predict that in the following releases the clustering coefficient will not be greater than a certain value.

Figures 6 and 7 show, in a log-log scatterplot, the best fitting lines for the data discussed above. Each color corresponds to one set of releases cumulated according to the chronological order. The Figures confirm that the power-law like relationship appears in every cumulated release and is a regular and stable behavior throughout software evolution.

The power law parameters for the mentioned metrics and for each set of cumulated releases (see Section 4) are reported in Tables 2 and 3. As we can see, they do not change significantly from one cumulated set to the next. This suggests the existence of a progressively more stable behavior during software evolution, where the power-law fit becomes more accurate and its parameters tend to fixed values as new releases are added to the cumulated dataset. These results might help developers to estimate the expected maximum ABN for software systems with a known community partition.
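The cumulated-release check can be sketched as below. This is an illustration under several assumptions rather than the fitting procedure used in the study: the power law is fitted by ordinary least squares on the logarithms, the goodness of fit is a Pearson χ² divided by the number of held-out points, and the cumulation starts from a single release. All function names are hypothetical, and the input values must be strictly positive with at least two distinct NOC values in the training set.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log a + alpha * log x; returns (a, alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    alpha = sxy / sxx
    return math.exp(mean_y - alpha * mean_x), alpha


def normalized_chi_square(xs, ys, a, alpha, dof=None):
    """Pearson chi-square of observed ys against a * x**alpha, divided by dof.

    dof defaults to the number of points (an assumption: no parameter is
    estimated on the held-out data).
    """
    chi2 = sum((y - a * x ** alpha) ** 2 / (a * x ** alpha) for x, y in zip(xs, ys))
    return chi2 / (len(xs) if dof is None else dof)


def evaluate_cumulated(releases):
    """Fit on the first N-1 releases pooled together, then score release N.

    releases: chronological list of lists of (noc, max_metric) points.
    Returns one (a, alpha, normalized chi-square) tuple per scored release.
    """
    results = []
    for n in range(1, len(releases)):
        pooled = {}  # keep the maximum of the metric for every NOC value
        for release in releases[:n]:
            for noc, value in release:
                pooled[noc] = max(value, pooled.get(noc, value))
        a, alpha = fit_power_law(list(pooled), list(pooled.values()))
        xs, ys = zip(*releases[n])
        results.append((a, alpha, normalized_chi_square(xs, ys, a, alpha)))
    return results
```

A small normalized χ² for the held-out release indicates that the curve fitted on earlier releases still describes the new one; fitted parameters that barely move from one cumulated set to the next correspond to the stabilization discussed above.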

The two power laws indirectly connect the maximum values of ABN with the maximum values of CC in systems having the same number of communities. Such a relationship can be made explicit by reporting directly the scatter plot of the two metrics, where each metric is computed for the same number of communities. Such plots are reported in Figure 6 for all the cumulated releases, and show that the two metrics are linearly correlated. Table 5 reports the correlation coefficient as well as the degrees of freedom and the χ² for such data for all cumulated releases. It shows that the correlation coefficient increases and the χ² decreases as new releases are added to the cumulated data, indicating a more stable relationship between the two metrics as the system evolves.
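The step from two power laws to a linear relationship can be made explicit by eliminating NOC. The parameters $a_1, \alpha_1, a_2, \alpha_2$ are generic fit parameters introduced here for illustration; they are assumptions, not values taken from the paper's tables.

```latex
% The two fitted envelopes, for the same number of communities:
\mathrm{ABN}_{\max} \approx a_1 \,\mathrm{NOC}^{\alpha_1}, \qquad
\mathrm{CC}_{\max} \approx a_2 \,\mathrm{NOC}^{\alpha_2} .

% Solving the second relationship for NOC and substituting into the first:
\mathrm{NOC} \approx \left( \frac{\mathrm{CC}_{\max}}{a_2} \right)^{1/\alpha_2}
\quad\Longrightarrow\quad
\mathrm{ABN}_{\max} \approx a_1 \left( \frac{\mathrm{CC}_{\max}}{a_2} \right)^{\alpha_1/\alpha_2} .

% When the two exponents are close, alpha_1 / alpha_2 is close to 1 and
\mathrm{ABN}_{\max} \approx \frac{a_1}{a_2}\, \mathrm{CC}_{\max} .
```

The relationship is exactly linear only when the two exponents coincide; for exponents that are merely similar, the scatter plot is close to a straight line over the observed range, which is consistent with a high linear correlation coefficient.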

These results can be explained by noting that the larger the clustering coefficient, the higher the number of classes linked to each other and the higher the probability of diffusion of defects among them. The topology of a software network is characterized by hubs, and the clustering coefficient in the area of the graph around any hub is higher by definition. If one hub is affected by one or more defects, it is more likely …

7 Threats to Validity

Internal Validity. We conclude that the relationship between software defectiveness and community metrics can indicate that a high level of network modularity is related to low defectiveness, thus suggesting good programming practices. However, the relationships we found could be due to other phenomena, and deserve to be further investigated. What we propose here is just one possible formal explanation of our results.

External Validity. We considered only one Java system, Eclipse, and analyzed its evolution. Our results should be validated on other systems and made more general. We are currently extending the analysis to many releases of NetBeans.

Construct Validity. The rules reported in Section 5 might be faulty in some cases, not being able to correctly map defects to CUs [AMADP07]. There …

8 Conclusions

In this work we presented a longitudinal analysis on the evolution of a large software system with a focus on software defectiveness. We used a complex network approach to study the structure of the system and its modularity by computing the community structure of the associated network. After having retrieved the number of defects and associated them to the software network classes, we performed a topological analysis of the system defectiveness. We found a power law relationship between the maximum values of the clustering coefficient, the average bug number and the division in communities of the software network. This led to a linear relationship between the maximum values of the clustering coefficient and of the average bug number. We showed that such a relationship can in principle be used as a predictor for the maximum value of the average bug number in future releases.

References

[AMADP07] Ayari, Meshkinfam, G. Antoniol, and M. Di Penta. Threats on building models from CVS and Bugzilla repositories: the Mozilla case study. In Proceedings of the 2007 Conference of the Center for Advanced Studies on Collaborative Research, CASCON '07, pages 215–228, New York, NY, USA, 2007. ACM.

[ASTC11] Mahir Arzoky, Stephen Swift, Allan Tucker, and James Cain. Munch: An efficient modularisation strategy to assess the degree of refactoring on sequential source code checkings. In ICST Workshops, pages 422–429. IEEE Computer Society, 2011.

[BAJ00] Scale-free characteristics of random networks: the topology of the world wide web. Physica A: Statistical Mechanics and its Applications, 281:69–77, 2000.

[BC99] Design Rules: The Power of Modularity Volume 1. MIT Press, Cambridge, MA, USA, 1999.

[CK94] IEEE Trans. Software Eng., 20(6):476–493, June 1994.

[CL04] Bug propagation and debugging in asymmetric software structures. Physical Review E, 70(4):046109, October 2004.

[CMM+11] IEEE Trans. Software Eng., 37(6):872–877, 2011.

[CMOT13] Giulio Concas, Cristina Monni, Matteo Orrù, and Roberto Tonelli. A study of the community structure of a complex software network. In Proceedings of the 2013 ICSE Workshop on Emerging Trends in Software Metrics, WETSoM

[CN06] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006.

[CNM04] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, pages 1–6, 2004.

[CS05] Chaoming Song, Shlomo Havlin, and Hernán A. Makse. Self-similarity of complex networks. Nature, 433(4):392–395, January 2005.

[DLR10] Marco D'Ambros, Michele Lanza, and Romain Robbes. An extensive comparison of bug prediction approaches. In Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on, pages 31–41. IEEE, 2010.

[FB07] S. Fortunato and M. Barthélemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36, 2007.

[FMS01] Sergio Focardi, Michele Marchesi, and Giancarlo Succi. A stochastic model of software maintenance and its implications on extreme programming processes, pages 191–206. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.

[For10] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[GMC10] B.H. Good, Y.A. De Montjoye, and A. Clauset. Performance of modularity maximization in practical contexts. Physical Review E, 81(4):046106, 2010.

[GN01] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A., 99(cond-mat/0112110):8271–8276, Dec 2001.

[HBB+12] Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A systematic literature review on fault prediction performance in software engineering. Software Engineering, IEEE Transactions on, 38(6):1276–1304, 2012.

[MM07] Brian S. Mitchell and Spiros Mancoridis. On the evaluation of the bunch search-based software modularization algorithm. Soft Comput., 12(1):77–93, August 2007.

[OWB05] Thomas J. Ostrand, Elaine J. Weyuker, and Robert M. Bell. Predicting the location and number of faults in large software systems. Software Engineering, IEEE Transactions on, 31(4):340–355, 2005.

[ZN08] Thomas Zimmermann and Nachiappan Nagappan. Predicting defects using network analysis on dependency graphs. In Proceedings of the 30th Interna…