=Paper=
{{Paper
|id=Vol-2978/msr4sa-paper2
|storemode=property
|title=Two Different Facets of Architectural Smells Criticality: An Empirical Study
|pdfUrl=https://ceur-ws.org/Vol-2978/msr4sa-paper2.pdf
|volume=Vol-2978
|authors=Ilaria Pigazzini, Davide Foppiani, Francesca Arcelli Fontana
|dblpUrl=https://dblp.org/rec/conf/ecsa/PigazziniFF21
}}
==Two Different Facets of Architectural Smells Criticality: An Empirical Study==
Two different facets of architectural smells criticality: an empirical study Ilaria Pigazzini1 , Davide Foppiani1 and Francesca Arcelli Fontana1 1 University of Milano - Bicocca, Milan, Italy Abstract Architectural smells (AS) represent symptoms of problems at architectural level that have an impact on architectural debt. It is important to identify among them the most critical ones, so that developers can prioritize them for their removal. In order to evaluate the criticality of AS, in this paper we consider two facets: the PageRank metric, to assess the centrality of a smell in a project, and Severity, a metric to estimate the cost-solving of smells. We have proposed these two metrics in a previous work and here we perform an empirical analysis of the evolution and correlation of these metrics in the version history of 10 projects (at least 22 versions each, 264 projects in total). The analysis of the evolution is useful in order to identify which architectural smells types tend to become more critical. The analysis of the correlation is useful to study whether the criticality of a smell has an influence on how much it costs to remove it, and vice-versa. Keywords Architectural Smells, Architectural Debt, Architectural Smells criticality, Architectural Smells evolution, Empirical study 1. Introduction cated in a central part of the project and other facets. Moreover, while criticality gives us information about Architectural debt can be monitored through differ- the removal urgency, there is another aspect connected ent issues, such as through the presence of architectural to the removal of smells which can be considered and smells in a project. Architectural smells (AS) are de- quantified. AS have a cost-solving (cost of fixing, cost of sign decision that negatively impact internal software refactoring), which is the effort needed to remove a smell qualities and are symptoms of architectural debt [1], [2]. from the system [6]. 
This variable depends less from the Software systems affected by AS are difficult to main- perception of the developers but more from the specific tain and evolve, hence it is important to study them and characteristics of the interested AS. identify solutions to support developers in their removal, To resume, during AS management, developers can in particular the removal of the most critical ones (AS take into consideration two distinct aspects concerning prioritization). smells: their criticality, i.e., how much is important to In such terms, criticality of an AS models the degree remove them as soon as possible (urgency), and their of removal urgency associated to the AS, i.e., the smell cost-solving, i.e., how much it cost to remove them. should be removed as soon as possible because it affects a Both criticality and cost-solving are particularly rele- part of the project which is important for the developers vant for developers when making decisions about AS (e.g., frequently changed or highly referenced) or has a management: for instance, to choose which smell to strong impact on the maintainability of the project. refactor first [1][5]. A developer may prefer to refac- However, it is not trivial to model and evaluate the tor first the smells which require less time to be solved importance and urgency of the removal of an AS. In the (low cost solving) to quickly enhance the quality level literature, the identification of the best metrics to be used of the project, instead of fixing the most critical ones. 
for the evaluation of criticality is considered a complex On the other hand, the developer may decide to remove task [3], mainly because it is tightly connected to how the most difficult/critical ones, but to make this decision, smells are perceived by developers [4] and such percep- different factors must be considered: it can be too ex- tion is subjected to many variables, such as the developer pensive and risky; too many changes could compromise experience, code ownership [5], whether the smell is lo- other parts. Perhaps, the most difficult AS was created by design choice and no better solution is available, as in MSR4SA’21: 1st International Workshop on Mining Software the case of cycles created by callbacks for event listeners Repositories for Software Architecture, September 15–17, 2021, Virtual in GUI components [1][7]. Finally, the most critical AS email: i.pigazzini@campus.unimib.it (I. Pigazzini); could appear in a not-central part of the project, such d.foppiani@campus.unimib.it (D. Foppiani); as a deprecated, unessential package, and could be not arcelli@disco.unimib.it (F. A. Fontana) interesting for the developers. orcid: 0000-0003-2629-6762 (I. Pigazzini); 0000-0002-1195-530X In this paper, we consider two metrics, PageRank and (F. A. Fontana) © 2021 Copyright for this paper by its authors. Use permitted under Creative Severity, and we propose to use them to model the criti- Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) cality (PageRank) and the cost-solving (Severity) of three AS based on dependency issues, namely: Cyclic Depen- relevance of the metrics for each type of smell. Other- dency, Unstable Dependency, and Hub-Like Dependency wise, no correlation, we could infer that there is no link (see Section 3.2). 
PageRank, inspired by the well-known between the urgency of removing a smell and the cost of metric from Brin and Page [8] is a measure that estimates removing a smell, as computed by the proposed metrics. whether an AS is located in an important part of the In this case a developer can decide to not remove an AS project [9], where the importance is evaluated according with low PageRank and high cost solving, and to remove to how many parts of a project depend on the ones in- first an AS with high PageRank and low cost solving, volved in the AS (as a sort of centrality measure of the since this AS could become more critical since it appears AS). We want to use PageRank as a proxy of AS criticality, in a central part of the project. i.e., the higher the PageRank, the higher the criticality of We aim with our study to provide developers insights the AS. Severity, defined by us, is a measure associated on the evaluation of criticality and cost solving of AS to each specific type of AS and is computed through the through the PageRank and Severity metrics. Severity metrics used to detect each smell. Our idea is that the AS metric is focused on evaluating the cost solving in terms characteristics, such as the number of system dependen- of the number of project dependencies affected by the cies it affects, are useful to estimate how much effort is smells, while PageRank is more focused on the impor- required to refactor the smell (cost-solving), e.g., a smell tance (criticality) of the affected components (classes/- which involves many dependencies will require a deep packages). Hence, both metrics could be useful to de- analysis and a lot of time to be solved. 
termine the prioritization of AS, i.e., help the developer We have considered these two metrics in a previous in choosing which smell to refactor first depending on study [10], where PageRank and Severity have been eval- the developer’s needs, i.e., the need to address the most uated on only 6 single-version projects. We have now critical ones first or the most expensive ones. extended the study by conducting an empirical evaluation We have considered the two metrics in the computa- on a total of 264 versions of 10 projects with the aim to tion of an Architectural Debt Index [11] based on the empirically study criticality and cost-solving during the number of the AS found in a project and their critical- evolution of the projects, and investigate whether there ity measured in terms of both PageRank and Severity is a correlation between the trends of the two metrics, to metrics. The results of this study can be useful also to answer the following Research Questions (RQ): evaluate whether the two metrics truly capture different RQ1: How PageRank and Severity of the smells evolve aspects of a smell or not. In the latter case, one of the in the version history of a project? two metrics could be left out. RQ2: Can we find some correlation between PageRank The paper is organized through the following sections: and Severity by considering each type of smell? in Section 2 we introduce some related work, in Section 3 The answer to RQ1 aims to analyze if the values of the we describe the study design, in Section 4 we provide the two metrics tend to increase or decrease in the version results we obtained to answer the RQs. Section 5 presents history of the projects. Moreover, we are interested in the discussion of the results and Section 6 outline some understanding which AS type(s) tend to become more threats to the validity of the work. 
Finally in Section 7 we critical and/or difficult to remove in the version history conclude our work by outlining some threats to validity of a project, where the criticality is evaluated through and future developments. the PageRank and the cost solving is estimated with the Severity metric. In this way a developer can decide to focus the attention on these types of smells first. 2. Related Work The answer to RQ2 allows to evaluate the correlation We first briefly describe some empirical studies on between the criticality and the cost solving of a smell. If architectural smells. for example the values tend to go together, highly corre- Le et al. [12] investigated the nature and impact of lated, for a specific type of AS, it means that as long as the architectural smells through a large empirical study, by smell is critical, it is also hard to remove and vice-versa: exploiting the projects’ issue trackers to analyze the im- in this case, the two metrics would produce the same pact of smells on software development; Arcelli et al. [13] ranking of smells, i.e., the prioritization of the smells studied the relationship between code smells and archi- would be equal by considering one of the two metrics tectural smells and found that architectural smells are interchangeably. In case of positive correlation, it could independent from code smells; Sharma et al. [14] con- be also in any case interesting to analyze possible out- ducted an empirical study to investigate the relationship liers with different values of the metrics (high/low) and between design and architectural smells in C# projects. better capture the relevance of the metrics (see examples Finally Herold [15] performed a preliminary empirical in Section 4.2). We could find that the two metrics have study to investigate the relationship between architec- a strong positive correlation for a specific type of smell, tural smells and architectural degradation, the latter mea- and not for other smells. 
This scenario can outline the sured through the number of architectural violations. tend the previous work on a large number of projects With respect to these previous papers, we performed (10 projects, 22 versions each, for a total of 264 versions), an empirical study focused on the evaluation of different and we analyze the correlation existing between the two facets of architectural smells criticality, not previously metrics through Spearman and Kendall correlation tests. studied in the literature according to our knowledge. Moreover, we study the evolution of the metrics in the We now outline some related works done in the liter- project history. Finally, in this paper we propose to ex- ature on the evaluation of criticality and prioritization ploit PageRank as a proxy for criticality, and Severity as of code or architectural smells. What distinguishes the a metric to estimate cost-solving. following works is the kind of information used to es- timate the priority of a smell. For instance, concerning code smells, Vidal et al. [16] presented an approach to 3. Case Study Design identify the most critical smells based on a combination We describe below the analyzed projects, the data we of three criteria, namely: past component modifications, collected on AS, their Severity and PageRank and the important modifiability scenarios for the system and rel- data preparation and analysis. evance of the kind of smell. Also Rani et al. [17] pro- posed a methodology for code smell prioritization. First, it detects smelly classes using structural information of 3.1. Analyzed projects source code, then mines change history, as done by Vidal We analyzed several versions of 10 projects, for a total et al., to prioritize the smells. Always according to code of 264 versions (see Table 1). Most of the chosen projects smells studies, Sae Lim et al. [18] exploited the developers’ were picked from the Qualitas Corpus [22]. 
We selected context (a list of issues extracted from an issue tracking these projects since they have already been the subject of system) to define priority. Instead, Arcelli et al.[19] pro- several studies, they are publicly available and enable the posed a severity index of the smells based on how the replication of this study. These data were also combined metric thresholds used for the smells detection are ex- with data from the MavenRepository1 , also publicly avail- ceeded. Similarly, Guggulothu et al. [20] proposed a pri- able. We considered several releases for each project. To oritisation approach for four code smells (Long Method, easily compare the different projects, we chose roughly Feature Envy, God Class and Data Class), depending on the same amount of versions and preferred different re- their impact on design quality, where the impact is mea- leases, major or minor, over patches when possible. In sured depending on the overcome of a set of metrics such general, in this paper we use the term version to refer as coupling, size, complexity and cohesion. Moreover both minors and majors. The chosen systems also vary recently, Pecorelli [5] proposed a machine learning ap- in size and number of smells (see Table 1). In the column proach to prioritise the application of refactoring on code group last version we report the projects’ size (in terms smells. They generated a rank of code smells according to of classes/packages) and number of AS of the last version the perceived criticality that developers assign to them. of the project in the development history. According to architectural smells, there are fewer stud- ies about prioritization. Martini et al. [1], performed a study on the analysis of the most critical AS through 3.2. Data collection the feedback of the developers of two industrial projects. 
Architectural smells we performed this study by con- The smells having top refactoring priority in the opinion sidering the AS detected with the Arcan tool2 [23] de- of practitioners are the ones with the highest negative scribed below, but other AS can be considered in the impact on the maintainability and evolvability of the future [24]. We limited the analysis on the following project. On the same line, Oliveira et al. [21] investi- three smells since they are the only ones for which we gated criteria that developers use in practice to prioritize developed a Severity metric, contextually to the defini- design-relevant smelly elements with the aim to develop tion of our Architectural Debt Index (ADI) [11]. a set of prioritization heuristics. From their results, two out of nine heuristics reached an average precision higher • Unstable Dependency (UD) describes a component than 75%. Finally, Vidal et al. [3] presented and evaluated (package) dependent on other components that a set of five criteria for ranking groups of code smells as are less stable than itself; This may cause a ripple indicators of architectural problems in evolving systems. effect of changes in the system. Instability of a According to our knowledge no extensive work has component is measured with the metric proposed been previously done on the analysis of the evolution by Martin [25] as the ratio of outgoing depen- and correlation between criticality and cost-solving, eval- dencies to the total number of dependencies of uated in terms of PageRank of AS and Severity metrics. 1 https://mvnrepository.com/ In a previous study [10] we only manually analyzed the 2 Download: https://drive.google.com/file/d/ two metrics by considering only 6 projects. Here we ex- 1WNx7FHRykbyOIxz92cDQpSL2rl_gEJ4P/view?usp=sharing Table 1 Summary of the dataset Project #V #Cl. 
#Pkg #AS #CD-Cl #CD-Pkg #HL-Cl #HL-Pkg #UD #AS last version all versions Ant 24 1157 62 413 8131 2064 15 92 243 10545 Azureus 24 8148 480 7722 97172 29801 41 70 3478 130562 FreeCol 24 1310 35 1395 30488 1652 86 54 356 32636 Hibernate 24 2980 170 1172 12910 9026 18 129 1267 23350 JMeter 26 681 55 307 3930 2681 79 54 574 7318 JGraph 24 188 20 118 2602 79 79 1 51 2812 Jstock 24 865 19 665 13585 619 64 8 247 14523 Jung 22 705 40 133 894 658 31 27 270 1880 Lucene 31 1425 22 408 6241 407 9 59 187 6903 Weka 44 2423 80 1090 25241 5200 102 41 1042 31626 Acronyms. V: version, CL: classes, Pkg: packages, AS: Architectural Smells, CD: Cyclic Dependency, HL: Hub-Like Dependency, UD: Unstable Dependency the component. Consequences: The components cal, since they have higher maintenance costs. In particu- with an high instability are more prone to change lar, Cyclic Dependency is one of the most common smell with respect to the more stable ones, this means and is considered the most critical smell by developers that the component which depends on less stable [1]. components is forced to change along with them. We used our Arcan tool for the AS detection, since it is • Hub-Like Dependency (HL) arises when a compo- publicly available, allows to easily detect the considered nent (class or package) has outgoing and incom- AS and has been previously validated [28]. We computed 3 ing dependencies with a large number of other the PageRank and Severity metrics related to the three components [26]; The affected component rep- types of smells and we reported the “granularity level” resents a unique point of failure for the system of the considered smells, either class or package. Our and also a dependency bottleneck. 
Consequences: distinction between AS at class and package level can be The component in the middle of the hub is a mapped to another nomenclature adopted in the litera- unique point of failure and a dependency bot- ture [14] which calls “design smells” our class AS and tleneck. Moreover the logic inside a Hub-Like “architectural smells” our package AS. Dependency is hard to understand, and the smell We now report the definition of the two metrics under causes change ripple effect. analysis. • Cyclic Dependency (CD) refers to a component Severity is a metric that we defined for each type of (class or package) that is involved in a chain of re- AS to estimate the AS cost solving. In particular, it evalu- lations that break the desirable acyclic nature of a ates different features of the smells which have an impact component’s dependency structure. Components on the effort needed for its removal. For example, for the involved in a CD cannot be reused in isolation estimation of Hub Like Dependency cost-solving, we con- and a change on one component propagates to sider the number of dependencies affected by the smell, the other ones. Consequences: The components because this metric gives us information about how many involved in a dependency cycle can be hardly parts of code a developer investigate/change/remove to released, maintained or reused in isolation. More- refactor the HL. over, a change on one affected component will Severity is computed differently for each type of AS: propagate towards all the other ones involved in for UD it is evaluated through the number of bad de- the cycle. pendencies which cause the Unstable Dependency smell, where for bad dependency we mean a reference from We considered these three AS because they are some the affected package to the less stable packages i.e. 
if of the most studied smells [27][13][11][15] and they are package B has high instability and package A has low also perceived as important and detrimental for the qual- instability, the dependency A → B is a bad dependency; ity of the software systems by practitioners[1][24]. In for HL the Severity corresponds to the total number of particular, these smells are based on dependency issues. dependencies which cause the HL smell (dependencies Dependencies are of great importance in software archi- from a class/package directed to the hub and vice-versa); tecture: components that are highly coupled and with a high number of dependencies are considered more criti- 3 https://figshare.com/articles/dataset/_/13636472 for CD it is computed through the number of compo- of the data. The resulting dataset is a collection of 262155 nents involved in the cycle multiplied with the minimum smells categorized by project, version, type, granularity number of times a cycle repeats itself. A dependency be- level, Severity and PageRank. Table 1 shows the sum- tween two components can occur multiple times because mary of our dataset, where we report the project size and we count the number of references from a class/package the number of smell instances, divided by type: for each to the others. For instance, if there is a cycle between project (considering all versions in history) we show the package A and B, caused by 5 classes belonging to A number of detected CD at class and package level (CD-Cl calling B, and B’s classes calling A 3 times, the Severity and CD-Pkg), of detected HL at class and package level value is equal to 3. This means that the cycle is repeated (HL-C and HL-P), of detected UD (UD) and the sum of all at least 3 times. project’s AS (AS). A smell instance corresponds to one PageRank of an AS evaluate the criticality (urgency) occurrence of the smell in the project, thus the reported associated to an AS. 
The PageRank value of a smell in- numbers are the counts of all the occurrences. stance is computed as the mean value of the PageRank of We studied two different aspects: 1) Severity and the components (class or package) affected by the smell. PageRank evolution, in order to answer RQ1; 2) Severity The intuition is that components with high PageRank are and PageRank correlation to answer RQ2. important inside the project, where the importance [9] Concerning evolution, we analyzed the evolution of the corresponds to how many parts of the project depend two metrics for each type of smell in order to study their on the component. PageRank of a component is com- different behaviours. We summarised the data for each puted through the PageRank formula implemented by version by averaging the values of both metrics with Brin and Page [8], executed on the dependency graph of respect to the total number of smells detected in the the project: version. We conducted trend analysis to understand how (︃ 𝑛 )︃ the average values of PageRank and the different types 1−𝑑 ∑︁ 𝑃 𝑅(𝑝𝑘 ) of Severity evolve overtime. We exploited the Mann- 𝑃 𝑅(𝑣) = +𝑑 (1) 𝑁 𝐶(𝑝𝑘 ) Kendall test, which is a non-parametric test able to assess 𝑘=1 if there is a monotonic upward or downward trend of the where, the vertex 𝑣 is a node of the dependency graph variable of interest over time. The null hypothesis for associated to a project; 𝑃 𝑅(𝑣) is the value of PageRank this test is that there is no monotonic trend in the series. of the vertex 𝑣; 𝑁 is the total number of AS in the project; The alternate hypothesis is that a trend exists. This trend 𝑃𝑘 is a vertex with at least a link directed to 𝑣; 𝑛 is the can be positive, negative, or non-null. 
We also analyzed number of the 𝑝𝑘 vertexes; 𝐶(𝑝𝑘 ) is the number of links the two metrics’ evolution respect to the evolution of of vertex 𝑝𝑘 ; 𝑑 (damping factor) is a custom factor fixed the size, where size corresponds to the number of classes at 0.85, a default value defined by Brin and Page. and packages of the projects under analysis, to check The range of the metric spans from 0 to infinite and whether the two things are correlated. We ran Spearman higher values correspond to higher criticality. To as- and Kendall correlation tests to investigate this aspect. sociate a unique value of PageRank to a single smell Concerning the correlation analysis of PageRank and instance, we compute the mean value of the PageRank Severity, we first tested the normality of our data. Given scores of all the components involved in the smell. In this the large size of our dataset, we used Q-Q plots [29] to way, smells of any type can be ordered by this metric, evaluate if the measures do not follow a normal distri- from the most critical to the less critical. bution. A Q-Q plot is a graphical method for comparing Both Severity and PageRank are based on the project two probability distributions by plotting their quantiles dependencies, however they are computed in differ- against each other. These plots are often used when the ent ways and aim to evaluate two distinct aspects: im- dataset is large enough to introduce bias in the Shapiro- portance/criticality (for PageRank) and dependencies Wilk test [30], which is a commonly used normality test. structure/cost-solving (for Severity). Hence, we per- The Q-Q plots of all the projects showed a non-normal be- formed a correlation analysis to investigate the possible haviour. Then, we tested the correlation between Severity relationship between the two metrics. and PageRank for each version of the projects. We com- puted the correlation on the metrics data of all smell type 3.3. 
Data preparation and analysis together and also separately for each smell type. We also computed the correlation separately for each granularity We ran Arcan and we pre-processed the output data in level, to contextualize the results at package or class level. order to produce the dataset for our analysis. Other than Given the non-normal distribution of our data, we chose Arcan, we exploited the Knime platform4 and R program- the Spearman’s [31] and Kendall’s [32] coefficients to ming language5 for the processing and statistical analysis calculate the correlation. 4 https://www.knime.com/knime-analytics-platform 5 https://www.r-project.org/ 4. Results Table 2 Mann-Kendall results - PageRank We report the results both for PageRank and Severity evolution and their correlation. At the end of each section, Project Trend P-value Reference AS we also report the answer to the relative RQs. All the Ant + 0.009867 CD-package results and plots can be found in the replication package6 . Azureus + 2.77E-05 CD-class Azureus + 0 CD-package Azureus + 3.81E-06 HL-class 4.1. Evolution results Azureus + 0 HL-package In order to answer RQ1, we checked the trend of PageR- Azureus + 0 UD-package ank and Severity values throughout the versions of the Hibernate + 0.030929 CD-class projects. For every project and for both PageRank and Hibernate + 0 CD-package Hibernate + 0.000677 HL-class Severity, we run the Mann-Kendall test. Table 2 and Hibernate + 0 HL-package 3 show the outcome of the test, namely reporting the Hibernate + 2.38E-07 UD-package Trend (increasing + or decreasing -), the P-value and Jgraph + 0.001375 HL-class the Reference AS (the type of smell which the PageRank refers to) for PageRank, while Granularity (class or pack- age) for Severity. The tables report only results where Table 3 𝑝 − 𝑣𝑎𝑙𝑢𝑒 < 0.05, i.e., there is a trend. 
We outline from Mann-Kendall results - Severity Table 2 and 3 the following remarks: Project Trend P-value Granularity • PageRank and Severity show a trend during time Severity - CD in few projects. We found PageRank trend in four Azureus + 0.024848 class over ten projects, while Severity showed a trend Hibernate + 0.000291 class in five projects. The tables only show the projects Jstock - 0.025486 package with a positive or negative trend. Jung - 0.039728 class Lucene - 3.25E-06 class • Concerning the Severity of CDs, we observed Severity - HL both positive and negative trend at class level, in Jstock + 0.002832 package 4 projects, and a negative trend at package level, Lucene + 0.000422 class in one project. Weka + 0.002132 class • Concerning the Severity of HLs, we had examples Weka + 0.005923 package at both class and package level of positive trends. • The Severity metric of Unstable Dependency smell does not show a trend in any project, and we may say that this should be true also for PageRank com- could notice only one project (Hibernate) where puted on classes correlated with the number of classes: the PageRank of UD smells had a trend. instead, their correlation values range in [−0.87, 0.9] with median equals to 0.45. This result may be due to We extended our analysis to see if the project size (mea- the high variance in the number of classes among the sured by number of classes and packages) is correlated projects (variance which is smaller for what concerns with the values of PageRank and Severity. We tested it packages). for each project over its development evolution. We then analyzed the distribution of the correlation on the data of RQ1 Answer How PageRank and Severity of the all projects. The first thing we noticed is that the number smells evolve in the version history of a project?: in of classes and packages increases overtime. 
However, this does not happen for Severity and PageRank values: we do not find a significant correlation between size and the metrics, except for the correlation between PageRank computed on AS on packages and the number of packages in the system. The correlation values, computed for all the projects, range in [0.34, 0.89], with a median equal to 0.74. We hypothesise that the correlation is high for PageRank because of how it is computed: the higher the number of packages, the more the dependencies and the higher the PageRank values are. For this reason, one

general we found that the average values of PageRank and Severity do not have a trend (neither positive nor negative) over time. Concerning the comparison with projects' size evolution, we found out that PageRank computed on packages shows a positive correlation with the evolution of the number of packages: this is reasonable, since the increase/decrease in the number of packages has an impact also on the creation/deletion of package dependencies, thus on PageRank.

<sup>6</sup> https://figshare.com/articles/dataset/_/13636472

{| class="wikitable"
|+ Table 4: Severity and PageRank correlation (last version only)
! Project !! Version !! Spearman !! P-value (Spearman) !! Kendall !! P-value (Kendall)
|-
| Ant || 1.10.7 || 0.582 || < 0.001 || 0.46 || < 0.001
|-
| Azureus || 4.8.1.2 || 0.871 || < 0.001 || 0.704 || < 0.001
|-
| FreeCol || 0.10.7 || 0.809 || < 0.001 || 0.64 || < 0.001
|-
| Hibernate || 4.2.2 || 0.719 || < 0.001 || 0.573 || < 0.001
|-
| JMeter || 5.2.1 || 0.575 || < 0.001 || 0.455 || < 0.001
|-
| JGraph || 5.13.0.0 || 0.664 || < 0.001 || 0.581 || < 0.001
|-
| Jstock || 1.0.6w || 0.621 || < 0.001 || 0.494 || < 0.001
|-
| Jung || 1.7.6 || 0.643 || < 0.001 || 0.506 || < 0.001
|-
| Lucene || 4.3.0 || 0.411 || < 0.001 || 0.33 || < 0.001
|-
| Weka || 3.7.9 || 0.53 || < 0.001 || 0.428 || < 0.001
|}

4.2. Correlation results

In order to answer RQ2, we report in Table 4 the results of the correlation between Severity and PageRank, evaluated on all AS, not considering their type. As can be seen, the majority of the projects presented a strong positive correlation (ρ > 0.6).

In the following, we discuss the correlation results by considering the different types of AS. The coefficient values are bounded between:
* (CDs) 0.427 and 0.942 with Spearman's and between 0.214 and 0.812 with Kendall's;
* (UDs) 0.253 and 1 with Spearman's and between 0 and 1 with Kendall's;
* (HLs) -1 and 1 for both coefficients.

Due to their low occurrences, the metrics of HL and UD usually present a strong correlation. However, there are cases in some project versions where the scarce number of detected smells makes this calculation misleading: in some cases the correlations are very high, in others they are very low (they fluctuate).

On the other hand, CD is the most common smell in the dataset and this has an effect on the correlation values: they largely vary in the dataset, making CD the smell type with some of the highest correlation values and at the same time the smell with some of the lowest correlation values. However, a clear result is that for all projects the correlation at package level between PageRank and Severity of CD is strong, with the exception of JGraph (see the following paragraph).

Observations on weak and negative correlations

From Table 4 we can observe that some projects, such as JMeter, Lucene, Weka and Ant, show a weak correlation between the two metrics. We aim to investigate these behaviours and we start by analyzing two projects: JMeter, having a weak correlation, and JGraph, showing non-positive correlation values for CDs at package level. We focus on the last version of both projects because it is associated with the most updated codebase, hence we assume it is the most exemplary for them.

By analyzing the correlation coefficients of JMeter's AS, we noticed that when they are calculated separately for each AS type, they present higher values than the ones reported in Table 4. Using Spearman's as an example: 0.575 is the ρ value obtained without considering the AS type, and 0.638, 0.9 and 0.881 are the values for CDs, HLs and UDs respectively. The values seem to imply that, while the correlation in general is weak for this project, when we look at the specific smell types the two metrics tend to be positively correlated. However, the number of HLs and UDs in JMeter is very small compared to the number of CDs. Since correlations computed on few observations are not significant, we can conclude that only the correlation value computed on CDs is relevant for JMeter, and it explains why the overall correlation value is weak for this project.

If we closely analyze the evolution of JGraph, initially it shows a negative correlation for CDs at package level, which progressively increases (0.2 in version 5.10.0.1) and becomes strongly positive (0.73) in version 5.12.1.0. We further investigated what caused these changes in the correlation values. In the first versions with negative correlation we observed 3 CDs at package level, two of them with similar Severity and PageRank values and one with a much higher PageRank value, probably the cause of the negative correlation. After version 5.10.0.1 we noticed the presence of a 4th one. Its Severity was in line with the others and so was its PageRank: this likely balanced the PageRank values and subsequently caused the increase of the positive correlation.

Hence we can conclude that the variations in the correlation values from negative to positive were due to the introduction of a new smell instance, whose metric values strongly impacted the correlation values because of, as for JMeter, the generally small number of smell instances. However, this specific case does not represent a common behaviour in our dataset.
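To make the sensitivity to small samples concrete, the following is a minimal Python sketch of the two rank coefficients used in this section, Spearman's ρ and Kendall's τ-b. It is not the R code used in the study. The function names are hypothetical, and the Severity and PageRank values are invented for illustration: they only mimic the JGraph situation described above, with three package-level CDs, one of which has a much higher PageRank, followed by a fourth instance in line with the others.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant and tied pairs."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + only_x_tied)
                 * (concordant + discordant + only_y_tied))
    return (concordant - discordant) / denom


# Hypothetical values for three package-level CDs (not taken from the dataset):
# two similar instances and one with a much higher PageRank.
severity_3 = [5.0, 5.2, 4.8]
pagerank_3 = [0.10, 0.11, 0.40]

# The same three instances plus a fourth one in line with the first two.
severity_4 = severity_3 + [5.5]
pagerank_4 = pagerank_3 + [0.15]

for label, sev, pr in (("3 CDs", severity_3, pagerank_3),
                       ("4 CDs", severity_4, pagerank_4)):
    print(label, round(spearman_rho(sev, pr), 3), round(kendall_tau_b(sev, pr), 3))
```

With these made-up values, Spearman's ρ moves from -0.5 to -0.2 and Kendall's τ-b from about -0.33 to 0 when the fourth instance is added, which shows how a single observation can shift a coefficient computed on so few points.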
RQ2 Answer. Can we find some correlation between PageRank and Severity by considering each type of smell? We found that the smell type showing the highest PageRank and Severity correlation is CD at package level. The other types, HL and UD, also showed strong correlations, but given the lower number of HL and UD instances, we consider the result regarding CDs more meaningful. We also investigated specific cases of projects with weak and negative correlation, but we did not find further insights.

5. Discussion

We found a strong correlation between PageRank and Severity. This means that, concerning the analysed data and the considered smells, the criticality and the cost-solving of smells go hand in hand: in the case of this study, if a smell affects an important (unimportant) part of the system, then it will also have a high (low) cost-solving. We can outline two different interpretations of the results. The positive correlation could be due to the nature of the two metrics, both bound to the dependencies of the system. In this case, the conclusion would be that PageRank and Severity capture the same characteristic of the smells, and one of the two is redundant. As a consequence, in the ADI computation [11], only one of the two metrics should be used to evaluate AS criticality.

However, given how the metrics are defined, they differ from each other. Severity takes into account the dependencies which are directly affected by the smell, while PageRank also considers dependencies outside the smell which converge towards the components affected by the smell. Take for instance the Severity of CD, which is based on the dependencies forming the cycle and their weight. If the components involved in the cycle have a high PageRank, it means that they are involved in many dependencies with many other parts of the system, which is unrelated to the fact that those components are part of the cycle. With such a premise, the two metrics would capture different aspects of the smells, and their positive correlation could mean that critical parts of the system attract AS which are more expensive to solve.

Moreover, one could ask what difference PageRank makes when we could use simple coupling metrics such as FanIn and FanOut [25]. However, when evaluating the coupling of a component, such metrics take into account only the incoming or outgoing dependencies of the component itself. On the contrary, the PageRank value of a component takes into account the PageRank of all the components belonging to the dependency graph. In particular, the PageRank of a component is defined recursively and depends on the number of dependencies and the PageRank metric of all the components that reference it (incoming dependencies). In this way, a component having many incoming dependencies but referenced by components with few incoming dependencies is less important than a component with many incoming dependencies that is referenced by other components with many incoming dependencies. That is why PageRank is said to evaluate the importance of a component with respect to the entire graph.

Our analysis shows that the positive correlation is particularly evident in the case of CD. The reasons behind the high correlation for CD Severity can be multiple: a part of code with high PageRank is affected by more changes [33] than other parts of the code, and is thus more open to the introduction of (structurally complex) CDs. This is interesting because in the literature we find studies which confirm the correlation in the other direction [12], i.e., the presence of AS makes the components more prone to change: if our hypothesis can be further corroborated, the conclusion would be that the relationship between PageRank and CD Severity is like a dog chasing its tail, one triggers the other. Another reason could be that components with high PageRank are involved in a high number of dependencies, thus making it easier for a developer to wrongly introduce new entangled dependencies and create cycles that are very difficult to remove.

To conclude, there is a positive correlation between AS Severity and PageRank; however, at the moment we cannot draw a definitive conclusion about how to interpret this finding. We plan to conduct a validation of our results with developers from industry, who could evaluate the ability of the two metrics to capture criticality and cost-solving, and also manually check the specific cases where smells have high PageRank and high Severity.

6. Threats to validity

Our study presents some threats to validity which we address by following the structure suggested by Yin [34].

Concerning the construct validity, the two metrics, PageRank and Severity, may not measure what we claim they do, i.e., the criticality of the AS. However, this is a preliminary study and the next step is to validate the current definition of the metrics with developers, by letting them check whether the prioritization produced by the metrics is significant or not. Other threats regarding the internal validity could be related to the choice of the statistical methods used for the correlation analysis and their implementation in the used tools, but we exploited well-known and widely used tools (the R language). Moreover, we did not validate the two metrics by investigating the perception of developers of PageRank and Severity. However, PageRank was adopted in other studies as a software ranking metric [35][33][36], and we plan for the future to validate Severity in an industrial setting.

Threats to external validity could be caused by the fact that we only analyzed projects written in Java and publicly available. However, we partially mitigate such issues by analyzing 10 projects with at least 22 versions each. Moreover, the high number of CDs could have reduced the effect of the other types of detected AS in the results. We could have mitigated this aspect by sampling the CD instances and thus balancing the dataset. However, this would additionally reduce the size of the dataset, undermining the validity of the CD results too. In the future, we aim to extend the study with additional data for the smells and further remedy this threat.

Finally, concerning threats to the reliability of the study, Arcan could be subject to a systematic bias in the detection, partially mitigated by the provided replication package and the fact that the tool has been validated on open source and industrial projects [23] [28] [1] [24]. Moreover, some threats could occur due to errors in the data extraction and preparation phases, resulting in errors in the construction of the dataset. However, we carefully checked every stage of the data preparation and relied on the support of Knime<sup>7</sup>.

<sup>7</sup> https://www.knime.com/knime-analytics-platform

7. Conclusion

We performed an empirical analysis of two software metrics, Severity and PageRank, on 10 projects with at least 22 versions each, in order to evaluate the cost-solving and criticality of AS. We also performed this evaluation with the aim of better understanding whether both metrics have to be used in the ADI computation, i.e., whether they provide hints on the criticality evaluation of the AS that both have to be taken into consideration. To conclude, from the analysis of the evolution and correlation of PageRank and Severity we found out that the two metrics tend to be correlated, except for some extreme cases. It could be useful for developers to analyze the specific cases where AS have high PageRank and low Severity (and vice-versa), since they could indicate smell instances which require a tailored prioritization rationale: a developer may be interested in identifying cases where the smell is easy to solve (low Severity) but in an important part of the system (high PageRank), and choose to refactor this case first; on the contrary, s/he could decide not to refactor a smell that is difficult to solve (high Severity) and in an unimportant (low PageRank) part of the system. We can assert that such smells are a signal that both PageRank and Severity could be useful to define different refactoring priorities, from different points of view. In particular, PageRank can be used to identify parts of code which need a continuous inspection, while Severity can be used to evaluate the cost-solving for the AS removal.

The smell type presenting the strongest correlation is CD, suggesting that highly critical components (with high PageRank) attract CDs that are hard to solve (with high Severity). Thus, developers should pay a lot of attention to the CD smell, also because CD is the most common AS and in particular those at package level tend to become more critical in terms of PageRank in the history of the project development. However, we do not exclude the possibility that the two metrics have a strong correlation because they capture the same aspects of smells. In that case, we could exploit this information to refine the computation of our ADI and leave out one of the two.

In any case, we need to conduct a validation of both metrics and of the correlation results, with expert developers or by comparing the ranking provided by the metrics with information coming from issue trackers [12]. The intuition is that a component affected by a critical smell (with high PageRank and high Severity) should also be affected by many issues. In addition to the validation, in future developments we aim to extend this work by analyzing more projects, also coming from industry, and verify if the same results can be confirmed. In this paper, we addressed the criticality evaluation of three AS, but the study can be extended also to other kinds of AS, e.g., Scattered Functionality and Feature Concentration, two smells which violate the separation of concerns principle. Given that such smells are not based on dependency issues, we shall define additional criticality metrics for them.

References

[1] A. Martini, F. Arcelli Fontana, A. Biaggi, R. Roveda, Identifying and prioritizing architectural debt through architectural smells: a case study in a large software company, in: Proc. of the European Conf. on Software Architecture (ECSA), Springer, 2018.
[2] N. A. Ernst, S. Bellomo, I. Ozkaya, R. L. Nord, I. Gorton, Measure it? manage it? ignore it? software practitioners and technical debt, in: Proc. of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, 2015.
[3] S. Vidal, W. Oizumi, A. Garcia, A. Díaz Pace, C. Marcos, Ranking architecturally critical agglomerations of code smells, Science of Computer Programming 182 (2019) 64–85.
[4] D. Taibi, A. Janes, V. Lenarduzzi, How developers perceive smells in source code: A replicated study, Information and Software Technology 92 (2017) 223–235.
[5] F. Pecorelli, F. Palomba, F. Khomh, A. De Lucia, Developer-driven code smell prioritization, in: Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, ACM, 2020.
[6] L. Rizzi, F. A. Fontana, R. Roveda, Support for architectural smell refactoring, in: Proceedings of the 2nd International Workshop on Refactoring, IWoR@ASE, 2018, pp. 7–10.
[7] I. Pigazzini, F. A. Fontana, B. Walter, A study on correlations between architectural smells and design patterns, J. Syst. Softw. (2021).
[8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Seventh International World-Wide Web Conference, 1998.
[9] I. Şora, A pagerank based recommender system for identifying key classes in software systems, in: 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics, 2015.
[10] F. A. Fontana, I. Pigazzini, C. Raibulet, S. Basciano, R. Roveda, Pagerank and criticality of architectural smells, in: Proceedings of the 13th European Conference on Software Architecture, ECSA 2019, 2019.
[11] F. A. Fontana, P. Avgeriou, I. Pigazzini, R. Roveda, A study on architectural smells prediction, in: 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2019.
[12] D. M. Le, D. Link, A. Shahbazian, N. Medvidovic, An empirical study of architectural decay in open-source software, in: 2018 IEEE International Conference on Software Architecture (ICSA), 2018.
[13] F. A. Fontana, V. Lenarduzzi, R. Roveda, D. Taibi, Are architectural smells independent from code smells? an empirical study, Journal of Systems and Software 154 (2019) 139–156.
[14] T. Sharma, P. Singh, D. Spinellis, An empirical investigation on the relationship between design and architecture smells, Empirical Software Engineering (2020).
[15] S. Herold, An initial study on the association between architectural smells and degradation, in: Software Architecture, Springer International Publishing, Cham, 2020, pp. 193–201.
[16] S. A. Vidal, C. Marcos, J. A. Díaz-Pace, An approach to prioritize code smells for refactoring, Autom. Softw. Eng. 23 (2016) 501–532.
[17] A. Rani, J. K. Chhabra, Prioritization of smelly classes: A two phase approach (reducing refactoring efforts), in: 2017 3rd International Conference on Computational Intelligence Communication Technology (CICT), 2017.
[18] N. Sae-Lim, S. Hayashi, M. Saeki, Context-based approach to prioritize code smells for refactoring, Journal of Software: Evolution and Process (2017).
[19] F. A. Fontana, M. Zanoni, Code smell severity classification using machine learning techniques, Knowl. Based Syst. 128 (2017).
[20] T. Guggulothu, S. A. Moiz, An approach to suggest code smell order for refactoring, in: International Conference on Emerging Technologies in Computer Engineering, Springer, 2019, pp. 250–260.
[21] A. Oliveira, L. Sousa, W. Oizumi, A. Garcia, On the prioritization of design-relevant smelly elements: A mixed-method, multi-project study, in: Proceedings of the XIII Brazilian Symposium on Software Components, Architectures, and Reuse, SBCARS '19, Association for Computing Machinery, 2019.
[22] R. Terra, L. F. Miranda, M. T. Valente, R. S. Bigonha, Qualitas.class Corpus: A compiled version of the Qualitas Corpus, Software Engineering Notes 38 (2013).
[23] F. A. Fontana, I. Pigazzini, R. Roveda, M. Zanoni, Automatic detection of instability architectural smells, in: 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, 2016.
[24] F. A. Fontana, F. Locatelli, I. Pigazzini, P. Mereghetti, An architectural smell evaluation in an industrial context, ICSEA 2020 (2020) 78.
[25] R. C. Martin, Object oriented design quality metrics: An analysis of dependencies, ROAD 2 (1995).
[26] G. Suryanarayana, G. Samarthyam, T. Sharma, Refactoring for Software Design Smells, 1 ed., Morgan Kaufmann, 2015.
[27] D. Sas, P. Avgeriou, F. A. Fontana, Investigating instability architectural smells evolution: An exploratory case study, in: Int. Conference on Software Maintenance and Evolution, ICSME, 2019.
[28] F. Arcelli Fontana, I. Pigazzini, R. Roveda, D. A. Tamburri, M. Zanoni, E. D. Nitto, Arcan: A tool for architectural smells detection, in: Int'l Conf. Software Architecture (ICSA 2017) Workshops, 2017.
[29] M. B. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data, Biometrika 55 (1968) 1–17.
[30] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (1965) 591–611.
[31] C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15 (1904) 72–101.
[32] M. Kendall, J. Gibbons, Rank Correlation Methods, Charles Griffin Book, E. Arnold, 1990.
[33] R. Wang, R. Huang, B. Qu, Network-based analysis of software change propagation, The Scientific World Journal 2014 (2014).
[34] R. Yin, Case Study Research: Design and Methods, Applied Social Research Methods, SAGE Publications, 2009.
[35] F. Perin, L. Renggli, J. Ressia, Ranking software artifacts, in: 4th Workshop on FAMIX and Moose in Reengineering (FAMOOSr 2010), volume 120, Citeseer, 2010.
[36] W.-f. Pan, B. Li, Y.-t. Ma, B. Jiang, Identifying the key packages using weighted pagerank algorithm, Acta Electronica Sinica 42 (2014) 2174.