
Two different facets of architectural smells criticality: an empirical study

Ilaria Pigazzini

Davide Foppiani

Francesca Arcelli Fontana

University of Milano-Bicocca, Milan, Italy

Architectural smells (AS) represent symptoms of problems at the architectural level that have an impact on architectural debt. It is important to identify the most critical ones among them, so that developers can prioritize them for removal. In order to evaluate the criticality of AS, in this paper we consider two facets: the PageRank metric, to assess the centrality of a smell in a project, and Severity, a metric to estimate the cost-solving of smells. We proposed these two metrics in a previous work, and here we perform an empirical analysis of the evolution and correlation of these metrics in the version history of 10 projects (at least 22 versions each, 264 versions in total). The analysis of the evolution is useful in order to identify which architectural smell types tend to become more critical. The analysis of the correlation is useful to study whether the criticality of a smell has an influence on how much it costs to remove it, and vice-versa.

Keywords: Architectural Smells, Architectural Debt, Architectural Smells criticality, Architectural Smells evolution, Empirical study

MSR4SA'21: 1st International Workshop on Mining Software Repositories for Software Architecture, September 15–17, 2021, Virtual
Email: i.pigazzini@campus.unimib.it (I. Pigazzini); d.foppiani@campus.unimib.it (D. Foppiani); arcelli@disco.unimib.it (F. A. Fontana)
ORCID: 0000-0003-2629-6762 (I. Pigazzini); 0000-0002-1195-530X (F. A. Fontana)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Architectural debt can be monitored through different issues, such as the presence of architectural smells in a project. Architectural smells (AS) are design decisions that negatively impact internal software qualities and are symptoms of architectural debt [1], [2]. Software systems affected by AS are difficult to maintain and evolve, hence it is important to study them and identify solutions to support developers in their removal, in particular the removal of the most critical ones (AS prioritization).

In such terms, the criticality of an AS models the degree of removal urgency associated with the AS, i.e., the smell should be removed as soon as possible because it affects a part of the project which is important for the developers (e.g., frequently changed or highly referenced) or has a strong impact on the maintainability of the project.

However, it is not trivial to model and evaluate the importance and urgency of the removal of an AS. In the literature, the identification of the best metrics to be used for the evaluation of criticality is considered a complex task [3], mainly because it is tightly connected to how smells are perceived by developers [4], and such perception is subject to many variables, such as the developer experience, code ownership [5], whether the smell is located in a central part of the project, and other facets.

Moreover, while criticality gives us information about the removal urgency, there is another aspect connected to the removal of smells which can be considered and quantified. AS have a cost-solving (cost of fixing, cost of refactoring), which is the effort needed to remove a smell from the system [6]. This variable depends less on the perception of the developers and more on the specific characteristics of the AS concerned.

To summarize, during AS management, developers can take into consideration two distinct aspects concerning smells: their criticality, i.e., how important it is to remove them as soon as possible (urgency), and their cost-solving, i.e., how much it costs to remove them.

Both criticality and cost-solving are particularly relevant for developers when making decisions about AS management: for instance, to choose which smell to refactor first [1][5]. A developer may prefer to refactor first the smells which require less time to be solved (low cost-solving) to quickly enhance the quality level of the project, instead of fixing the most critical ones. On the other hand, the developer may decide to remove the most difficult/critical ones, but to make this decision, different factors must be considered: it can be too expensive and risky; too many changes could compromise other parts. Perhaps the most difficult AS was created by design choice and no better solution is available, as in the case of cycles created by callbacks for event listeners in GUI components [1][7]. Finally, the most critical AS could appear in a non-central part of the project, such as a deprecated, unessential package, and could be of no interest to the developers.

In this paper, we consider two metrics, PageRank and Severity, and we propose to use them to model the criticality (PageRank) and the cost-solving (Severity) of three AS based on dependency issues, namely: Cyclic Dependency, Unstable Dependency, and Hub-Like Dependency (see Section 3.2). PageRank, inspired by the well-known metric from Brin and Page [8], is a measure that estimates whether an AS is located in an important part of the project [9], where the importance is evaluated according to how many parts of a project depend on the ones involved in the AS (as a sort of centrality measure of the AS). We want to use PageRank as a proxy of AS criticality, i.e., the higher the PageRank, the higher the criticality of the AS. Severity, defined by us, is a measure associated to each specific type of AS and is computed through the metrics used to detect each smell. Our idea is that the AS characteristics, such as the number of system dependencies it affects, are useful to estimate how much effort is required to refactor the smell (cost-solving), e.g., a smell which involves many dependencies will require a deep analysis and a lot of time to be solved.

We have considered these two metrics in a previous study [10], where PageRank and Severity were evaluated on only 6 single-version projects. We have now extended the study by conducting an empirical evaluation on a total of 264 versions of 10 projects with the aim to empirically study criticality and cost-solving during the evolution of the projects, and investigate whether there is a correlation between the trends of the two metrics, to answer the following Research Questions (RQ):

RQ1: How do PageRank and Severity of the smells evolve in the version history of a project?

RQ2: Can we find some correlation between PageRank and Severity by considering each type of smell?

The answer to RQ1 aims to analyze whether the values of the two metrics tend to increase or decrease in the version history of the projects. Moreover, we are interested in understanding which AS type(s) tend to become more critical and/or difficult to remove in the version history of a project, where the criticality is evaluated through the PageRank and the cost-solving is estimated with the Severity metric. In this way a developer can decide to focus the attention on these types of smells first.

The answer to RQ2 allows us to evaluate the correlation between the criticality and the cost-solving of a smell. If, for example, the values tend to go together (highly correlated) for a specific type of AS, it means that as long as the smell is critical, it is also hard to remove and vice-versa: in this case, the two metrics would produce the same ranking of smells, i.e., the prioritization of the smells would be equal by considering one of the two metrics interchangeably. In case of positive correlation, it could in any case also be interesting to analyze possible outliers with different values of the metrics (high/low) and better capture the relevance of the metrics (see examples in Section 4.2). We could find that the two metrics have a strong positive correlation for a specific type of smell, and not for other smells. This scenario can outline the relevance of the metrics for each type of smell. Otherwise, with no correlation, we could infer that there is no link between the urgency of removing a smell and the cost of removing a smell, as computed by the proposed metrics. In this case a developer can decide not to remove an AS with low PageRank and high cost-solving, and to remove first an AS with high PageRank and low cost-solving, since this AS could become more critical because it appears in a central part of the project.

We aim with our study to provide developers insights on the evaluation of criticality and cost-solving of AS through the PageRank and Severity metrics. The Severity metric is focused on evaluating the cost-solving in terms of the number of project dependencies affected by the smells, while PageRank is more focused on the importance (criticality) of the affected components (classes/packages). Hence, both metrics could be useful to determine the prioritization of AS, i.e., help the developer in choosing which smell to refactor first depending on the developer's needs, i.e., the need to address the most critical ones first or the most expensive ones.

We have considered the two metrics in the computation of an Architectural Debt Index [11] based on the number of the AS found in a project and their criticality measured in terms of both PageRank and Severity metrics. The results of this study can be useful also to evaluate whether the two metrics truly capture different aspects of a smell or not. In the latter case, one of the two metrics could be left out.

The paper is organized through the following sections: in Section 2 we introduce some related work, in Section 3 we describe the study design, in Section 4 we provide the results we obtained to answer the RQs. Section 5 presents the discussion of the results and Section 6 outlines some threats to the validity of the work. Finally, in Section 7 we conclude our work and outline future developments.

2. Related Work

We first briefly describe some empirical studies on architectural smells.

Le et al. [12] investigated the nature and impact of architectural smells through a large empirical study, by exploiting the projects' issue trackers to analyze the impact of smells on software development; Arcelli et al. [13] studied the relationship between code smells and architectural smells and found that architectural smells are independent from code smells; Sharma et al. [14] conducted an empirical study to investigate the relationship between design and architectural smells in C# projects. Finally, Herold [15] performed a preliminary empirical study to investigate the relationship between architectural smells and architectural degradation, the latter measured through the number of architectural violations.

With respect to these previous papers, we performed an empirical study focused on the evaluation of different facets of architectural smells criticality, not previously studied in the literature to the best of our knowledge.

We now outline some related works in the literature on the evaluation of criticality and prioritization of code or architectural smells. What distinguishes the following works is the kind of information used to estimate the priority of a smell. For instance, concerning code smells, Vidal et al. [16] presented an approach to identify the most critical smells based on a combination of three criteria, namely: past component modifications, important modifiability scenarios for the system and relevance of the kind of smell. Rani et al. [17] also proposed a methodology for code smell prioritization. First, it detects smelly classes using structural information of source code, then mines change history, as done by Vidal et al., to prioritize the smells. Still regarding code smell studies, Sae Lim et al. [18] exploited the developers' context (a list of issues extracted from an issue tracking system) to define priority. Instead, Arcelli et al. [19] proposed a severity index of the smells based on how much the metric thresholds used for the smells detection are exceeded. Similarly, Guggulothu et al. [20] proposed a prioritisation approach for four code smells (Long Method, Feature Envy, God Class and Data Class), depending on their impact on design quality, where the impact is measured through a set of metrics such as coupling, size, complexity and cohesion. More recently, Pecorelli [5] proposed a machine learning approach to prioritise the application of refactoring on code smells. The approach generates a ranking of code smells according to the perceived criticality that developers assign to them.

Regarding architectural smells, there are fewer studies about prioritization. Martini et al. [1] performed a study on the analysis of the most critical AS through the feedback of the developers of two industrial projects. The smells having top refactoring priority in the opinion of practitioners are the ones with the highest negative impact on the maintainability and evolvability of the project. Along the same line, Oliveira et al. [21] investigated criteria that developers use in practice to prioritize design-relevant smelly elements with the aim to develop a set of prioritization heuristics. From their results, two out of nine heuristics reached an average precision higher than 75%. Finally, Vidal et al. [3] presented and evaluated a set of five criteria for ranking groups of code smells as indicators of architectural problems in evolving systems.

To the best of our knowledge, no extensive work has been previously done on the analysis of the evolution and correlation between criticality and cost-solving, evaluated in terms of PageRank of AS and Severity metrics.

In a previous study [10] we only manually analyzed the two metrics by considering only 6 projects. Here we extend the previous work on a large number of projects (10 projects, at least 22 versions each, for a total of 264 versions), and we analyze the correlation existing between the two metrics through Spearman and Kendall correlation tests. Moreover, we study the evolution of the metrics in the project history. Finally, in this paper we propose to exploit PageRank as a proxy for criticality, and Severity as a metric to estimate cost-solving.

3. Case Study Design

We describe below the analyzed projects, the data we collected on AS, their Severity and PageRank, and the data preparation and analysis.

3.1. Analyzed projects

We analyzed several versions of 10 projects, for a total of 264 versions (see Table 1). Most of the chosen projects were picked from the Qualitas Corpus [22]. We selected these projects since they have already been the subject of several studies, they are publicly available and enable the replication of this study. These data were also combined with data from the MavenRepository (footnote 1), also publicly available. We considered several releases for each project. To easily compare the different projects, we chose roughly the same number of versions and preferred different releases, major or minor, over patches when possible. In general, in this paper we use the term version to refer to both minors and majors. The chosen systems also vary in size and number of smells (see Table 1). In the column group "last version" we report the projects' size (in terms of classes/packages) and number of AS of the last version of the project in the development history.

Footnote 1: https://mvnrepository.com/

3.2. Data collection

Architectural smells. We performed this study by considering the AS detected with the Arcan tool (footnote 2) [23] described below, but other AS can be considered in the future [24]. We limited the analysis to the following three smells since they are the only ones for which we developed a Severity metric, in the context of the definition of our Architectural Debt Index (ADI) [11].

• Unstable Dependency (UD) describes a component (package) dependent on other components that are less stable than itself; this may cause a ripple effect of changes in the system. Instability of a component is measured with the metric proposed by Martin [25] as the ratio of outgoing dependencies to the total number of dependencies of the component. Consequences: The components with a high instability are more prone to change with respect to the more stable ones; this means that the component which depends on less stable components is forced to change along with them.
• Hub-Like Dependency (HL) arises when a component (class or package) has outgoing and incoming dependencies with a large number of other components [26]; the affected component represents a unique point of failure for the system and also a dependency bottleneck. Consequences: The component in the middle of the hub is a unique point of failure and a dependency bottleneck. Moreover, the logic inside a Hub-Like Dependency is hard to understand, and the smell causes a change ripple effect.
• Cyclic Dependency (CD) refers to a component (class or package) that is involved in a chain of relations that break the desirable acyclic nature of a component's dependency structure. Components involved in a CD cannot be reused in isolation and a change on one component propagates to the other ones. Consequences: The components involved in a dependency cycle can hardly be released, maintained or reused in isolation. Moreover, a change on one affected component will propagate towards all the other ones involved in the cycle.

Footnote 2: Download: https://drive.google.com/file/d/1WNx7FHRykbyOIxz92cDQpSL2rl_gEJ4P/view?usp=sharing

We considered these three AS because they are some of the most studied smells [27][13][11][15] and they are also perceived as important and detrimental for the quality of the software systems by practitioners [1][24]. In particular, these smells are based on dependency issues. Dependencies are of great importance in software architecture: components that are highly coupled and with a high number of dependencies are considered more critical, since they have higher maintenance costs. In particular, Cyclic Dependency is one of the most common smells and is considered the most critical smell by developers [1].

We used our Arcan tool for the AS detection, since it is publicly available, allows the considered AS to be detected easily and has been previously validated [28]. We computed (footnote 3) the PageRank and Severity metrics related to the three types of smells and we reported the "granularity level" of the considered smells, either class or package. Our distinction between AS at class and package level can be mapped to another nomenclature adopted in the literature [14] which calls "design smells" our class AS and "architectural smells" our package AS.

Footnote 3: https://figshare.com/articles/dataset/_/13636472

We now report the definition of the two metrics under analysis.

Severity is a metric that we defined for each type of AS to estimate the AS cost-solving. In particular, it evaluates different features of the smells which have an impact on the effort needed for their removal. For example, for the estimation of Hub-Like Dependency cost-solving, we consider the number of dependencies affected by the smell, because this metric gives us information about how many parts of code a developer must investigate/change/remove to refactor the HL.

Severity is computed differently for each type of AS: for UD it is evaluated through the number of bad dependencies which cause the Unstable Dependency smell, where by bad dependency we mean a reference from the affected package to the less stable packages, i.e., if package B has high instability and package A has low instability, the dependency A → B is a bad dependency; for HL the Severity corresponds to the total number of dependencies which cause the HL smell (dependencies from a class/package directed to the hub and vice-versa); for CD it is computed through the number of components involved in the cycle multiplied by the minimum number of times a cycle repeats itself. A dependency between two components can occur multiple times because we count the number of references from a class/package to the others. For instance, if there is a cycle between package A and B, caused by 5 classes belonging to A calling B, and B's classes calling A 3 times, the Severity value is equal to 3. This means that the cycle is repeated at least 3 times.

PageRank of an AS evaluates the criticality (urgency) associated to an AS. The PageRank value of a smell instance is computed as the mean value of the PageRank of the components (class or package) affected by the smell. The intuition is that components with high PageRank are important inside the project, where the importance [9] corresponds to how many parts of the project depend on the component. PageRank of a component is computed through the PageRank formula implemented by Brin and Page [8], executed on the dependency graph of the project:

PR(V) = \frac{1 - d}{N} + d \left( \sum_{k=1}^{n} \frac{PR(V_k)}{C(V_k)} \right)    (1)

where the vertex V is a node of the dependency graph associated to a project; PR(V) is the value of PageRank of the vertex V; N is the total number of AS in the project; V_k is a vertex with at least a link directed to V; n is the number of the V_k vertexes; C(V_k) is the number of links of vertex V_k; d (damping factor) is a custom factor fixed at 0.85, a default value defined by Brin and Page.

The range of the metric spans from 0 to infinity and higher values correspond to higher criticality. To associate a unique value of PageRank to a single smell instance, we compute the mean value of the PageRank scores of all the components involved in the smell. In this way, smells of any type can be ordered by this metric, from the most critical to the least critical.

Both Severity and PageRank are based on the project dependencies, however they are computed in different ways and aim to evaluate two distinct aspects: importance/criticality (for PageRank) and dependencies structure/cost-solving (for Severity). Hence, we performed a correlation analysis to investigate the possible relationship between the two metrics.

3.3. Data preparation and analysis

We ran Arcan and we pre-processed the output data in order to produce the dataset for our analysis. Other than Arcan, we exploited the Knime platform (footnote 4) and the R programming language (footnote 5) for the processing and statistical analysis of the data. The resulting dataset is a collection of 262155 smells categorized by project, version, type, granularity level, Severity and PageRank. Table 1 shows the summary of our dataset, where we report the project size and the number of smell instances, divided by type: for each project (considering all versions in history) we show the number of detected CD at class and package level (CD-Cl and CD-Pkg), of detected HL at class and package level (HL-C and HL-P), of detected UD (UD) and the sum of all project's AS (AS). A smell instance corresponds to one occurrence of the smell in the project, thus the reported numbers are the counts of all the occurrences.

We studied two different aspects: 1) Severity and PageRank evolution, in order to answer RQ1; 2) Severity and PageRank correlation to answer RQ2.

Concerning evolution, we analyzed the evolution of the two metrics for each type of smell in order to study their different behaviours. We summarised the data for each version by averaging the values of both metrics with respect to the total number of smells detected in the version. We conducted trend analysis to understand how the average values of PageRank and the different types of Severity evolve over time. We exploited the Mann-Kendall test, which is a non-parametric test to assess if there is a monotonic upward or downward trend of the variable of interest over time. The null hypothesis for this test is that there is no monotonic trend in the series. The alternate hypothesis is that a trend exists. This trend can be positive, negative, or non-null. We also analyzed the two metrics' evolution with respect to the evolution of the size, where size corresponds to the number of classes and packages of the projects under analysis, to check whether the two things are correlated. We ran Spearman and Kendall correlation tests to investigate this aspect.

Concerning the correlation analysis of PageRank and Severity, we first tested the normality of our data. Given the large size of our dataset, we used Q-Q plots [29] to evaluate if the measures do not follow a normal distribution. A Q-Q plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. These plots are often used when the dataset is large enough to introduce bias in the Shapiro-Wilk test [30], which is a commonly used normality test. The Q-Q plots of all the projects showed a non-normal behaviour. Then, we tested the correlation between Severity and PageRank for each version of the projects. We computed the correlation on the metrics data of all smell types together and also separately for each smell type. We also computed the correlation separately for each granularity level, to contextualize the results at package or class level. Given the non-normal distribution of our data, we chose the Spearman's [31] and Kendall's [32] coefficients to calculate the correlation.
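To make the PageRank definition of Section 3.2 more concrete, the following is a minimal Python sketch, not the Arcan implementation. It iterates Eq. (1) on a toy dependency graph and then averages the scores of the components involved in a smell. Several points are assumptions made only for illustration: the toy graph and its component names are hypothetical, the normalisation term N is taken as the number of nodes of the graph, C(V_k) is taken as the number of outgoing links of V_k, and a fixed number of iterations replaces a convergence check.

```python
def pagerank(dependencies, d=0.85, iterations=100):
    """Iterate Eq. (1) on a graph given as {component: [components it depends on]}."""
    nodes = sorted(set(dependencies) | {t for ts in dependencies.values() for t in ts})
    n_nodes = len(nodes)  # assumption: N is the number of nodes of the graph
    # assumption: C(V_k) is the number of outgoing links of V_k
    out_links = {v: len(dependencies.get(v, [])) for v in nodes}
    incoming = {v: [] for v in nodes}
    for source, targets in dependencies.items():
        for target in targets:
            incoming[target].append(source)
    pr = {v: 1.0 / n_nodes for v in nodes}  # uniform starting values
    for _ in range(iterations):
        pr = {
            v: (1 - d) / n_nodes + d * sum(pr[u] / out_links[u] for u in incoming[v])
            for v in nodes
        }
    return pr


def smell_pagerank(pr, components):
    """PageRank of a smell instance: mean PageRank of the components it affects."""
    return sum(pr[c] for c in components) / len(components)


# Hypothetical project: A and B form a cycle, C depends on both, D depends on C.
toy_graph = {"A": ["B"], "B": ["A"], "C": ["A", "B"], "D": ["C"]}
scores = pagerank(toy_graph)
cycle_score = smell_pagerank(scores, ["A", "B"])
```

In this toy graph the cycle between A and B receives a higher score than C or D, because more of the project depends, directly or indirectly, on the two components in the cycle.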

Footnote 4: https://www.knime.com/knime-analytics-platform
Footnote 5: https://www.r-project.org/

4. Results

We report the results both for PageRank and Severity evolution and their correlation. At the end of each section, we also report the answer to the related RQ. All the results and plots can be found in the replication package (footnote 6).

4.1. Evolution results

In order to answer RQ1, we checked the trend of PageRank and Severity values throughout the versions of the projects. For every project and for both PageRank and Severity, we ran the Mann-Kendall test. Tables 2 and 3 show the outcome of the test, reporting the Trend (increasing + or decreasing -), the P-value and, for PageRank, the Reference AS (the type of smell which the PageRank refers to) or, for Severity, the Granularity (class or package). The tables report only results where p-value < 0.05, i.e., there is a trend. We outline from Tables 2 and 3 the following remarks:

• PageRank and Severity show a trend over time in few projects. We found a PageRank trend in four out of ten projects, while Severity showed a trend in five projects. The tables only show the projects with a positive or negative trend.
• Concerning the Severity of CDs, we observed both positive and negative trends at class level, in 4 projects, and a negative trend at package level, in one project.
• Concerning the Severity of HLs, we had examples at both class and package level of positive trends.
• The Severity metric of the Unstable Dependency smell does not show a trend in any project, and we could notice only one project (Hibernate) where the PageRank of UD smells had a trend.
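As an illustration of the trend test used for these results, the sketch below implements a basic Mann-Kendall test in Python. It is a simplified version under stated assumptions, not the procedure run in the study: the variance has no correction for tied values, the p-value relies on the normal approximation (usually considered reasonable for series of about ten or more points), and the function name and the input series are hypothetical.

```python
import math


def mann_kendall(series):
    """Return (S, Z, two-sided p-value) for a monotonic trend in `series`."""
    n = len(series)
    # S counts increasing pairs minus decreasing pairs over all i < j.
    s = sum(
        (series[j] > series[i]) - (series[j] < series[i])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # assumption: no tie correction
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value under the standard normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value


# Hypothetical average Severity per version, steadily increasing.
rising = [1.0, 1.2, 1.3, 1.5, 1.8, 2.0, 2.1, 2.4, 2.6, 3.0]
s_up, z_up, p_up = mann_kendall(rising)
trend = "+" if (p_up < 0.05 and s_up > 0) else "-" if (p_up < 0.05 and s_up < 0) else "none"
```

With the 0.05 threshold used in the tables, the hypothetical rising series is reported as an increasing trend, while a constant series yields S = 0 and no trend.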

We extended our analysis to see if the project size (measured by number of classes and packages) is correlated with the values of PageRank and Severity. We tested it for each project over its development evolution. We then analyzed the distribution of the correlation on the data of all projects. The first thing we noticed is that the number of classes and packages increases over time. However, this does not happen for Severity and PageRank values: we do not find a significant correlation between size and the metrics, except for the correlation between PageRank computed on AS on packages and the number of packages in the system. The correlation values, computed for all the projects, range in [0.34, 0.89], with a median equal to 0.74. We hypothesise that the correlation is high for PageRank because of how it is computed: the higher the number of packages, the more the dependencies and the higher the PageRank values. For this reason, one may expect the same to hold for PageRank computed on classes correlated with the number of classes: instead, their correlation values range in [-0.87, 0.9] with a median equal to 0.45. This result may be due to the high variance in the number of classes among the projects (a variance which is smaller for packages).
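The two rank correlation coefficients used in this analysis can be sketched in a few lines of Python. This is a didactic version with hypothetical input values, not the R code used for the study: Spearman's coefficient is computed as the Pearson correlation of average ranks, and Kendall's coefficient is the tau-b variant, which is an assumption since the variant is not stated in the text. Both functions raise a division error on constant series.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant and tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx, dy = x[j] - x[i], y[j] - y[i]
            if dx == 0 and dy == 0:
                continue  # pairs tied in both variables are ignored
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


# Hypothetical data: number of packages and mean package-level PageRank per version.
packages = [10, 12, 15, 18, 20]
mean_pagerank = [0.12, 0.10, 0.16, 0.14, 0.20]
rho = spearman_rho(packages, mean_pagerank)
tau = kendall_tau_b(packages, mean_pagerank)
```

Kendall's coefficient is typically smaller in absolute value than Spearman's on the same data, which is consistent with the narrower Kendall ranges reported for the correlation results.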

RQ1 Answer. How do PageRank and Severity of the smells evolve in the version history of a project? In general we found that the average values of PageRank and Severity do not have a trend (neither positive nor negative) over time. Concerning the comparison with the evolution of the projects' size, we found that PageRank computed on packages shows a positive correlation with the evolution of the number of packages: this is reasonable, since the increase/decrease in the number of packages has an impact also on the creation/deletion of package dependencies, and thus on PageRank.

P-value 4.2. Correlation results is associated to the most updated codebase, hence we assume it is the most exemplary for them.

In order to answer RQ2, we report in Table 4 the re- By analyzing the correlation coeficients of JMeter’s sults of the correlation between Severity and PageRank, AS, we noticed that when they are calculated separately evaluated on all AS, not considering their type. As can for each AS type, they present higher values than the ones be seen, the majority of the projects presented a strong reported in Table 4. Using Spearman’s as an example: positive correlation ( > 0.6). 0.575 is the value by not considering the AS type and

In the following, we discuss the correlation results by considering the different types of AS. The coefficient values are bounded between:
• (CDs) 0.427 and 0.942 with Spearman's and between 0.214 and 0.812 with Kendall's;
• (UDs) 0.253 and 1 with Spearman's and between 0 and 1 with Kendall's;
• (HLs) -1 and 1 for both coefficients.

Due to their low occurrences, the metrics of HL and UD usually present a strong correlation. However, in some project versions the scarce number of detected smells makes this calculation misleading: the correlations fluctuate, being very high in some cases and very low in others.

On the other hand, CD is the most common smell in the dataset and this has an effect on the correlation values: they vary largely across the dataset, making CD the smell type with some of the highest correlation values and, at the same time, some of the lowest. However, a clear result is that for all projects the correlation at package level between PageRank and Severity of CD is strong, with the exception of JGraph (see the following paragraph).

Observations on weak and negative correlations

From Table 4 we can observe that some projects, such as JMeter, Lucene, Weka and Ant, show a weak correlation between the two metrics. To investigate these behaviours, we start by analyzing two projects: JMeter, having a weak correlation, and JGraph, showing non-positive correlation values for CDs at package level. We focus on the last version of both projects because it [...] 0.638, 0.9, 0.881 are the values for CDs, HLs and UDs respectively. The values seem to imply that, while the correlation in general is weak for this project, the two metrics tend to be positively correlated when we look at the specific smell types. However, the number of HLs and UDs in JMeter is very small compared to the number of CDs. Since correlations computed on few observations are not significant, we can conclude that only the correlation value computed on CDs is relevant for JMeter, and it explains why the overall correlation value is weak for this project.

If we closely analyze the evolution of JGraph, it initially shows a negative correlation for CDs at package level, which progressively increases (0.2 in version 5.10.0.1) and becomes strongly positive (0.73) in version 5.12.1.0. We further investigated what caused these changes in the correlation values. In the first versions with negative correlation we observed 3 CDs at package level, two of them with similar Severity and PageRank values and one with a much higher PageRank value, which is probably the cause of the negative correlation. After version 5.10.0.1 we noticed the presence of a 4th one. Its Severity was in line with the others, and so was its PageRank: this likely balanced the PageRank values and subsequently caused the increase of the positive correlation.

Hence we can conclude that the variation of the correlation values from negative to positive was due to the introduction of a new smell instance, whose metric values strongly impacted the correlation values because, as for JMeter, the overall number of smell instances is small. However, this specific case does not represent a common behaviour in our dataset.
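The sensitivity of rank correlations to a single added observation, as described for JGraph, can be illustrated with a minimal sketch. The (PageRank, Severity) pairs below are hypothetical values chosen only to mimic the described situation (three CDs, one with a much higher PageRank, then a fourth CD in line with the others); they are not taken from the study data, and the coefficients are computed in plain Python rather than with the R tooling used in the study. Kendall's coefficient is computed in its tau-a form, which coincides with tau-b when there are no ties.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


def correlations(smells):
    """Return (Spearman, Kendall) for a list of (PageRank, Severity) pairs."""
    pagerank, severity = zip(*smells)
    return spearman(pagerank, severity), kendall_tau_a(pagerank, severity)


# Hypothetical (PageRank, Severity) pairs for package-level CDs.
before = [(0.10, 0.50), (0.12, 0.55), (0.90, 0.48)]  # third CD: outlying PageRank
after = before + [(0.09, 0.45)]                      # fourth CD, in line with the others

for label, smells in (("3 CDs", before), ("4 CDs", after)):
    rho, tau = correlations(smells)
    print(f"{label}: Spearman={rho:.3f} Kendall={tau:.3f}")
# 3 CDs: Spearman=-0.500 Kendall=-0.333
# 4 CDs: Spearman=0.400 Kendall=0.333
```

With so few observations, one extra instance is enough to flip the sign of both coefficients, which is why correlations computed on a handful of smells should be read with caution.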

RQ2 Answer: Can we find some correlation between PageRank and Severity by considering each type of smell? We found that the smell type showing the highest correlation between PageRank and Severity is CD at package level. The other types, HL and UD, also showed strong correlations, but given the lower number of HL and UD instances, we consider the result regarding CDs more meaningful. We also investigated specific cases of projects with weak and negative correlation, but we did not find further insights.

5. Discussion

We found a strong correlation between PageRank and Severity. This means that, for the analysed data and the considered smells, the criticality and the cost-solving of smells go hand in hand: in this study, if a smell affects an important (unimportant) part of the system, then it will also have a high (low) cost-solving. We can outline two different interpretations of the results. The positive correlation could be due to the nature of the two metrics, both bound to the dependencies of the system. In this case, the conclusion would be that PageRank and Severity capture the same characteristic of the smells, and one of the two is redundant. As a consequence, in the ADI computation [11], only one of the two metrics should be used to evaluate AS criticality.

However, given how the metrics are defined, they differ from each other. Severity takes into account the dependencies which are directly affected by the smell, while PageRank also considers dependencies outside the smell which converge towards the components affected by the smell. Take for instance the Severity of CD, which is based on the dependencies forming the cycle and their weight. If the components involved in the cycle have a high PageRank, it means that they are involved in many dependencies with many other parts of the system, which is unrelated to the fact that those components are part of the cycle. Under this premise, the two metrics would capture different aspects of the smells, and their positive correlation could mean that critical parts of the system attract AS which are more expensive to solve.

Moreover, one could ask what difference PageRank makes when we could use simple coupling metrics such as FanIn and FanOut [25]. When evaluating the coupling of a component, such metrics take into account only the incoming or outgoing dependencies of the component itself. On the contrary, the PageRank value of a component takes into account the PageRank of all the components belonging to the dependency graph. In particular, the PageRank of a component is defined recursively and depends on the number of dependencies and the PageRank of all the components that reference it (incoming dependencies). In this way, a component having many incoming dependencies but referenced by components with few incoming dependencies is less important than another component with many incoming dependencies and referenced by other components with many incoming dependencies. That is why PageRank is said to evaluate the importance of a component with respect to the entire graph.

From our analysis it results that the positive correlation is particularly evident in the case of CD. The reasons behind the high correlation for CD Severity can be multiple: a part of code with high PageRank is subject to more changes [33] than other parts of code, and is thus more open to the introduction of (structurally complex) CDs. This is interesting because in the literature we find studies which confirm the correlation in the other direction [12], i.e., the presence of AS makes the components more prone to change: if our hypothesis can be further corroborated, the conclusion would be that the relationship between PageRank and CD Severity is like a dog chasing its tail, one triggers the other. Another reason could be that components with high PageRank are involved in a high number of dependencies, thus making it easier for a developer to wrongly introduce new entangled dependencies and create cycles that are very difficult to remove.

To conclude, there is a positive correlation between AS Severity and PageRank; however, at the moment we cannot draw a definitive conclusion about how to interpret this finding. We plan to conduct a validation of our results with developers from industry, who could evaluate the ability of the two metrics to capture criticality and cost-solving, and also manually check the specific cases where smells have high PageRank and high Severity.

6. Threats to validity

Our study presents some threats to validity which we address by following the structure suggested by Yin [34]. Concerning the construct validity, the two metrics, PageRank and Severity, may not measure what we claim they do, i.e., the criticality of the AS. However, this is a preliminary study and the next step is to validate the current definition of the metrics with developers, by letting them check whether the prioritization produced by the metrics is significant or not. Other threats regarding the internal validity could be related to the choice of the statistical methods used for the correlation analysis and their implementation in the tools we used, but we relied on very well known and widely used tools (R language). Moreover, we did not validate the two metrics by investigating the perception developers have of PageRank and Severity. However, PageRank was adopted in other studies as a software ranking metric [35][33][36], and we plan to validate Severity in an industrial setting in the future. Threats to external validity could be caused by the fact that we only analyzed projects written in Java and publicly available. However, we partially mitigate such issues by analyzing 10 projects with more than 22 versions each. Moreover, the high number of CDs could have reduced the effect of the other types of detected AS in the results. We could have mitigated this aspect by sampling the CD instances and thus balancing the dataset. However, this would additionally reduce the size of the dataset, undermining the validity of the CD results too. In the future, we aim to extend the study with additional data for the smells and further remedy this threat. Finally, concerning threats to the reliability of the study, Arcan could be subject to a systematic bias in the detection, partially mitigated by the provided replication package and the fact that the tool has been validated on open source and industrial projects [23] [28] [1] [24]. Moreover, some threats could occur due to errors in the data extraction and preparation phases, resulting in errors in the construction of the dataset. However, we carefully checked every stage of the data preparation and relied on the support of Knime.⁷

7. Conclusion

We performed an empirical analysis, on at least 22 versions of each of 10 projects, of two software metrics, Severity and PageRank, in order to evaluate the cost-solving and criticality of AS. We also performed this evaluation to better understand whether both metrics have to be used in the ADI computation, i.e., whether they provide hints on the criticality of AS that both have to be taken into consideration. To conclude, from the analysis of the evolution and correlation of PageRank and Severity we found that the two metrics tend to be correlated, except for some extreme cases. It could be useful for developers to analyze the specific cases where AS have high PageRank and low Severity (and vice-versa), since they could indicate smell instances which require a tailored prioritization rationale: developers may be interested in identifying cases where the smell is easy to solve (low Severity) but lies in an important part of the system (high PageRank), and choose to refactor these cases first; on the contrary, they could decide not to refactor a smell that is difficult to solve (high Severity) and lies in an unimportant (low PageRank) part of the system. We can assert that such smells are a signal that both PageRank and Severity could be useful to define different refactoring priorities, from different points of view. In particular, PageRank can be used to identify parts of code which need continuous inspection, while Severity can be used to evaluate the cost-solving of the AS removal.

The smell type presenting the strongest correlation is CD, suggesting that highly critical components (with high PageRank) attract CDs that are hard to solve (with high Severity). Thus, developers should pay close attention to the CD smell, also because CD is the most common AS and, in particular, CDs at package level tend to become more critical in terms of PageRank over the history of the project development. However, we do not exclude the possibility that the two metrics are strongly correlated because they capture the same aspects of smells. In that case, we could exploit this information to refine the computation of our ADI and leave out one of the two.

In any case, we need to conduct a validation of both metrics and of the correlation results, with expert developers or by comparing the ranking provided by the metrics with information coming from issue trackers [12]. The intuition is that a component affected by a critical smell (with high PageRank and high Severity) should also be involved in many issues. In addition to the validation, in future developments we aim to extend this work by analyzing more projects, also coming from industry, and verify whether the same results can be confirmed.

In this paper, we addressed the criticality evaluation of three AS, but the study can also be extended to other kinds of AS, e.g., Scattered Functionality and Feature Concentration, two smells which violate the separation of concerns principle. Given that such smells are not based on dependency issues, we shall define additional criticality metrics for them.
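The contrast drawn in the Discussion between FanIn and PageRank can be made concrete with a small sketch. It assumes the standard PageRank formulation of Brin and Page [8] with a damping factor of 0.85, computed by power iteration with the rank of components without outgoing dependencies spread uniformly; whether this matches the exact variant used in the study is an assumption. The dependency graph, the component names and the helper functions `fan_in` and `pagerank` are hypothetical and serve only as illustration.

```python
def fan_in(edges):
    """Count incoming dependencies per component; edges are (dependent, dependency)."""
    counts = {node: 0 for edge in edges for node in edge}
    for _, target in edges:
        counts[target] += 1
    return counts


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed dependency graph."""
    nodes = sorted({node for edge in edges for node in edge})
    outgoing = {node: [] for node in nodes}
    for source, target in edges:
        outgoing[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by components with no outgoing dependencies is shared by all.
        dangling = sum(rank[node] for node in nodes if not outgoing[node])
        updated = {
            node: (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
            for node in nodes
        }
        # Each component passes its rank, equally split, to what it depends on.
        for source in nodes:
            for target in outgoing[source]:
                updated[target] += damping * rank[source] / len(outgoing[source])
        rank = updated
    return rank


# Hypothetical dependency graph: "core" and "util" both have FanIn = 2,
# but the components referencing "core" are themselves referenced by others,
# while the components referencing "util" have no incoming dependencies.
edges = [
    ("a", "core"), ("b", "core"),
    ("c", "a"), ("d", "a"), ("e", "b"), ("f", "b"),
    ("g", "util"), ("h", "util"),
]

fan = fan_in(edges)
pr = pagerank(edges)
for name in ("core", "util"):
    print(f"{name}: FanIn={fan[name]} PageRank={pr[name]:.3f}")
```

In this toy graph FanIn cannot distinguish the two components, whereas PageRank ranks `core` above `util` because it also accounts for the importance of the components that reference it.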

⁷ https://www.knime.com/knime-analytics-platform

References

[1] Martini, Arcelli Fontana, Biaggi, Roveda, Identifying and prioritizing architectural debt through architectural smells: a case study in a large software company, in: Proc. of the European Conf. on Software Architecture (ECSA), Springer, 2018.
[2] N. A. Ernst, Bellomo, I. Ozkaya, R. L. Nord, I. Gorton, Measure it? manage it? ignore it? software practitioners and technical debt, in: Proc. of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, 2015.
[3] Vidal, Oizumi, Garcia, A. Díaz Pace, Marcos, Ranking architecturally critical agglomerations of code smells, Science of Computer Programming 182 (2019) 64–85.
[4] Taibi, Janes, Lenarduzzi, How developers perceive smells in source code: A replicated study, Information and Software Technology 92 (2017) 223–235.
[5] Pecorelli, Palomba, Khomh, A. De Lucia, Developer-driven code smell prioritization, in: Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20, ACM, 2020.
[6] L. Rizzi, F. A. Fontana, R. Roveda, Support for architectural smell refactoring, in: Proceedings of the 2nd International Workshop on Refactoring, IWoR@ASE, 2018, pp. 7–10.
[7] I. Pigazzini, F. A. Fontana, B. Walter, A study on correlations between architectural smells and design patterns, J. Syst. Softw. (2021).
[8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Seventh International World-Wide Web Conference, 1998.
[9] I. Şora, A pagerank based recommender system for identifying key classes in software systems, in: 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics, 2015.
[10] F. A. Fontana, I. Pigazzini, C. Raibulet, S. Basciano, R. Roveda, Pagerank and criticality of architectural smells, in: Proceedings of the 13th European Conference on Software Architecture, ECSA 2019, 2019.
[11] F. A. Fontana, P. Avgeriou, I. Pigazzini, R. Roveda, A study on architectural smells prediction, in: 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2019.
[12] D. M. Le, D. Link, A. Shahbazian, N. Medvidovic, An empirical study of architectural decay in open-source software, in: 2018 IEEE International Conference on Software Architecture (ICSA), 2018.
[13] F. A. Fontana, V. Lenarduzzi, R. Roveda, D. Taibi, Are architectural smells independent from code smells? an empirical study, Journal of Systems and Software 154 (2019) 139–156.
[14] T. Sharma, P. Singh, D. Spinellis, An empirical investigation on the relationship between design and architecture smells, Empirical Software Engineering (2020).
[15] S. Herold, An initial study on the association between architectural smells and degradation, in: Software Architecture, Springer International Publishing, Cham, 2020, pp. 193–201.
[16] J. A. D. P. Santiago A. Vidal, Claudia Marcos, An approach to prioritize code smells for refactoring, Autom. Softw. Eng. 23 (2016) 501–532.
[17] A. Rani, J. K. Chhabra, Prioritization of smelly classes: A two phase approach (reducing refactoring efforts), in: 2017 3rd International Conference on Computational Intelligence Communication Technology (CICT), 2017.
[18] N. Sae-Lim, S. Hayashi, M. Saeki, Context-based approach to prioritize code smells for refactoring, Journal of Software: Evolution and Process (2017).
[19] F. A. Fontana, M. Zanoni, Code smell severity classification using machine learning techniques, Knowl. Based Syst. 128 (2017).
[20] T. Guggulothu, S. A. Moiz, An approach to suggest code smell order for refactoring, in: International Conference on Emerging Technologies in Computer Engineering, Springer, 2019, pp. 250–260.
[21] A. Oliveira, L. Sousa, W. Oizumi, A. Garcia, On the prioritization of design-relevant smelly elements: A mixed-method, multi-project study, in: Proceedings of the XIII Brazilian Symposium on Software Components, Architectures, and Reuse, SBCARS '19, Association for Computing Machinery, 2019.
[22] R. Terra, L. F. Miranda, M. T. Valente, R. S. Bigonha, Qualitas.class Corpus: A compiled version of the Qualitas Corpus, Software Engineering Notes 38 (2013).
[23] F. A. Fontana, I. Pigazzini, R. Roveda, M. Zanoni, Automatic detection of instability architectural smells, in: 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, 2016.
[24] F. A. Fontana, F. Locatelli, I. Pigazzini, P. Mereghetti, An architectural smell evaluation in an industrial context, ICSEA 2020 (2020) 78.
[25] R. C. Martin, Object oriented design quality metrics: An analysis of dependencies, ROAD 2 (1995).
[26] G. Suryanarayana, G. Samarthyam, T. Sharma, Refactoring for Software Design Smells, 1 ed., Morgan Kaufmann, 2015.
[27] D. Sas, P. Avgeriou, F. A. Fontana, Investigating instability architectural smells evolution: An exploratory case study, in: Int. Conference on Software Maintenance and Evolution, ICSME, 2019.
[28] F. Arcelli Fontana, I. Pigazzini, R. Roveda, D. A. Tamburri, M. Zanoni, E. D. Nitto, Arcan: A tool for architectural smells detection, in: Int'l Conf. Software Architecture (ICSA 2017) Workshops, 2017.
[29] M. B. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data, Biometrika 55 (1968) 1–17.
[30] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (1965) 591–611.
[31] C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15 (1904) 72–101.
[32] M. Kendall, J. Gibbons, Rank Correlation Methods, Charles Griffin Book, E. Arnold, 1990.
[33] R. Wang, R. Huang, B. Qu, Network-based analysis of software change propagation, The Scientific World Journal 2014 (2014).
[34] R. Yin, Case Study Research: Design and Methods, Applied Social Research Methods, SAGE Publications, 2009.
[35] F. Perin, L. Renggli, J. Ressia, Ranking software artifacts, in: 4th Workshop on FAMIX and Moose in Reengineering (FAMOOSr 2010), volume 120, Citeseer, 2010.
[36] W.-f. Pan, B. Li, Y.-t. Ma, B. Jiang, Identifying the key packages using weighted pagerank algorithm, Acta Electronica Sinica 42 (2014) 2174.