<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Two different facets of architectural smells criticality: an empirical study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ilaria Pigazzini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Foppiani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Arcelli Fontana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano - Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Architectural smells (AS) represent symptoms of problems at the architectural level that have an impact on architectural debt. It is important to identify the most critical ones, so that developers can prioritize them for removal. In order to evaluate the criticality of AS, in this paper we consider two facets: the PageRank metric, to assess the centrality of a smell in a project, and Severity, a metric to estimate the cost-solving of smells. We proposed these two metrics in a previous work, and here we perform an empirical analysis of the evolution and correlation of these metrics in the version history of 10 projects (at least 22 versions each, 264 versions in total). The analysis of the evolution is useful in order to identify which architectural smell types tend to become more critical. The analysis of the correlation is useful to study whether the criticality of a smell has an influence on how much it costs to remove it, and vice-versa.</p>
      </abstract>
      <kwd-group>
        <kwd>Architectural Smells</kwd>
        <kwd>Architectural Debt</kwd>
        <kwd>Architectural Smells criticality</kwd>
        <kwd>Architectural Smells evolution</kwd>
        <kwd>Empirical study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Architectural debt can be monitored through different issues, such as the presence of architectural smells in a project. Architectural smells (AS) are design decisions that negatively impact internal software qualities and are symptoms of architectural debt [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Software systems affected by AS are difficult to maintain and evolve, hence it is important to study them and identify solutions to support developers in their removal, in particular the removal of the most critical ones (AS prioritization).
      </p>
      <p>In such terms, the criticality of an AS models the degree of removal urgency associated with the AS, i.e., the smell should be removed as soon as possible because it affects a part of the project which is important for the developers (e.g., frequently changed or highly referenced) or has a strong impact on the maintainability of the project.</p>
      <p>
        However, it is not trivial to model and evaluate the importance and urgency of the removal of an AS. In the literature, the identification of the best metrics to be used for the evaluation of criticality is considered a complex task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], mainly because it is tightly connected to how smells are perceived by developers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and such perception is subject to many variables, such as the developer experience, code ownership [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], whether the smell is located in a central part of the project and other facets. Moreover, while criticality gives us information about the removal urgency, there is another aspect connected to the removal of smells which can be considered and quantified. AS have a cost-solving (cost of fixing, cost of refactoring), which is the effort needed to remove a smell from the system [6]. This variable depends less on the perception of the developers and more on the specific characteristics of the AS in question.
      </p>
      <p>
        To summarize, during AS management developers can take into consideration two distinct aspects of smells: their criticality, i.e., how important it is to remove them as soon as possible (urgency), and their cost-solving, i.e., how much it costs to remove them. Both criticality and cost-solving are particularly relevant for developers when making decisions about AS management: for instance, to choose which smell to refactor first [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A developer may prefer to refactor first the smells which require less time to be solved (low cost-solving) to quickly enhance the quality level of the project, instead of fixing the most critical ones. On the other hand, the developer may decide to remove the most difficult/critical ones, but to make this decision, different factors must be considered: it can be too expensive and risky; too many changes could compromise other parts. Perhaps the most difficult AS was created by design choice and no better solution is available, as in the case of cycles created by callbacks for event listeners in GUI components [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][7]. Finally, the most critical AS could appear in a non-central part of the project, such as a deprecated, unessential package, and might not be of interest to the developers.
      </p>
      <p>In this paper, we consider two metrics, PageRank and Severity, and we propose to use them to model the criticality (PageRank) and the cost-solving (Severity) of three AS based on dependency issues, namely: Cyclic Dependency, Unstable Dependency, and Hub-Like Dependency (see Section 3.2). PageRank, inspired by the well-known metric from Brin and Page [8], is a measure that estimates whether an AS is located in an important part of the project [9], where the importance is evaluated according to how many parts of a project depend on the ones involved in the AS (as a sort of centrality measure of the AS). We want to use PageRank as a proxy of AS criticality, i.e., the higher the PageRank, the higher the criticality of the AS. Severity, defined by us, is a measure associated with each specific type of AS and is computed through the metrics used to detect each smell. Our idea is that the AS characteristics, such as the number of system dependencies it affects, are useful to estimate how much effort is required to refactor the smell (cost-solving), e.g., a smell which involves many dependencies will require a deep analysis and a lot of time to be solved.</p>
      <p>We have considered these two metrics in a previous study [10], where PageRank and Severity were evaluated on only 6 single-version projects. We have now extended the study by conducting an empirical evaluation on a total of 264 versions of 10 projects, with the aim to empirically study criticality and cost-solving during the evolution of the projects and to investigate whether there is a correlation between the trends of the two metrics, to answer the following Research Questions (RQ):</p>
      <p>RQ1: How do PageRank and Severity of the smells evolve in the version history of a project?</p>
      <p>RQ2: Can we find some correlation between PageRank and Severity by considering each type of smell?</p>
      <p>The answer to RQ1 aims to analyze whether the values of the two metrics tend to increase or decrease in the version history of the projects. Moreover, we are interested in understanding which AS type(s) tend to become more critical and/or difficult to remove in the version history of a project, where the criticality is evaluated through the PageRank and the cost-solving is estimated with the Severity metric. In this way a developer can decide to focus the attention on these types of smells first.</p>
      <p>The answer to RQ2 allows us to evaluate the correlation between the criticality and the cost-solving of a smell. If, for example, the values tend to go together (highly correlated) for a specific type of AS, it means that when the smell is critical, it is also hard to remove and vice-versa: in this case, the two metrics would produce the same ranking of smells, i.e., the prioritization of the smells would be the same using either of the two metrics. In case of positive correlation, it could still be interesting to analyze possible outliers with different values of the metrics (high/low) and better capture the relevance of the metrics (see examples in Section 4.2). We could find that the two metrics have a strong positive correlation for a specific type of smell, and not for other smells. This scenario can outline the relevance of the metrics for each type of smell. Otherwise, if there is no correlation, we could infer that there is no link between the urgency of removing a smell and the cost of removing it, as computed by the proposed metrics. In this case a developer can decide not to remove an AS with low PageRank and high cost-solving, and to remove first an AS with high PageRank and low cost-solving, since this AS could become more critical because it appears in a central part of the project.</p>
      <p>With our study we aim to provide developers with insights on the evaluation of criticality and cost-solving of AS through the PageRank and Severity metrics. The Severity metric is focused on evaluating the cost-solving in terms of the number of project dependencies affected by the smells, while PageRank is more focused on the importance (criticality) of the affected components (classes/packages). Hence, both metrics could be useful to determine the prioritization of AS, i.e., help the developer in choosing which smell to refactor first depending on the developer’s needs, i.e., the need to address the most critical ones first or the most expensive ones.</p>
      <p>We have considered the two metrics in the computation of an Architectural Debt Index [11] based on the number of the AS found in a project and their criticality measured in terms of both PageRank and Severity metrics. The results of this study can also be useful to evaluate whether the two metrics truly capture different aspects of a smell or not. In the latter case, one of the two metrics could be left out.</p>
      <p>The paper is organized through the following sections: in Section 2 we introduce some related work, in Section 3 we describe the study design, in Section 4 we provide the results we obtained to answer the RQs. Section 5 presents the discussion of the results and Section 6 outlines some threats to the validity of the work. Finally, in Section 7 we conclude our work by outlining some threats to validity and future developments.</p>
      <p>MSR4SA’21: 1st International Workshop on Mining Software Repositories for Software Architecture, September 15–17, 2021, Virtual. Email: i.pigazzini@campus.unimib.it (I. Pigazzini); d.foppiani@campus.unimib.it (D. Foppiani); arcelli@disco.unimib.it (F. A. Fontana). ORCID: 0000-0003-2629-6762 (I. Pigazzini); 0000-0002-1195-530X (F. A. Fontana). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
    </sec>
    <sec>
      <title>2. Related Work</title>
      <p>We first briefly describe some empirical studies on architectural smells. Le et al. [12] investigated the nature and impact of architectural smells through a large empirical study, by exploiting the projects’ issue trackers to analyze the impact of smells on software development; Arcelli et al. [13] studied the relationship between code smells and architectural smells and found that architectural smells are independent from code smells; Sharma et al. [14] conducted an empirical study to investigate the relationship between design and architectural smells in C# projects. Finally, Herold [15] performed a preliminary empirical study to investigate the relationship between architectural smells and architectural degradation, the latter measured through the number of architectural violations.</p>
      <p>With respect to these previous papers, we performed an empirical study focused on the evaluation of different facets of architectural smells criticality, not previously studied in the literature to the best of our knowledge.</p>
      <p>
        We now outline some related works in the literature on the evaluation of criticality and prioritization of code or architectural smells. What distinguishes the following works is the kind of information used to estimate the priority of a smell. For instance, concerning code smells, Vidal et al. [16] presented an approach to identify the most critical smells based on a combination of three criteria, namely: past component modifications, important modifiability scenarios for the system and relevance of the kind of smell. Rani et al. [17] also proposed a methodology for code smell prioritization. First, it detects smelly classes using structural information of source code, then mines change history, as done by Vidal et al., to prioritize the smells. Still concerning code smell studies, Sae Lim et al. [18] exploited the developers’ context (a list of issues extracted from an issue tracking system) to define priority. Instead, Arcelli et al. [19] proposed a severity index of the smells based on how much the metric thresholds used for the smells detection are exceeded. Similarly, Guggulothu et al. [20] proposed a prioritisation approach for four code smells (Long Method, Feature Envy, God Class and Data Class), depending on their impact on design quality, where the impact is measured through a set of metrics such as coupling, size, complexity and cohesion. More recently, Pecorelli [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a machine learning approach to prioritise the application of refactoring on code smells. They generated a ranking of code smells according to the perceived criticality that developers assign to them.
      </p>
      <p>
        Concerning architectural smells, there are fewer studies about prioritization. Martini et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] performed a study on the analysis of the most critical AS through the feedback of the developers of two industrial projects. The smells having top refactoring priority in the opinion of practitioners are the ones with the highest negative impact on the maintainability and evolvability of the project. Along the same line, Oliveira et al. [21] investigated criteria that developers use in practice to prioritize design-relevant smelly elements, with the aim to develop a set of prioritization heuristics. From their results, two out of nine heuristics reached an average precision higher than 75%. Finally, Vidal et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] presented and evaluated a set of five criteria for ranking groups of code smells as indicators of architectural problems in evolving systems.
      </p>
      <p>To the best of our knowledge, no extensive work has been previously done on the analysis of the evolution and correlation between criticality and cost-solving, evaluated in terms of PageRank of AS and Severity metrics.</p>
      <p>In a previous study [10] we only manually analyzed the two metrics by considering only 6 projects. Here we extend the previous work on a large number of projects (10 projects, 22 versions each, for a total of 264 versions), and we analyze the correlation existing between the two metrics through Spearman and Kendall correlation tests. Moreover, we study the evolution of the metrics in the project history. Finally, in this paper we propose to exploit PageRank as a proxy for criticality, and Severity as a metric to estimate cost-solving.</p>
    </sec>
    <sec>
      <title>3. Case Study Design</title>
      <p>We describe below the analyzed projects, the data we collected on AS, their Severity and PageRank, and the data preparation and analysis.</p>
      <sec>
        <title>3.1. Analyzed projects</title>
        <p>We analyzed several versions of 10 projects, for a total of 264 versions (see Table 1). Most of the chosen projects were picked from the Qualitas Corpus [22]. We selected these projects since they have already been the subject of several studies, they are publicly available and they enable the replication of this study. These data were also combined with data from the MavenRepository<sup>1</sup>, also publicly available. We considered several releases for each project. To easily compare the different projects, we chose roughly the same number of versions and preferred different releases, major or minor, over patches when possible. In general, in this paper we use the term version to refer to both minors and majors. The chosen systems also vary in size and number of smells (see Table 1). In the column group last version we report the projects’ size (in terms of classes/packages) and number of AS of the last version of the project in the development history.</p>
      </sec>
      <sec>
        <title>3.2. Data collection</title>
        <p><bold>Architectural smells.</bold> We performed this study by considering the AS detected with the Arcan tool<sup>2</sup> [23] described below, but other AS can be considered in the future [24]. We limited the analysis to the following three smells since they are the only ones for which we developed a Severity metric, in the context of the definition of our Architectural Debt Index (ADI) [11].</p>
        <p>• Unstable Dependency (UD) describes a component (package) dependent on other components that are less stable than itself; this may cause a ripple effect of changes in the system. Instability of a component is measured with the metric proposed by Martin [25] as the ratio of outgoing dependencies to the total number of dependencies of the component. Consequences: the components with a high instability are more prone to change than the more stable ones; this means that a component which depends on less stable components is forced to change along with them.</p>
        <p>• Hub-Like Dependency (HL) arises when a component (class or package) has outgoing and incoming dependencies with a large number of other components [26]; the affected component represents a unique point of failure for the system and also a dependency bottleneck. Consequences: the component in the middle of the hub is a unique point of failure and a dependency bottleneck. Moreover, the logic inside a Hub-Like Dependency is hard to understand, and the smell causes a change ripple effect.</p>
        <p>• Cyclic Dependency (CD) refers to a component (class or package) that is involved in a chain of relations that break the desirable acyclic nature of a component’s dependency structure. Components involved in a CD cannot be reused in isolation, and a change on one component propagates to the other ones. Consequences: the components involved in a dependency cycle can hardly be released, maintained or reused in isolation. Moreover, a change on one affected component will propagate towards all the other ones involved in the cycle.</p>
        <p>
          We considered these three AS because they are some of the most studied smells [27][13][11][15] and they are also perceived as important and detrimental for the quality of the software systems by practitioners [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ][24]. In particular, these smells are based on dependency issues. Dependencies are of great importance in software architecture: components that are highly coupled and with a high number of dependencies are considered more critical, since they have higher maintenance costs. In particular, Cyclic Dependency is one of the most common smells and is considered the most critical smell by developers [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>We used our Arcan tool for the AS detection, since it is publicly available, allows the considered AS to be detected easily and has been previously validated [28]. We computed<sup>3</sup> the PageRank and Severity metrics related to the three types of smells and we reported the “granularity level” of the considered smells, either class or package. Our distinction between AS at class and package level can be mapped to another nomenclature adopted in the literature [14], which calls “design smells” our class AS and “architectural smells” our package AS.</p>
        <p>We now report the definition of the two metrics under analysis.</p>
        <p>Severity is a metric that we defined for each type of AS to estimate the AS cost-solving. In particular, it evaluates different features of the smells which have an impact on the effort needed for their removal. For example, for the estimation of Hub-Like Dependency cost-solving, we consider the number of dependencies affected by the smell, because this metric gives us information about how many parts of code a developer has to investigate/change/remove to refactor the HL.</p>
        <p>Severity is computed differently for each type of AS: for UD it is evaluated through the number of bad dependencies which cause the Unstable Dependency smell, where by bad dependency we mean a reference from the affected package to a less stable package, i.e., if package B has high instability and package A has low instability, the dependency A → B is a bad dependency; for HL the Severity corresponds to the total number of dependencies which cause the HL smell (dependencies from a class/package directed to the hub and vice-versa); for CD it is computed through the number of components involved in the cycle multiplied by the minimum number of times a cycle repeats itself. A dependency between two components can occur multiple times because we count the number of references from a class/package to the others. For instance, if there is a cycle between packages A and B, caused by 5 classes belonging to A calling B, and B’s classes calling A 3 times, the Severity value is equal to 3. This means that the cycle is repeated at least 3 times.</p>
        <p>
          PageRank of an AS evaluates the criticality (urgency) associated with an AS. The PageRank value of a smell instance is computed as the mean value of the PageRank of the components (class or package) affected by the smell. The intuition is that components with high PageRank are important inside the project, where the importance [9] corresponds to how many parts of the project depend on the component. PageRank of a component is computed through the PageRank formula implemented by Brin and Page [8], executed on the dependency graph of the project:
          <disp-formula>
            <label>(1)</label>
            <tex-math><![CDATA[PR(V) = \frac{1 - d}{N} + d \left( \sum_{i=1}^{k} \frac{PR(V_i)}{C(V_i)} \right)]]></tex-math>
          </disp-formula>
          where the vertex <italic>V</italic> is a node of the dependency graph associated with a project; <italic>PR</italic>(<italic>V</italic>) is the value of PageRank of the vertex <italic>V</italic>; <italic>N</italic> is the total number of AS in the project; <italic>V<sub>i</sub></italic> is a vertex with at least a link directed to <italic>V</italic>; <italic>k</italic> is the number of the <italic>V<sub>i</sub></italic> vertexes; <italic>C</italic>(<italic>V<sub>i</sub></italic>) is the number of links of vertex <italic>V<sub>i</sub></italic>; <italic>d</italic> (damping factor) is a custom factor fixed at 0.85, a default value defined by Brin and Page.
        </p>
        <p>The range of the metric spans from 0 to infinity, and higher values correspond to higher criticality. To associate a unique value of PageRank with a single smell instance, we compute the mean value of the PageRank scores of all the components involved in the smell. In this way, smells of any type can be ordered by this metric, from the most critical to the least critical.</p>
        <p>Both Severity and PageRank are based on the project dependencies; however, they are computed in different ways and aim to evaluate two distinct aspects: importance/criticality (for PageRank) and dependency structure/cost-solving (for Severity). Hence, we performed a correlation analysis to investigate the possible relationship between the two metrics.</p>
      </sec>
      <sec>
        <title>3.3. Data preparation and analysis</title>
        <p>We ran Arcan and pre-processed the output data in order to produce the dataset for our analysis. Other than Arcan, we exploited the Knime platform<sup>4</sup> and the R programming language<sup>5</sup> for the processing and statistical analysis of the data. The resulting dataset is a collection of 262155 smells categorized by project, version, type, granularity level, Severity and PageRank. Table 1 shows the summary of our dataset, where we report the project size and the number of smell instances, divided by type: for each project (considering all versions in history) we show the number of detected CD at class and package level (CD-Cl and CD-Pkg), of detected HL at class and package level (HL-C and HL-P), of detected UD (UD) and the sum of all the project’s AS (AS). A smell instance corresponds to one occurrence of the smell in the project, thus the reported numbers are the counts of all the occurrences.</p>
        <p>We studied two different aspects: 1) Severity and PageRank evolution, in order to answer RQ1; 2) Severity and PageRank correlation, to answer RQ2.</p>
        <p>Concerning evolution, we analyzed the evolution of the two metrics for each type of smell in order to study their different behaviours. We summarised the data for each version by averaging the values of both metrics with respect to the total number of smells detected in the version. We conducted trend analysis to understand how the average values of PageRank and the different types of Severity evolve over time. We exploited the Mann-Kendall test, which is a non-parametric test to assess if there is a monotonic upward or downward trend of the variable of interest over time. The null hypothesis for this test is that there is no monotonic trend in the series. The alternate hypothesis is that a trend exists. This trend can be positive, negative, or non-null. We also analyzed the two metrics’ evolution with respect to the evolution of the size, where size corresponds to the number of classes and packages of the projects under analysis, to check whether the two are correlated. We ran Spearman and Kendall correlation tests to investigate this aspect.</p>
        <p>Concerning the correlation analysis of PageRank and Severity, we first tested the normality of our data. Given the large size of our dataset, we used Q-Q plots [29] to evaluate whether the measures follow a normal distribution. A Q-Q plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. These plots are often used when the dataset is large enough to introduce bias in the Shapiro-Wilk test [30], which is a commonly used normality test. The Q-Q plots of all the projects showed a non-normal behaviour. Then, we tested the correlation between Severity and PageRank for each version of the projects. We computed the correlation on the metrics data of all smell types together and also separately for each smell type. We also computed the correlation separately for each granularity level, to contextualize the results at package or class level. Given the non-normal distribution of our data, we chose Spearman’s [31] and Kendall’s [32] coefficients to calculate the correlation.</p>
      </sec>
      <sec id="sec-1-1">
        <title>Footnotes</title>
        <p>1. https://mvnrepository.com/</p>
        <p>2. Download: https://drive.google.com/file/d/1WNx7FHRykbyOIxz92cDQpSL2rl_gEJ4P/view?usp=sharing</p>
        <p>3. https://figshare.com/articles/dataset/_/13636472</p>
        <p>4. https://www.knime.com/knime-analytics-platform</p>
        <p>5. https://www.r-project.org/</p>
      </sec>
    </sec>
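    <sec>
      <p>The metric definitions of Section 3.2 can be illustrated with a minimal sketch. It is not Arcan's implementation: the dependency graph, the function names and the constant EDGES are hypothetical, N is taken to be the number of vertices in the graph, and C(V) is taken to be the number of outgoing links of V; both are assumptions. The sketch iterates the PageRank formula, averages the scores over the components of a smell, and counts dependencies for the UD and HL Severity.</p>
      <preformat><![CDATA[
```python
def pagerank(edges, d=0.85, iterations=100):
    """Iterate PR(V) = (1 - d) / N + d * sum(PR(Vi) / C(Vi)) over the graph."""
    edges = set(edges)  # ignore repeated references between the same pair
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)  # assumption: N is the number of vertices in the graph
    # assumption: C(V) is the number of outgoing links of V
    out_degree = {v: sum(1 for src, _ in edges if src == v) for v in nodes}
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        pr = {
            v: (1 - d) / n
            + d * sum(pr[src] / out_degree[src] for src, dst in edges if dst == v)
            for v in nodes
        }
    return pr


def smell_pagerank(components, pr):
    """PageRank of a smell instance: mean PageRank of the affected components."""
    return sum(pr[c] for c in components) / len(components)


def instability(component, edges):
    """Martin's instability: outgoing dependencies / all dependencies."""
    outgoing = sum(1 for src, _ in edges if src == component)
    incoming = sum(1 for _, dst in edges if dst == component)
    total = outgoing + incoming
    return outgoing / total if total else 0.0


def ud_severity(package, edges):
    """Count 'bad' dependencies: edges from `package` to less stable packages."""
    own = instability(package, edges)
    return sum(
        1 for src, dst in edges
        if src == package and instability(dst, edges) > own
    )


def hl_severity(hub, edges):
    """Count the dependencies entering or leaving the hub."""
    return sum(1 for src, dst in edges if hub in (src, dst))


# Hypothetical package dependency graph (source, target), invented for illustration.
EDGES = [("ui", "core"), ("api", "core"), ("api", "util"),
         ("core", "util"), ("util", "core")]

if __name__ == "__main__":
    ranks = pagerank(EDGES)
    print("PageRank of the core/util cycle:", smell_pagerank(["core", "util"], ranks))
    print("UD Severity of core:", ud_severity("core", EDGES))
    print("HL Severity of core:", hl_severity("core", EDGES))
```
]]></preformat>
      <p>In this toy graph the packages in the core/util cycle receive most of the incoming dependencies, so the cycle gets a higher mean PageRank than a smell located on ui and api would. The CD Severity is left out of the sketch because it depends on how repeated references along a cycle are counted.</p>
    </sec>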
    <sec id="sec-2">
      <title>4. Results</title>
      <sec id="sec-2-1">
        <p>We report the results both for PageRank and Severity evolution and for their correlation. At the end of each section, we also report the answer to the related RQ. All the results and plots can be found in the replication package<sup>6</sup>.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. Evolution results</title>
        <p>In order to answer RQ1, we checked the trend of PageRank and Severity values throughout the versions of the projects. For every project and for both PageRank and Severity, we ran the Mann-Kendall test. Tables 2 and 3 show the outcome of the test, reporting the Trend (increasing + or decreasing -), the P-value and, for PageRank, the Reference AS (the type of smell which the PageRank refers to), while for Severity the Granularity (class or package). The tables report only results where p-value &lt; 0.05, i.e., there is a trend. From Tables 2 and 3 we outline the following remarks:</p>
        <p>• PageRank and Severity show a trend over time in few projects. We found a PageRank trend in four out of ten projects, while Severity showed a trend in five projects. The tables only show the projects with a positive or negative trend.</p>
        <p>• Concerning the Severity of CDs, we observed both positive and negative trends at class level, in 4 projects, and a negative trend at package level, in one project.</p>
        <p>• Concerning the Severity of HLs, we had examples of positive trends at both class and package level.</p>
        <p>• The Severity metric of the Unstable Dependency smell does not show a trend in any project, and we could notice only one project (Hibernate) where the PageRank of UD smells had a trend.</p>
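        <p>As an illustration of the trend test used here, the following is a minimal sketch of the Mann-Kendall test in plain Python. It is not the procedure actually run for the study (the analysis was performed with Knime and R); the function name mann_kendall and the 0.05 default threshold are choices made for this sketch, and the p-value relies on the usual normal approximation, which is only reasonable for series of roughly ten or more versions.</p>
        <preformat><![CDATA[
```python
import math


def mann_kendall(series, alpha=0.05):
    """Return (trend, p_value); trend is "+", "-" or "none"."""
    n = len(series)
    # S statistic: number of increasing pairs minus number of decreasing pairs.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s += (series[j] > series[i]) - (series[j] < series[i])
    # Variance of S under the null hypothesis, corrected for tied values.
    tie_sizes = {}
    for value in series:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    variance = n * (n - 1) * (2 * n + 5)
    for t in tie_sizes.values():
        variance -= t * (t - 1) * (2 * t + 5)
    variance /= 18.0
    # Continuity-corrected normal approximation of S.
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif s < 0:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    if p_value < alpha and s != 0:
        return ("+" if s > 0 else "-"), p_value
    return "none", p_value


if __name__ == "__main__":
    # Hypothetical per-version averages of a metric, invented for illustration.
    averages = [1.0, 1.2, 1.1, 1.4, 1.5, 1.7, 1.6, 1.9, 2.0, 2.2, 2.1, 2.4]
    print(mann_kendall(averages))
```
]]></preformat>
        <p>A row would appear in Tables 2 and 3 only when the returned trend is "+" or "-", which corresponds to the p-value &lt; 0.05 filter described above.</p>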
        <p>We extended our analysis to see whether the project size (measured by the number of classes and packages) is correlated with the values of PageRank and Severity. We tested this for each project over its development history and then analyzed the distribution of the correlation values across all projects. The first thing we noticed is that the number of classes and packages increases over time. However, this does not happen for Severity and PageRank values: we do not find a significant correlation between size and the metrics, except for the correlation between PageRank computed on package-level AS and the number of packages in the system. The correlation values, computed for all the projects, range in [0.34, 0.89], with a median equal to 0.74. We hypothesise that the correlation is high for PageRank because of how it is computed: the higher the number of packages, the more the dependencies and the higher the PageRank values. For this reason, one might expect the same to hold for PageRank computed on classes and the number of classes: instead, their correlation values range in [−0.87, 0.9], with a median equal to 0.45. This result may be due to the high variance in the number of classes among the projects (a variance which is smaller for packages).</p>
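        <p>The two rank correlation coefficients used in this analysis can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the Kendall variant shown is tau-b (which corrects for ties), and the sample values are invented rather than taken from the dataset.</p>
        <preformat><![CDATA[
```python
import math


def average_ranks(values):
    """Rank the values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant and tied pairs."""
    concordant = discordant = tied_x = tied_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[j] - x[i], y[j] - y[i]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: contributes to neither term
            if dx == 0:
                tied_x += 1
            elif dy == 0:
                tied_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt((untied + tied_x) * (untied + tied_y))


if __name__ == "__main__":
    # Hypothetical values per version: number of packages and mean PageRank.
    packages = [20, 24, 27, 31, 40]
    mean_pagerank = [0.8, 0.9, 0.85, 1.1, 1.3]
    print(spearman(packages, mean_pagerank), kendall_tau_b(packages, mean_pagerank))
```
]]></preformat>
        <p>Both coefficients depend only on the ordering of the values, which is why they remain applicable when, as here, the data do not follow a normal distribution.</p>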
        <p>RQ1 Answer. How do PageRank and Severity of the smells evolve in the version history of a project? In general, we found that the average values of PageRank and Severity do not have a trend (neither positive nor negative) over time. Concerning the comparison with the evolution of the projects’ size, we found that PageRank computed on packages shows a positive correlation with the evolution of the number of packages: this is reasonable, since an increase/decrease in the number of packages also has an impact on the creation/deletion of package dependencies, and thus on PageRank.</p>
        <p><bold>4.2. Correlation results</bold></p>
        <p>In order to answer RQ2, we report in Table 4 the results of the correlation between Severity and PageRank, evaluated on all AS, not considering their type. As can be seen, the majority of the projects presented a strong positive correlation (ρ &gt; 0.6).</p>
        <p>In the following, we discuss the correlation results by considering the different types of AS. The coefficient values are bounded between:</p>
        <p>• (CDs) 0.427 and 0.942 with Spearman’s and between 0.214 and 0.812 with Kendall’s;</p>
        <p>• (UDs) 0.253 and 1 with Spearman’s and between 0 and 1 with Kendall’s;</p>
        <p>• (HLs) -1 and 1 for both coefficients.</p>
        <p>Due to their low occurrences, the metrics of HL and UD usually present a strong correlation. However, there are cases in some project versions where the scarce number of detected smells makes this calculation misleading: in some cases correlations are very high, in other ones they are very low (they fluctuate).</p>
        <p>is associated to the most updated codebase, hence we assume it is the most exemplary for them.</p>
        <p>By analyzing the correlation coefficients of JMeter’s AS, we noticed that when they are calculated separately for each AS type, they present higher values than the ones reported in Table 4. Using Spearman’s as an example: 0.575 is the ρ value when not considering the AS type, and 0.638, 0.9, 0.881 are the values for CDs, HLs and UDs respectively. The values seem to imply that, while the correlation in general is weak for this project, when we look at the specific smell types the two metrics tend to be positively correlated. However, the number of HLs and UDs in JMeter is very small compared to the number of CDs. Since correlations computed on few observations are not significant, we can conclude that only the correlation value computed on CDs is relevant for JMeter, and this explains why the overall correlation value is weak for this project.</p>
        <p>If we closely analyze the JGraph evolution, initially it shows a negative correlation for CDs at package level, which progressively increases (0.2 in version 5.10.0.1) and becomes strongly positive (0.73) in version 5.12.1.0.</p>
        <p>On the other hand, CD is the most common smell in We further investigated what caused these changes in
the dataset and this has an efect on the correlation values: the correlation values. In the first versions with
negathey largely vary in the dataset, making CD the smell type tive correlation we observed 3 CDs at package level, two
with some of the highest correlation values and at the of them with similar Severity and PageRank values and
same time the smell with some of the lowest correlation one with a strongly higher PageRank value, probably the
values. However, a clear result is that for all projects cause of the negative correlation. After version 5.10.0.1
the correlation at package level between PageRank and we noticed the presence of a 4th one. Its Severity was in
Severity of CD is strong, with the exception of JGraph line with the others and also its PagerRank: this likely
(see the following paragraph). balanced the PageRank values and subsequently caused
the increase of the positive correlation.</p>
        <p>Observations on weak and negative correlations Hence we can conclude that the variations in the
corFrom Table 4 we can observe that some projects, such relations values from negative to positive were due to
as JMeter, Lucene, Weka and Ant show a weak corre- the introduction of a new smell instance, whose metrics
lation between the two metrics. We aim to investigate values strongly impacted the correlation values due to, as
these behaviours and we start by analyzing two projects: for JMeter, the general small amount of smell instances.
JMeter, having a weak correlation, and JGraph, showing However, this specific case does not represent a common
non-positive correlation values for CDs at package level. behaviour in our dataset.</p>
        <p>We focus on the last version of both projects because it</p>
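        <p>The sensitivity of rank correlations to a handful of observations, as discussed for JMeter and JGraph, can be illustrated with a minimal sketch. It is not the analysis pipeline of the study (which relied on R): it is a plain Python implementation of Spearman’s coefficient as the Pearson correlation of average ranks, and all Severity and PageRank numbers below are hypothetical values chosen only for illustration.</p>
<preformat>
```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical package-level CDs: two similar instances plus one whose
# PageRank is much higher while its Severity is lower.
severity_3 = [5.0, 5.2, 4.0]
pagerank_3 = [0.10, 0.11, 0.60]

# The same data after a fourth (hypothetical) CD instance appears.
severity_4 = severity_3 + [6.0]
pagerank_4 = pagerank_3 + [0.70]

print(round(spearman(severity_3, pagerank_3), 3))
print(round(spearman(severity_4, pagerank_4), 3))
```
</preformat>
        <p>With these illustrative numbers the coefficient moves from -0.5 to 0.4 when a single instance is added, which is why coefficients computed on three or four smells should be read with caution.</p>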
        <p>RQ2 Answer (Can we find some correlation between PageRank and Severity by considering each type of smell?): we found that the smell type showing the highest correlation between PageRank and Severity is CD at package level. The other types, HL and UD, also showed strong correlations, but given the lower number of HL and UD instances, we consider the result regarding CDs more meaningful. We also investigated specific cases of projects with weak and negative correlation, but we did not find further insights.</p>
        <p>5. Discussion</p>
        <p>We found a strong correlation between PageRank and Severity. This means that, concerning the analysed data and the considered smells, the criticality and the cost-solving of smells go hand in hand: in the case of this study, if a smell affects an important (unimportant) part of the system, then it will also have a high (low) cost-solving. We can outline two different interpretations of the results. The positive correlation could be due to the nature of the two metrics, both bound to the dependencies of the system. In this case, the conclusion would be that PageRank and Severity capture the same characteristic of the smells, and one of the two is redundant. As a consequence, in the ADI computation [11], only one of the two metrics should be used to evaluate AS criticality.</p>
        <p>However, given how the metrics are defined, they differ from one another. Severity takes into account the dependencies which are directly affected by the smell, while PageRank also considers dependencies outside the smell which converge towards the components affected by the smell. Take for instance the Severity of CD, which is based on the dependencies forming the cycle and their weight. If the components involved in the cycle have a high PageRank, it means that they are involved in many dependencies with many other parts of the system, which is unrelated to the fact that those components are part of the cycle. With such a premise, the two metrics would capture different aspects of the smells, and their positive correlation could mean that critical parts of the system attract AS which are more expensive to solve.</p>
        <p>Moreover, one could ask what the difference is in using PageRank when we could use simple coupling metrics such as FanIn and FanOut [25]. When evaluating the coupling of a component, such metrics take into account only the incoming or outgoing dependencies of the component itself. On the contrary, the PageRank value of a component takes into account the PageRank of all the components belonging to the dependency graph. In particular, the PageRank of a component is defined recursively and depends on the number of dependencies and the PageRank of all the components that reference it (incoming dependencies). In this way, a component having many incoming dependencies but referenced by components with few incoming dependencies is less important than another component with many incoming dependencies and referenced by other components with many incoming dependencies. That is why PageRank is said to evaluate the importance of a component with respect to the entire graph.</p>
        <p>Our analysis shows that the positive correlation is particularly evident in the case of CD. There can be multiple reasons behind the high correlation for CD Severity: a part of the code with high PageRank is subject to more changes [33] than other parts of the code, and is thus more open to the introduction of (structurally complex) CDs. This is interesting because in the literature we find studies which confirm the correlation in the other direction [12], i.e., the presence of AS makes the components more prone to change: if our hypothesis can be further corroborated, the conclusion would be that the relationship between PageRank and CD Severity is like a dog chasing its tail, where one triggers the other. Another reason could be that components with high PageRank are involved in a high number of dependencies, making it easier for a developer to wrongly introduce new entangled dependencies and create cycles that are very difficult to remove.</p>
        <p>To conclude, there is a positive correlation between AS Severity and PageRank; however, at the moment we cannot draw a definitive conclusion about how to interpret this finding. We plan to conduct a validation of our results with developers from industry, who could evaluate the ability of the two metrics to capture criticality and cost-solving, and also manually check the specific cases where smells have high PageRank and high Severity.</p>
        <p>6. Threats to validity</p>
        <p>Our study presents some threats to validity, which we address by following the structure suggested by Yin [34]. Concerning construct validity, the two metrics, PageRank and Severity, may not measure what we claim they do, i.e., the criticality of the AS. However, this is a preliminary study and the next step is to validate the current definition of the metrics with developers, by letting them check whether the prioritization produced by the metrics is significant or not. Other threats regarding internal validity could be related to the choice of the statistical methods used for the correlation analysis and their implementation in the tools used, but we relied on well-known and widely used tools (R language). Moreover, we did not validate the two metrics by investigating the perception developers have of PageRank and Severity. However, PageRank was adopted in other studies as a software ranking metric [35][33][36], and we plan to validate Severity in an industrial setting in the future. Threats to external validity could be caused by the fact that we only analyzed publicly available projects written in Java. However, we partially mitigate such issues by analyzing 10 projects with at least 22 versions each. Moreover, the high number of CDs could have reduced the effect of the other types of detected AS in the results. We could have mitigated this aspect by sampling the CD instances and thus balancing the dataset. However, this would further reduce the size of the dataset, undermining the validity of the CD results too. In the future, we aim to extend the study with additional data for the smells and further address this threat. Finally, concerning threats to the reliability of the study, Arcan could be subject to a systematic bias in the detection, partially mitigated by the provided replication package and the fact that the tool has been validated on open source and industrial projects [23] [28] [<xref ref-type="bibr" rid="ref1">1</xref>] [24]. Moreover, some threats could arise from errors in the data extraction and preparation phases, resulting in errors in the construction of the dataset. However, we carefully checked every stage of the data preparation and relied on the support of Knime<sup>7</sup>.</p>
        <p>7. Conclusion</p>
        <p>We performed an empirical analysis of two software metrics, Severity and PageRank, on at least 22 versions each of 10 projects, in order to evaluate the cost-solving and criticality of AS. We also performed this evaluation to better understand whether both metrics have to be used in the ADI computation, i.e., whether each provides hints on the criticality of the AS that have to be taken into consideration. To conclude, from the analysis of the evolution and correlation of PageRank and Severity we found that the two metrics tend to be correlated, except for some extreme cases. It could be useful for developers to analyze the specific cases where AS have high PageRank and low Severity (and vice-versa), since they could indicate smell instances which require a tailored prioritization rationale: developers may be interested in identifying cases where the smell is easy to solve (low Severity) but in an important part of the system (high PageRank), and choose to refactor these cases first; on the contrary, they could decide not to refactor a smell that is difficult to solve (high Severity) and in an unimportant (low PageRank) part of the system. We can assert that such smells are a signal that both PageRank and Severity could be useful to define different refactoring priorities, from different points of view. In particular, PageRank can be used to identify parts of code which need continuous inspection, while Severity can be used to evaluate the cost-solving for the AS removal.</p>
        <p>The smell type presenting the strongest correlation is CD, suggesting that highly critical components (with high PageRank) attract CDs that are hard to solve (with high Severity). Thus, developers should pay close attention to the CD smell, also because CD is the most common AS and, in particular, CDs at package level tend to become more critical in terms of PageRank over the history of the project. However, we do not exclude the possibility that the two metrics are strongly correlated because they capture the same aspects of smells. In that case, we could exploit this information to refine the computation of our ADI and leave out one of the two.</p>
        <p>In any case, we need to conduct a validation of both metrics and of the correlation results, with expert developers or by comparing the ranking provided by the metrics with information coming from issue trackers [12]. The intuition is that a component affected by a critical smell (with high PageRank and high Severity) should also be affected by many issues. In addition to the validation, in future developments we aim to extend this work by analyzing more projects, also coming from industry, and verify whether the same results can be confirmed.</p>
        <p>In this paper, we addressed the criticality evaluation of three AS, but the study can also be extended to other kinds of AS, e.g., Scattered Functionality and Feature Concentration, two smells which violate the separation of concerns principle. Given that such smells are not based on dependency issues, we shall define additional criticality metrics for them.</p>
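        <p>The difference between FanIn and PageRank described in Section 5 can be made concrete with a minimal sketch. It assumes the textbook PageRank formulation (power iteration, damping factor 0.85, rank of components without outgoing dependencies spread evenly), which is not necessarily the variant implemented in Arcan; the component names and the dependency graph are hypothetical.</p>
<preformat>
```python
def pagerank(edges, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration over (dependent, dependency) edges."""
    nodes = sorted({name for edge in edges for name in edge})
    targets = {name: [] for name in nodes}
    for dependent, dependency in edges:
        targets[dependent].append(dependency)
    n = len(nodes)
    rank = {name: 1.0 / n for name in nodes}
    for _ in range(iterations):
        # Components without outgoing dependencies spread their rank evenly.
        dangling = sum(rank[name] for name in nodes if not targets[name])
        base = (1.0 - damping) / n + damping * dangling / n
        updated = {name: base for name in nodes}
        for dependent in nodes:
            for dependency in targets[dependent]:
                share = rank[dependent] / len(targets[dependent])
                updated[dependency] += damping * share
        rank = updated
    return rank


def fan_in(edges):
    """Count the incoming dependencies of each referenced component."""
    counts = {}
    for _, dependency in edges:
        counts[dependency] = counts.get(dependency, 0) + 1
    return counts


# Hypothetical dependency graph: A and B are both referenced by two components,
# but only the components referencing B are themselves referenced by others.
edges = [
    ("U1", "A"), ("U2", "A"),
    ("H1", "B"), ("H2", "B"),
    ("X1", "H1"), ("X2", "H1"),
    ("X3", "H2"), ("X4", "H2"),
]

fan = fan_in(edges)
scores = pagerank(edges)
print("FanIn:   ", fan["A"], fan["B"])
print("PageRank:", round(scores["A"], 4), round(scores["B"], 4))
```
</preformat>
        <p>In this toy graph A and B have the same FanIn, yet B obtains a higher PageRank because the components referencing it are themselves referenced by other components, which is the recursive notion of importance that FanIn alone cannot express.</p>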
      </sec>
      <sec id="sec-2-3">
        <title>References</title>
        <p><sup>7</sup> https://www.knime.com/knime-analytics-platform</p>
        <p>[6] L. Rizzi, F. A. Fontana, R. Roveda, Support for architectural smell refactoring, in: Proceedings of the 2nd International Workshop on Refactoring, IWoR@ASE, 2018, pp. 7–10.</p>
        <p>[7] I. Pigazzini, F. A. Fontana, B. Walter, A study on correlations between architectural smells and design patterns, J. Syst. Softw. (2021).</p>
        <p>[8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Seventh International World-Wide Web Conference, 1998.</p>
        <p>[9] I. Şora, A pagerank based recommender system for identifying key classes in software systems, in: 10th Jubilee International Symposium on Applied Computational Intelligence and Informatics, 2015.</p>
        <p>[10] F. A. Fontana, I. Pigazzini, C. Raibulet, S. Basciano, R. Roveda, Pagerank and criticality of architectural smells, in: Proceedings of the 13th European Conference on Software Architecture, ECSA 2019, 2019.</p>
        <p>[11] F. A. Fontana, P. Avgeriou, I. Pigazzini, R. Roveda, A study on architectural smells prediction, in: 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), IEEE, 2019.</p>
        <p>[12] D. M. Le, D. Link, A. Shahbazian, N. Medvidovic, An empirical study of architectural decay in open-source software, in: 2018 IEEE International Conference on Software Architecture (ICSA), 2018.</p>
        <p>[13] F. A. Fontana, V. Lenarduzzi, R. Roveda, D. Taibi, Are architectural smells independent from code smells? an empirical study, Journal of Systems and Software 154 (2019) 139–156.</p>
        <p>[14] T. Sharma, P. Singh, D. Spinellis, An empirical investigation on the relationship between design and architecture smells, Empirical Software Engineering (2020).</p>
        <p>[15] S. Herold, An initial study on the association between architectural smells and degradation, in: Software Architecture, Springer International Publishing, Cham, 2020, pp. 193–201.</p>
        <p>[16] J. A. D. P. Santiago A. Vidal, Claudia Marcos, An approach to prioritize code smells for refactoring, Autom. Softw. Eng. 23 (2016) 501–532.</p>
        <p>[17] A. Rani, J. K. Chhabra, Prioritization of smelly classes: A two phase approach (reducing refactoring efforts), in: 2017 3rd International Conference on Computational Intelligence Communication Technology (CICT), 2017.</p>
        <p>[18] N. Sae-Lim, S. Hayashi, M. Saeki, Context-based approach to prioritize code smells for refactoring, Journal of Software: Evolution and Process (2017).</p>
        <p>[19] F. A. Fontana, M. Zanoni, Code smell severity classification using machine learning techniques, Knowl. Based Syst. 128 (2017).</p>
        <p>[20] T. Guggulothu, S. A. Moiz, An approach to suggest code smell order for refactoring, in: International Conference on Emerging Technologies in Computer Engineering, Springer, 2019, pp. 250–260.</p>
        <p>[21] A. Oliveira, L. Sousa, W. Oizumi, A. Garcia, On the prioritization of design-relevant smelly elements: A mixed-method, multi-project study, in: Proceedings of the XIII Brazilian Symposium on Software Components, Architectures, and Reuse, SBCARS ’19, Association for Computing Machinery, 2019.</p>
        <p>[22] R. Terra, L. F. Miranda, M. T. Valente, R. S. Bigonha, Qualitas.class Corpus: A compiled version of the Qualitas Corpus, Software Engineering Notes 38 (2013).</p>
        <p>[23] F. A. Fontana, I. Pigazzini, R. Roveda, M. Zanoni, Automatic detection of instability architectural smells, in: 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016, 2016.</p>
        <p>[24] F. A. Fontana, F. Locatelli, I. Pigazzini, P. Mereghetti, An architectural smell evaluation in an industrial context, ICSEA 2020 (2020) 78.</p>
        <p>[25] R. C. Martin, Object oriented design quality metrics: An analysis of dependencies, ROAD 2 (1995).</p>
        <p>[26] G. Suryanarayana, G. Samarthyam, T. Sharma, Refactoring for Software Design Smells, 1 ed., Morgan Kaufmann, 2015.</p>
        <p>[27] D. Sas, P. Avgeriou, F. A. Fontana, Investigating instability architectural smells evolution: An exploratory case study, in: Int. Conference on Software Maintenance and Evolution, ICSME, 2019.</p>
        <p>[28] F. Arcelli Fontana, I. Pigazzini, R. Roveda, D. A. Tamburri, M. Zanoni, E. D. Nitto, Arcan: A tool for architectural smells detection, in: Int’l Conf. Software Architecture (ICSA 2017) Workshops, 2017.</p>
        <p>[29] M. B. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data, Biometrika 55 (1968) 1–17.</p>
        <p>[30] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (1965) 591–611.</p>
        <p>[31] C. Spearman, The proof and measurement of association between two things, The American Journal of Psychology 15 (1904) 72–101.</p>
        <p>[32] M. Kendall, J. Gibbons, Rank Correlation Methods, Charles Griffin Book, E. Arnold, 1990.</p>
        <p>[33] R. Wang, R. Huang, B. Qu, Network-based analysis of software change propagation, The Scientific World Journal 2014 (2014).</p>
        <p>[34] R. Yin, Case Study Research: Design and Methods, Applied Social Research Methods, SAGE Publications, 2009.</p>
        <p>[35] F. Perin, L. Renggli, J. Ressia, Ranking software artifacts, in: 4th Workshop on FAMIX and Moose in Reengineering (FAMOOSr 2010), volume 120, Citeseer, 2010.</p>
        <p>[36] W.-f. PAN, B. LI, Y.-t. MA, B. JIANG, Identifying the key packages using weighted pagerank algorithm, ACTA ELECTRONICA SINICA 42 (2014) 2174.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Martini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Arcelli Fontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Biaggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roveda</surname>
          </string-name>
          ,
          <article-title>Identifying and prioritizing architectural debt through architectural smells: a case study in a large software company</article-title>
          ,
          <source>in: Proc. of the European Conf. on Software Architecture (ECSA)</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Ernst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bellomo</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Ozkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Nord</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gorton</surname>
          </string-name>
          ,
          <article-title>Measure it? manage it? ignore it? software practitioners and technical debt</article-title>
          ,
          <source>in: Proc. of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE</source>
          <year>2015</year>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Oizumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Díaz</given-names>
            <surname>Pace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <article-title>Ranking architecturally critical agglomerations of code smells</article-title>
          ,
          <source>Science of Computer Programming</source>
          <volume>182</volume>
          (
          <year>2019</year>
          )
          <fpage>64</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Taibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Janes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lenarduzzi</surname>
          </string-name>
          ,
          <article-title>How developers perceive smells in source code: A replicated study</article-title>
          ,
          <source>Information and Software Technology</source>
          <volume>92</volume>
          (
          <year>2017</year>
          )
          <fpage>223</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pecorelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Palomba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khomh</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. De Lucia</surname>
          </string-name>
          ,
          <article-title>Developer-driven code smell prioritization</article-title>
          ,
          <source>in: Proceedings of the 17th International Conference on Mining Software Repositories, MSR '20</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>