=Paper=
{{Paper
|id=Vol-2978/msr4sa-paper2
|storemode=property
|title=Two Different Facets of Architectural Smells Criticality: An Empirical Study
|pdfUrl=https://ceur-ws.org/Vol-2978/msr4sa-paper2.pdf
|volume=Vol-2978
|authors=Ilaria Pigazzini, Davide Foppiani, Francesca Arcelli Fontana
|dblpUrl=https://dblp.org/rec/conf/ecsa/PigazziniFF21
}}
==Two Different Facets of Architectural Smells Criticality: An Empirical Study==
<pdf width="1500px">https://ceur-ws.org/Vol-2978/msr4sa-paper2.pdf</pdf>
<pre>
Two different facets of architectural smells criticality: an
empirical study
Ilaria Pigazzini¹, Davide Foppiani¹ and Francesca Arcelli Fontana¹
¹ University of Milano - Bicocca, Milan, Italy


                                             Abstract
                                             Architectural smells (AS) represent symptoms of problems at architectural level that have an impact on architectural debt. It
                                             is important to identify among them the most critical ones, so that developers can prioritize them for their removal. In order
                                             to evaluate the criticality of AS, in this paper we consider two facets: the PageRank metric, to assess the centrality of a smell
                                             in a project, and Severity, a metric to estimate the cost-solving of smells. We have proposed these two metrics in a previous
                                             work and here we perform an empirical analysis of the evolution and correlation of these metrics in the version history
                                             of 10 projects (at least 22 versions each, 264 versions in total). The analysis of the evolution is useful in order to identify
                                             which architectural smell types tend to become more critical. The analysis of the correlation is useful to study whether the
                                             criticality of a smell has an influence on how much it costs to remove it, and vice-versa.

                                             Keywords
                                             Architectural Smells, Architectural Debt, Architectural Smells criticality, Architectural Smells evolution, Empirical study


1. Introduction                                                                                                       cated in a central part of the project and other facets.
                                                                                                                      Moreover, while criticality gives us information about
   Architectural debt can be monitored through differ-                                                                the removal urgency, there is another aspect connected
ent issues, such as through the presence of architectural                                                             to the removal of smells which can be considered and
smells in a project. Architectural smells (AS) are de-                                                                quantified. AS have a cost-solving (cost of fixing, cost of
sign decisions that negatively impact internal software                                                               refactoring), which is the effort needed to remove a smell
qualities and are symptoms of architectural debt [1], [2].                                                            from the system [6]. This variable depends less on the
Software systems affected by AS are difficult to main-                                                                perception of the developers and more on the specific
tain and evolve, hence it is important to study them and                                                              characteristics of the AS involved.
identify solutions to support developers in their removal,                                                               To summarize, during AS management, developers can
in particular the removal of the most critical ones (AS                                                               take into consideration two distinct aspects concerning
prioritization).                                                                                                      smells: their criticality, i.e., how important it is to
   In such terms, criticality of an AS models the degree                                                              remove them as soon as possible (urgency), and their
of removal urgency associated with the AS, i.e., the smell                                                            cost-solving, i.e., how much it costs to remove them.
should be removed as soon as possible because it affects a                                                               Both criticality and cost-solving are particularly rele-
part of the project which is important for the developers                                                             vant for developers when making decisions about AS
(e.g., frequently changed or highly referenced) or has a                                                              management: for instance, to choose which smell to
strong impact on the maintainability of the project.                                                                  refactor first [1][5]. A developer may prefer to refac-
   However, it is not trivial to model and evaluate the                                                               tor first the smells which require less time to be solved
importance and urgency of the removal of an AS. In the                                                                (low cost solving) to quickly enhance the quality level
literature, the identification of the best metrics to be used                                                         of the project, instead of fixing the most critical ones.
for the evaluation of criticality is considered a complex                                                             On the other hand, the developer may decide to remove
task [3], mainly because it is tightly connected to how                                                               the most difficult/critical ones, but to make this decision,
smells are perceived by developers [4] and such percep-                                                               different factors must be considered: it can be too ex-
tion is subject to many variables, such as developer                                                                  pensive and risky; too many changes could compromise
experience, code ownership [5], whether the smell is lo-                                                              other parts. Perhaps, the most difficult AS was created
                                                                                                                      by design choice and no better solution is available, as in
MSR4SA’21: 1st International Workshop on Mining Software                                                              the case of cycles created by callbacks for event listeners
Repositories for Software Architecture, September 15–17, 2021,
Virtual
                                                                                                                      in GUI components [1][7]. Finally, the most critical AS
email: i.pigazzini@campus.unimib.it (I. Pigazzini);                                                                   could appear in a non-central part of the project, such
d.foppiani@campus.unimib.it (D. Foppiani);                                                                            as a deprecated, unessential package, and may not be
arcelli@disco.unimib.it (F. A. Fontana)                                                                               of interest to the developers.
orcid: 0000-0003-2629-6762 (I. Pigazzini); 0000-0002-1195-530X
                                                                                                                         In this paper, we consider two metrics, PageRank and
(F. A. Fontana)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative   Severity, and we propose to use them to model the
                                       Commons License Attribution 4.0 International (CC BY 4.0).                     criticality (PageRank) and the cost-solving (Severity) of three
                                       CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
AS based on dependency issues, namely: Cyclic Depen-             relevance of the metrics for each type of smell. Other-
dency, Unstable Dependency, and Hub-Like Dependency              wise, with no correlation, we could infer that there is no link
(see Section 3.2). PageRank, inspired by the well-known          between the urgency of removing a smell and the cost of
metric from Brin and Page [8], is a measure that estimates       removing a smell, as computed by the proposed metrics.
whether an AS is located in an important part of the             In this case a developer can decide not to remove an AS
project [9], where the importance is evaluated according         with low PageRank and high cost solving, and to remove
to how many parts of a project depend on the ones in-            first an AS with high PageRank and low cost solving,
volved in the AS (as a sort of centrality measure of the         because this AS appears in a central part of the project
AS). We want to use PageRank as a proxy of AS criticality,       and could therefore become more critical.
i.e., the higher the PageRank, the higher the criticality of        With our study, we aim to give developers insights
the AS. Severity, defined by us, is a measure associated         into the evaluation of criticality and cost solving of AS
with each specific type of AS and is computed through the        through the PageRank and Severity metrics. The Severity
metrics used to detect each smell. Our idea is that the AS       metric focuses on evaluating the cost solving in terms
characteristics, such as the number of system dependen-          of the number of project dependencies affected by the
cies it affects, are useful to estimate how much effort is       smells, while PageRank is more focused on the impor-
required to refactor the smell (cost-solving), e.g., a smell     tance (criticality) of the affected components (classes/-
which involves many dependencies will require a deep             packages). Hence, both metrics could be useful to de-
analysis and a lot of time to be solved.                         termine the prioritization of AS, i.e., help the developer
   We have considered these two metrics in a previous            in choosing which smell to refactor first depending on
study [10], where PageRank and Severity have been eval-          the developer’s needs, i.e., the need to address the most
uated on only 6 single-version projects. We have now             critical ones first or the most expensive ones.
extended the study by conducting an empirical evaluation            We have considered the two metrics in the computa-
on a total of 264 versions of 10 projects with the aim to        tion of an Architectural Debt Index [11] based on the
empirically study criticality and cost-solving during the        number of the AS found in a project and their critical-
evolution of the projects, and investigate whether there         ity measured in terms of both PageRank and Severity
is a correlation between the trends of the two metrics, to       metrics. The results of this study can be useful also to
answer the following Research Questions (RQ):                    evaluate whether the two metrics truly capture different
   RQ1: How do PageRank and Severity of the smells evolve        aspects of a smell or not. In the latter case, one of the
in the version history of a project?                             two metrics could be left out.
   RQ2: Can we find some correlation between PageRank               The paper is organized through the following sections:
and Severity by considering each type of smell?                  in Section 2 we introduce some related work, in Section 3
   The answer to RQ1 aims to analyze if the values of the        we describe the study design, in Section 4 we provide the
two metrics tend to increase or decrease in the version          results we obtained to answer the RQs. Section 5 presents
history of the projects. Moreover, we are interested in          the discussion of the results and Section 6 outlines some
understanding which AS type(s) tend to become more               threats to the validity of the work. Finally, in Section 7 we
critical and/or difficult to remove in the version history       conclude our work by outlining some threats to validity
of a project, where the criticality is evaluated through         and future developments.
the PageRank and the cost solving is estimated with the
Severity metric. In this way a developer can decide to
focus the attention on these types of smells first.              2. Related Work
   The answer to RQ2 allows us to evaluate the correlation
                                                                    We first briefly describe some empirical studies on
between the criticality and the cost solving of a smell. If
                                                                 architectural smells.
for example the values tend to go together, highly corre-
                                                                    Le et al. [12] investigated the nature and impact of
lated, for a specific type of AS, it means that as long as the
                                                                 architectural smells through a large empirical study, by
smell is critical, it is also hard to remove and vice-versa:
                                                                 exploiting the projects’ issue trackers to analyze the im-
in this case, the two metrics would produce the same
                                                                 pact of smells on software development; Arcelli et al. [13]
ranking of smells, i.e., the prioritization of the smells
                                                                 studied the relationship between code smells and archi-
would be equal by considering one of the two metrics
                                                                 tectural smells and found that architectural smells are
interchangeably. In case of positive correlation, it could
                                                                 independent of code smells; Sharma et al. [14] con-
still be interesting to analyze possible out-
                                                                 ducted an empirical study to investigate the relationship
liers with different values of the metrics (high/low) and
                                                                 between design and architectural smells in C# projects.
better capture the relevance of the metrics (see examples
                                                                 Finally Herold [15] performed a preliminary empirical
in Section 4.2). We could find that the two metrics have
                                                                 study to investigate the relationship between architec-
a strong positive correlation for a specific type of smell,
                                                                 tural smells and architectural degradation, the latter mea-
and not for other smells. This scenario can outline the
sured through the number of architectural violations.           tend the previous work on a large number of projects
   With respect to these previous papers, we performed          (10 projects, at least 22 versions each, for a total of 264 versions),
an empirical study focused on the evaluation of different       and we analyze the correlation existing between the two
facets of architectural smells criticality, not previously      metrics through Spearman and Kendall correlation tests.
studied in the literature, to the best of our knowledge.        Moreover, we study the evolution of the metrics in the
   We now outline some related works done in the liter-         project history. Finally, in this paper we propose to ex-
ature on the evaluation of criticality and prioritization       ploit PageRank as a proxy for criticality, and Severity as
of code or architectural smells. What distinguishes the         a metric to estimate cost-solving.
following works is the kind of information used to es-
timate the priority of a smell. For instance, concerning
code smells, Vidal et al. [16] presented an approach to         3. Case Study Design
identify the most critical smells based on a combination
                                                                  We describe below the analyzed projects, the data we
of three criteria, namely: past component modifications,
                                                                collected on AS, their Severity and PageRank and the
important modifiability scenarios for the system and rel-
                                                                data preparation and analysis.
evance of the kind of smell. Also Rani et al. [17] pro-
posed a methodology for code smell prioritization. First,
it detects smelly classes using structural information of       3.1. Analyzed projects
source code, then mines change history, as done by Vidal           We analyzed several versions of 10 projects, for a total
et al., to prioritize the smells. Still regarding code          of 264 versions (see Table 1). Most of the chosen projects
smell studies, Sae Lim et al. [18] exploited the developers’    were picked from the Qualitas Corpus [22]. We selected
context (a list of issues extracted from an issue tracking      these projects since they have already been the subject of
system) to define priority. Instead, Arcelli et al.[19] pro-    several studies, they are publicly available and enable the
posed a severity index of the smells based on how the           replication of this study. These data were also combined
metric thresholds used for the smell detection are ex-          with data from the MavenRepository¹, also publicly avail-
ceeded. Similarly, Guggulothu et al. [20] proposed a pri-       able. We considered several releases for each project. To
oritisation approach for four code smells (Long Method,         easily compare the different projects, we chose roughly
Feature Envy, God Class and Data Class), depending on           the same number of versions and preferred different re-
their impact on design quality, where the impact is mea-        leases, major or minor, over patches when possible. In
sured through a set of metrics such                             general, in this paper we use the term version to refer
as coupling, size, complexity and cohesion. More                to both minor and major releases. The chosen systems also vary
recently, Pecorelli [5] proposed a machine learning ap-         in size and number of smells (see Table 1). In the column
proach to prioritise the application of refactoring on code     group last version we report the projects’ size (in terms
smells. They generated a rank of code smells according to       of classes/packages) and number of AS of the last version
the perceived criticality that developers assign to them.       of the project in the development history.
   Concerning architectural smells, there are fewer stud-
ies about prioritization. Martini et al. [1] performed a
study on the analysis of the most critical AS through           3.2. Data collection
the feedback of the developers of two industrial projects.      Architectural smells. We performed this study by con-
The smells having top refactoring priority in the opinion       sidering the AS detected with the Arcan tool² [23] de-
of practitioners are the ones with the highest negative         scribed below, but other AS can be considered in the
impact on the maintainability and evolvability of the           future [24]. We limited the analysis to the following
project. Along the same lines, Oliveira et al. [21] investi-    three smells since they are the only ones for which we
gated criteria that developers use in practice to prioritize    developed a Severity metric, together with the defini-
design-relevant smelly elements with the aim to develop         tion of our Architectural Debt Index (ADI) [11].
a set of prioritization heuristics. From their results, two
out of nine heuristics reached an average precision higher              • Unstable Dependency (UD) describes a component
than 75%. Finally, Vidal et al. [3] presented and evaluated               (package) dependent on other components that
a set of five criteria for ranking groups of code smells as               are less stable than itself; this may cause a ripple
indicators of architectural problems in evolving systems.                 effect of changes in the system. Instability of a
   To the best of our knowledge, no extensive work has                    component is measured with the metric proposed
been previously done on the analysis of the evolution                     by Martin [25] as the ratio of outgoing depen-
and correlation between criticality and cost-solving, eval-               dencies to the total number of dependencies of
uated in terms of PageRank of AS and Severity metrics.          ¹ https://mvnrepository.com/
In a previous study [10] we only manually analyzed the          ² Download: https://drive.google.com/file/d/1WNx7FHRykbyOIxz92cDQpSL2rl_gEJ4P/view?usp=sharing
two metrics by considering only 6 projects. Here we ex-
Table 1
Summary of the dataset

                     last version     |                  all versions
    Project     #V   #Cl.  #Pkg   #AS |  #CD-Cl  #CD-Pkg  #HL-Cl  #HL-Pkg   #UD     #AS
    Ant         24   1157    62   413 |    8131     2064      15       92   243   10545
    Azureus     24   8148   480  7722 |   97172    29801      41       70  3478  130562
    FreeCol     24   1310    35  1395 |   30488     1652      86       54   356   32636
    Hibernate   24   2980   170  1172 |   12910     9026      18      129  1267   23350
    JMeter      26    681    55   307 |    3930     2681      79       54   574    7318
    JGraph      24    188    20   118 |    2602       79      79        1    51    2812
    Jstock      24    865    19   665 |   13585      619      64        8   247   14523
    Jung        22    705    40   133 |     894      658      31       27   270    1880
    Lucene      31   1425    22   408 |    6241      407       9       59   187    6903
    Weka        44   2423    80  1090 |   25241     5200     102       41  1042   31626
    Acronyms. V: versions, Cl: classes, Pkg: packages, AS: Architectural Smells, CD: Cyclic Dependency,
    HL: Hub-Like Dependency, UD: Unstable Dependency


       the component. Consequences: The components             cal, since they have higher maintenance costs. In particu-
       with a high instability are more prone to change        lar, Cyclic Dependency is one of the most common smells
       with respect to the more stable ones; this means        and is considered the most critical smell by developers
       that the component which depends on less stable         [1].
       components is forced to change along with them.            We used our Arcan tool for the AS detection, since it is
     • Hub-Like Dependency (HL) arises when a compo-           publicly available, allows us to easily detect the considered
       nent (class or package) has outgoing and incom-         AS and has been previously validated [28]. We computed³
       ing dependencies with a large number of other           the PageRank and Severity metrics related to the three
       components [26]. The affected component rep-            types of smells and we reported the “granularity level”
       resents a unique point of failure for the system        of the considered smells, either class or package. Our
       and also a dependency bottleneck. Consequences:         distinction between AS at class and package level can be
       The component in the middle of the hub is a             mapped to another nomenclature adopted in the litera-
       unique point of failure and a dependency bot-           ture [14] which calls “design smells” our class AS and
       tleneck. Moreover the logic inside a Hub-Like           “architectural smells” our package AS.
       Dependency is hard to understand, and the smell            We now report the definition of the two metrics under
       causes change ripple effect.                            analysis.
     • Cyclic Dependency (CD) refers to a component               Severity is a metric that we defined for each type of
       (class or package) that is involved in a chain of re-   AS to estimate the AS cost solving. In particular, it evalu-
       lations that break the desirable acyclic nature of a    ates different features of the smells which have an impact
       component’s dependency structure. Components            on the effort needed for their removal. For example, for the
       involved in a CD cannot be reused in isolation          estimation of Hub-Like Dependency cost-solving, we con-
       and a change on one component propagates to             sider the number of dependencies affected by the smell,
       the other ones. Consequences: The components            because this metric gives us information about how many
       involved in a dependency cycle can hardly be            parts of code a developer must investigate/change/remove to
       released, maintained or reused in isolation. More-      refactor the HL.
       over, a change on one affected component will              Severity is computed differently for each type of AS:
       propagate towards all the other ones involved in        for UD it is evaluated through the number of bad de-
       the cycle.                                              pendencies which cause the Unstable Dependency smell,
                                                               where by bad dependency we mean a reference from
   We considered these three AS because they are some          the affected package to a less stable package, i.e., if
of the most studied smells [27][13][11][15] and they are       package B has high instability and package A has low
also perceived as important and detrimental for the qual-      instability, the dependency A → B is a bad dependency;
ity of the software systems by practitioners[1][24]. In        for HL the Severity corresponds to the total number of
particular, these smells are based on dependency issues.       dependencies which cause the HL smell (dependencies
Dependencies are of great importance in software archi-        from a class/package directed to the hub and vice-versa);
tecture: components that are highly coupled and with a
high number of dependencies are considered more criti-         ³ https://figshare.com/articles/dataset/_/13636472
for CD it is computed through the number of compo-             of the data. The resulting dataset is a collection of 262155
nents involved in the cycle multiplied by the minimum          smells categorized by project, version, type, granularity
number of times a cycle repeats itself. A dependency be-       level, Severity and PageRank. Table 1 shows the sum-
tween two components can occur multiple times because          mary of our dataset, where we report the project size and
we count the number of references from a class/package         the number of smell instances, divided by type: for each
to the others. For instance, if there is a cycle between       project (considering all versions in history) we show the
package A and B, caused by 5 classes belonging to A            number of detected CD at class and package level (CD-Cl
calling B, and B’s classes calling A 3 times, the Severity     and CD-Pkg), of detected HL at class and package level
value is equal to 3. This means that the cycle is repeated     (HL-Cl and HL-Pkg), of detected UD (UD) and the sum of all
at least 3 times.                                              project’s AS (AS). A smell instance corresponds to one
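
The Severity definitions given for UD and HL can be illustrated with a minimal sketch. It assumes that dependencies are available as a list of (source, target) pairs and that instability is Martin's ratio of outgoing dependencies to all dependencies of a component. The function names and the example graph are hypothetical and are not taken from Arcan; the CD case is left out because it needs the cycle repetition counts.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (source, target): source depends on target


def instability(edges: List[Edge], component: str) -> float:
    """Ratio of outgoing dependencies to all dependencies of a component."""
    outgoing = sum(1 for src, _ in edges if src == component)
    incoming = sum(1 for _, dst in edges if dst == component)
    total = incoming + outgoing
    return outgoing / total if total else 0.0


def ud_severity(edges: List[Edge], package: str) -> int:
    """Count the bad dependencies: references to less stable packages."""
    own = instability(edges, package)
    return sum(1 for src, dst in edges
               if src == package and instability(edges, dst) > own)


def hl_severity(edges: List[Edge], hub: str) -> int:
    """Count all dependencies entering or leaving the hub."""
    return sum(1 for src, dst in edges if hub in (src, dst))


# Hypothetical package graph used only for illustration.
example_edges = [("X", "A"), ("Y", "A"), ("Z", "A"),
                 ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
example_ud = ud_severity(example_edges, "A")
example_hl = hl_severity(example_edges, "A")
```

In this example package A has instability 0.4 and depends on B, whose instability is higher, so A → B is the only bad dependency of A.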
   PageRank of an AS evaluates the criticality (urgency)       occurrence of the smell in the project, thus the reported
associated to an AS. The PageRank value of a smell in-         numbers are the counts of all the occurrences.
stance is computed as the mean value of the PageRank of            We studied two different aspects: 1) Severity and
the components (class or package) affected by the smell.       PageRank evolution, in order to answer RQ1; 2) Severity
The intuition is that components with high PageRank are        and PageRank correlation to answer RQ2.
important inside the project, where the importance [9]             Concerning evolution, we analyzed the evolution of the
corresponds to how many parts of the project depend            two metrics for each type of smell in order to study their
on the component. PageRank of a component is com-              different behaviours. We summarised the data for each
puted through the PageRank formula implemented by              version by averaging the values of both metrics with
Brin and Page [8], executed on the dependency graph of         respect to the total number of smells detected in the
the project:                                                   version. We conducted trend analysis to understand how
\[ PR(v) = \frac{1-d}{N} + d \left( \sum_{k=1}^{n} \frac{PR(p_k)}{C(p_k)} \right) \tag{1} \]
                                                               the average values of PageRank and the different types
                                                               of Severity evolve over time. We exploited the Mann-
                                                               Kendall test, which is a non-parametric test able to assess
                                                               if there is a monotonic upward or downward trend of the
where the vertex 𝑣 is a node of the dependency graph           variable of interest over time. The null hypothesis for
associated with a project; 𝑃𝑅(𝑣) is the value of PageRank      this test is that there is no monotonic trend in the series.
of the vertex 𝑣; 𝑁 is the total number of AS in the project;   The alternate hypothesis is that a trend exists. This trend
𝑝𝑘 is a vertex with at least one link directed to 𝑣; 𝑛 is the  can be positive, negative, or non-null. We also analyzed
number of the 𝑝𝑘 vertices; 𝐶(𝑝𝑘) is the number of links        the two metrics’ evolution with respect to the evolution of
of vertex 𝑝𝑘; 𝑑 (damping factor) is a custom factor fixed      the size, where size corresponds to the number of classes
at 0.85, a default value defined by Brin and Page.             and packages of the projects under analysis, to check
   The range of the metric spans from 0 to infinite and        whether the two things are correlated. We ran Spearman
higher values correspond to higher criticality. To as-         and Kendall correlation tests to investigate this aspect.
sociate a unique value of PageRank to a single smell               Concerning the correlation analysis of PageRank and
instance, we compute the mean value of the PageRank            Severity, we first tested the normality of our data. Given
scores of all the components involved in the smell. In this    the large size of our dataset, we used Q-Q plots [29] to
way, smells of any type can be ordered by this metric,         evaluate if the measures do not follow a normal distri-
from the most critical to the less critical.                   bution. A Q-Q plot is a graphical method for comparing
   Both Severity and PageRank are based on the project         two probability distributions by plotting their quantiles
dependencies, however they are computed in differ-             against each other. These plots are often used when the
ent ways and aim to evaluate two distinct aspects: im-         dataset is large enough to introduce bias in the Shapiro-
portance/criticality (for PageRank) and dependencies           Wilk test [30], which is a commonly used normality test.
structure/cost-solving (for Severity). Hence, we per-          The Q-Q plots of all the projects showed a non-normal be-
formed a correlation analysis to investigate the possible      haviour. Then, we tested the correlation between Severity
relationship between the two metrics.                          and PageRank for each version of the projects. We com-
                                                               puted the correlation on the metrics data of all smell type
3.3. Data preparation and analysis                             together and also separately for each smell type. We also
                                                               computed the correlation separately for each granularity
  We ran Arcan and we pre-processed the output data in         level, to contextualize the results at package or class level.
order to produce the dataset for our analysis. Other than      Given the non-normal distribution of our data, we chose
Arcan, we exploited the Knime platform4 and R program-         the Spearman’s [31] and Kendall’s [32] coefficients to
ming language5 for the processing and statistical analysis     calculate the correlation.
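
As an illustration of the trend analysis described in this section, the Mann-Kendall test can be
sketched in a few lines of Python. This is not the R code used in the study: the function name
`mann_kendall`, the normal approximation with tie and continuity corrections, and the sample values
are assumptions made for the example.

```python
import math


def mann_kendall(series, alpha=0.05):
    """Two-sided Mann-Kendall test; returns (trend, p_value, s)."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")

    # S = (#later values that are larger) - (#later values that are smaller).
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)

    # Variance of S under the null hypothesis, corrected for tied values.
    counts = {}
    for value in series:
        counts[value] = counts.get(value, 0) + 1
    var_s = n * (n - 1) * (2 * n + 5)
    for t in counts.values():
        var_s -= t * (t - 1) * (2 * t + 5)
    var_s /= 18.0

    # Continuity-corrected Z score and two-sided p-value (normal approximation).
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2.0))

    if p_value < alpha:
        trend = "+" if z > 0 else "-"
    else:
        trend = "no trend"
    return trend, p_value, s


# Hypothetical per-version averages of a metric (made-up numbers, 22 versions).
avg_metric = [0.010 + 0.001 * k for k in range(22)]
print(mann_kendall(avg_metric))
```

A series is reported with a "+" or "-" trend only when the p-value is below the chosen threshold,
which mirrors the 0.05 cut-off used for the tables of Section 4.1.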
    4 https://www.knime.com/knime-analytics-platform
    5 https://www.r-project.org/
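
Equation (1) can be evaluated by fixed-point iteration. The sketch below is a minimal illustration
and not Arcan's implementation. It assumes the common Brin and Page reading of the symbols: N is
taken as the number of vertices of the dependency graph and C(p_k) as the number of outgoing links
of p_k. The graph, the component names and the helper `smell_pagerank` are hypothetical.

```python
def pagerank(depends_on, d=0.85, max_iter=100, tol=1e-12):
    """Iterate Eq. (1) on a graph given as {component: [components it depends on]}."""
    nodes = set(depends_on)
    for targets in depends_on.values():
        nodes.update(targets)
    n = len(nodes)  # assumption: N = number of vertices

    out_links = {v: set(depends_on.get(v, ())) for v in nodes}
    referrers = {v: [] for v in nodes}  # the p_k vertices of each v
    for src, targets in out_links.items():
        for dst in targets:
            referrers[dst].append(src)

    pr = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_pr = {
            v: (1.0 - d) / n
            + d * sum(pr[p] / len(out_links[p]) for p in referrers[v])
            for v in nodes
        }
        converged = max(abs(new_pr[v] - pr[v]) for v in nodes) < tol
        pr = new_pr
        if converged:
            break
    return pr


def smell_pagerank(pr, components):
    """PageRank of a smell instance: mean score of the components it involves."""
    return sum(pr[c] for c in components) / len(components)


# Hypothetical dependency graph: "core" and "util" both have two incoming
# dependencies, but the referrers of "core" are themselves heavily referenced.
graph = {
    "a": ["hub1"], "b": ["hub1"], "c": ["hub1"],
    "e": ["hub2"], "f": ["hub2"], "g": ["hub2"],
    "hub1": ["core"], "hub2": ["core"],
    "x": ["util"], "y": ["util"],
    "core": [], "util": [],
}
scores = pagerank(graph)
print(scores["core"], scores["util"])
print(smell_pagerank(scores, ["core", "util"]))
```

In this toy graph "core" receives a higher score than "util" although both have the same number of
incoming dependencies, because the score of a vertex depends on the scores of its referrers.
Components without outgoing dependencies are left as they are, so the scores do not necessarily sum
to 1 in this sketch.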
4. Results

   We report the results both for PageRank and Severity evolution and their correlation. At the end
of each section, we also report the answer to the related RQs. All the results and plots can be
found in the replication package6.

4.1. Evolution results

   In order to answer RQ1, we checked the trend of PageRank and Severity values throughout the
versions of the projects. For every project and for both PageRank and Severity, we ran the
Mann-Kendall test. Tables 2 and 3 show the outcome of the test, reporting the Trend (increasing +
or decreasing -), the P-value and, for PageRank, the Reference AS (the type of smell which the
PageRank refers to) or, for Severity, the Granularity (class or package). The tables report only
results where p-value < 0.05, i.e., there is a trend.

Table 2
Mann-Kendall results - PageRank

    Project      Trend    P-value     Reference AS
    Ant          +        0.009867    CD-package
    Azureus      +        2.77E-05    CD-class
    Azureus      +        0           CD-package
    Azureus      +        3.81E-06    HL-class
    Azureus      +        0           HL-package
    Azureus      +        0           UD-package
    Hibernate    +        0.030929    CD-class
    Hibernate    +        0           CD-package
    Hibernate    +        0.000677    HL-class
    Hibernate    +        0           HL-package
    Hibernate    +        2.38E-07    UD-package
    JGraph       +        0.001375    HL-class

Table 3
Mann-Kendall results - Severity

    Project      Trend    P-value     Granularity
    Severity - CD
    Azureus      +        0.024848    class
    Hibernate    +        0.000291    class
    Jstock       -        0.025486    package
    Jung         -        0.039728    class
    Lucene       -        3.25E-06    class
    Severity - HL
    Jstock       +        0.002832    package
    Lucene       +        0.000422    class
    Weka         +        0.002132    class
    Weka         +        0.005923    package

We outline from Tables 2 and 3 the following remarks:
      • PageRank and Severity show a trend over time in few projects. We found a PageRank trend in
        four out of ten projects, while Severity showed a trend in five projects. The tables only
        show the projects with a positive or negative trend.
      • Concerning the Severity of CDs, we observed both positive and negative trends at class
        level, in 4 projects, and a negative trend at package level, in one project.
      • Concerning the Severity of HLs, we had examples of positive trends at both class and
        package level.
      • The Severity metric of the Unstable Dependency smell does not show a trend in any project,
        and we could notice only one project (Hibernate) where the PageRank of UD smells had a
        trend.

   We extended our analysis to see if the project size (measured by number of classes and packages)
is correlated with the values of PageRank and Severity. We tested it for each project over its
development evolution. We then analyzed the distribution of the correlation on the data of all
projects. The first thing we noticed is that the number of classes and packages increases over
time. However, this does not happen for Severity and PageRank values: we do not find a significant
correlation between size and the metrics, except for the correlation between PageRank computed on
AS on packages and the number of packages in the system. The correlation values, computed for all
the projects, range in [0.34, 0.89], with a median of 0.74. We hypothesise that the correlation is
high for PageRank because of how it is computed: the higher the number of packages, the more the
dependencies and the higher the PageRank values. For this reason, one may say that this should be
true also for PageRank computed on classes correlated with the number of classes: instead, their
correlation values range in [−0.87, 0.9], with a median of 0.45. This result may be due to the high
variance in the number of classes among the projects (a variance which is smaller for packages).

  RQ1 Answer (How do PageRank and Severity of the smells evolve in the version history of a
  project?): in general, we found that the average values of PageRank and Severity do not have a
  trend (neither positive nor negative) over time. Concerning the comparison with the evolution of
  the projects' size, we found that PageRank computed on packages shows a positive correlation with
  the evolution of the number of packages: this is reasonable, since the increase/decrease in the
  number of packages also has an impact on the creation/deletion of package dependencies, and thus
  on PageRank.

    6 https://figshare.com/articles/dataset/_/13636472
Table 4
Severity and PageRank correlation (last version only)

                         Project      Version     Spearman         P-value    Kendall     P-value
                         Ant          1.10.7             0.582     < 0.001        0.46    < 0.001
                         Azureus      4.8.1.2            0.871     < 0.001       0.704    < 0.001
                         FreeCol      0.10.7             0.809     < 0.001        0.64    < 0.001
                         Hibernate    4.2.2              0.719     < 0.001       0.573    < 0.001
                         JMeter       5.2.1              0.575     < 0.001       0.455    < 0.001
                         JGraph       5.13.0.0           0.664     < 0.001       0.581    < 0.001
                         Jstock       1.0.6w             0.621     < 0.001       0.494    < 0.001
                         Jung         1.7.6              0.643     < 0.001       0.506    < 0.001
                         Lucene       4.3.0              0.411     < 0.001        0.33    < 0.001
                         Weka         3.7.9               0.53     < 0.001       0.428    < 0.001
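
The coefficients reported in Table 4 are rank correlations. As a rough illustration of how such
values are obtained from paired Severity and PageRank measurements, the sketch below computes
Spearman's rho (Pearson correlation of average ranks) and Kendall's tau. It is not the R code used
in the study; the choice of the tie-adjusted tau-b variant, the function names and the sample
values are assumptions, and p-values are not computed.

```python
import math


def _average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return _pearson(_average_ranks(x), _average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: concordant minus discordant pairs, adjusted for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: ignored
            elif dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Hypothetical Severity and PageRank values of five smell instances.
severity = [2.0, 3.5, 3.5, 6.0, 8.0]
pagerank_values = [0.01, 0.02, 0.05, 0.04, 0.09]
print(spearman_rho(severity, pagerank_values))
print(kendall_tau_b(severity, pagerank_values))
```

Both coefficients depend only on the ordering of the values, which is why they are suitable when
the data do not follow a normal distribution.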


4.2. Correlation results

  In order to answer RQ2, we report in Table 4 the results of the correlation between Severity and
PageRank, evaluated on all AS, not considering their type. As can be seen, the majority of the
projects presented a strong positive correlation (ρ > 0.6).
  In the following, we discuss the correlation results by considering the different types of AS.
The coefficient values are bounded between:
     • (CDs) 0.427 and 0.942 with Spearman's and between 0.214 and 0.812 with Kendall's;
     • (UDs) 0.253 and 1 with Spearman's and between 0 and 1 with Kendall's;
     • (HLs) -1 and 1 for both coefficients.
Due to their low occurrences, the metrics of HL and UD usually present a strong correlation.
However, there are cases in some project versions where the scarce number of detected smells makes
this calculation misleading: the correlations fluctuate, being very high in some cases and very low
in others.
   On the other hand, CD is the most common smell in the dataset and this has an effect on the
correlation values: they vary largely in the dataset, making CD the smell type with some of the
highest correlation values and at the same time the smell with some of the lowest correlation
values. However, a clear result is that for all projects the correlation at package level between
PageRank and Severity of CD is strong, with the exception of JGraph (see the following paragraph).

Observations on weak and negative correlations
From Table 4 we can observe that some projects, such as JMeter, Lucene, Weka and Ant, show a weak
correlation between the two metrics. We aim to investigate these behaviours and we start by
analyzing two projects: JMeter, having a weak correlation, and JGraph, showing non-positive
correlation values for CDs at package level. We focus on the last version of both projects because
it is associated with the most updated codebase, hence we assume it is the most exemplary for them.
   By analyzing the correlation coefficients of JMeter's AS, we noticed that when they are
calculated separately for each AS type, they present higher values than the ones reported in
Table 4. Using Spearman's as an example: 0.575 is the ρ value obtained without considering the AS
type, and 0.638, 0.9 and 0.881 are the values for CDs, HLs and UDs respectively. The values seem to
imply that, while the correlation in general is weak for this project, when we look at the specific
smell types the two metrics tend to be positively correlated. However, the number of HLs and UDs in
JMeter is very small compared to the number of CDs. Since correlations computed on few observations
are not significant, we can conclude that only the correlation value computed on CDs is relevant
for JMeter, and it explains why the overall correlation value is weak for this project.
   If we closely analyze the evolution of JGraph, initially it shows a negative correlation for CDs
at package level, which progressively increases (0.2 in version 5.10.0.1) and becomes strongly
positive (0.73) in version 5.12.1.0. We further investigated what caused these changes in the
correlation values. In the first versions with negative correlation we observed 3 CDs at package
level, two of them with similar Severity and PageRank values and one with a much higher PageRank
value, probably the cause of the negative correlation. After version 5.10.0.1 we noticed the
presence of a 4th one. Its Severity was in line with the others and so was its PageRank: this
likely balanced the PageRank values and subsequently caused the increase of the positive
correlation.
   Hence we can conclude that the variations in the correlation values from negative to positive
were due to the introduction of a new smell instance, whose metric values strongly impacted the
correlation values due to, as for JMeter, the generally small number of smell instances. However,
this specific case does not represent a common behaviour in our dataset.
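
To see why correlations computed on very few smell instances are fragile, consider the following
toy calculation. The Severity and PageRank values are made up for the example and do not come from
JGraph or from any project in the dataset; the helper `spearman_no_ties` is hypothetical and uses
the textbook formula that is valid only when there are no tied values.

```python
def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x), start=1)}
    rank_y = {v: r for r, v in enumerate(sorted(y), start=1)}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Three made-up smells: two similar ones and one with a much higher PageRank.
severity = [5.0, 5.2, 3.0]
pagerank_values = [0.10, 0.11, 0.40]
rho_three = spearman_no_ties(severity, pagerank_values)

# A fourth made-up smell whose values are in line with the two similar ones.
rho_four = spearman_no_ties(severity + [5.5], pagerank_values + [0.12])

print(rho_three, rho_four)
```

With these numbers the coefficient moves from -0.5 to about -0.2 after adding a single instance,
which illustrates how strongly one observation can shift the result when only three or four smells
are available.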
  RQ2 Answer (Can we find some correlation between PageRank and Severity by considering each type
  of smell?): we found that the smell type showing the highest PageRank and Severity correlation is
  CD at package level. The other types, HL and UD, also showed strong correlations, but given the
  lower number of HL and UD instances, we consider the result regarding CDs more meaningful. We
  also investigated specific cases of projects with weak correlation and negative correlation, but
  we did not find further insights.

5. Discussion

   We found a strong correlation between PageRank and Severity. This means that, concerning the
analysed data and the considered smells, the criticality and the cost-solving of smells go hand in
hand: in the case of this study, if a smell affects an important (unimportant) part of the system,
then it will also have a high (low) cost-solving. We can outline two different interpretations of
the results. The positive correlation could be due to the nature of the two metrics, both bound to
the dependencies of the system. In this case, the conclusion would be that PageRank and Severity
capture the same characteristic of the smells, and one of the two is redundant. As a consequence,
in the ADI computation [11], only one of the two metrics should be used to evaluate AS criticality.
   However, given how the metrics are defined, they differ from one another. Severity takes into
account the dependencies which are directly affected by the smell, while PageRank also considers
dependencies outside the smell which converge towards the components affected by the smell. Take
for instance the Severity of CD, which is based on the dependencies forming the cycle and their
weight. If the components involved in the cycle have a high PageRank, it means that they are
involved in many dependencies with many other parts of the system, which is unlinked from the fact
that those components are part of the cycle. With such a premise, the two metrics would capture
different aspects of the smells, and their positive correlation could mean that critical parts of
the system attract AS which are more expensive to solve.
   Moreover, one could ask what the difference is between using PageRank and using simple coupling
metrics such as FanIn and FanOut [25]. When evaluating the coupling of a component, such metrics
take into account only the incoming or outgoing dependencies of the component itself. On the
contrary, the PageRank value of a component takes into account the PageRank of all the components
belonging to the dependency graph. In particular, the PageRank of a component is defined
recursively and depends on the number of dependencies and the PageRank metric of all the components
that reference it (incoming dependencies). In this way, a component having many incoming
dependencies but referenced by components with few incoming dependencies is less important than
another component with many incoming dependencies and referenced by other components with many
incoming dependencies. That is why PageRank is said to evaluate the importance of a component with
respect to the entire graph.
    Our analysis shows that the positive correlation is particularly evident in the case of CD.
There can be multiple reasons behind the high correlation for CD Severity: a part of code with high
PageRank is subject to more changes [33] than other parts of code, and thus more open to the
introduction of (structurally complex) CDs. This is interesting because in the literature we find
studies which confirm the correlation in the other direction [12], i.e., the presence of AS makes
the components more prone to change: if our hypothesis can be further corroborated, the conclusion
would be that the relationship between PageRank and CD Severity is like a dog chasing its tail, one
triggers the other. Another reason could be that components with high PageRank are involved in a
high number of dependencies, thus making it easier for a developer to wrongly introduce new
entangled dependencies and create cycles that are very difficult to remove.
    To conclude, there is a positive correlation between AS Severity and PageRank; however, at the
moment we cannot draw a definitive conclusion about how to interpret this finding. We plan to
conduct a validation of our results with developers from industry, who could evaluate the ability
of the two metrics to capture criticality and cost-solving, and also manually check the specific
cases where smells have high PageRank and high Severity.

6. Threats to validity

   Our study presents some threats to validity, which we address by following the structure
suggested by Yin [34]. Concerning the construct validity, the two metrics, PageRank and Severity,
may not measure what we claim they do, i.e., the criticality of the AS. However, this is a
preliminary study and the next step is to validate the current definition of the metrics with
developers, by letting them check whether the prioritization produced by the metrics is significant
or not. Other threats regarding the internal validity could be related to the choice of the
statistical methods used for the correlation analysis and their implementation in the tools used,
but we exploited very well known and widely used tools (R language). Moreover, we did not validate
the two metrics by investigating the perception of developers of PageRank and Severity. However,
PageRank was adopted in other studies as a software ranking metric [35][33][36], and we plan for
the future
to validate Severity in an industrial setting. Threats to external validity could be caused by the
fact that we only analyzed projects written in Java and publicly available. However, we partially
mitigate such issues by analyzing 10 projects with at least 22 versions each. Moreover, the high
number of CDs could have reduced the effect of the other types of detected AS in the results. We
could have mitigated this aspect by sampling the CD instances and thus balancing the dataset.
However, this would additionally reduce the size of the dataset, undermining the validity of the CD
results too. In the future, we aim to extend the study with additional data for the smells and
further mitigate this threat. Finally, concerning threats to the reliability of the study, Arcan
could be subject to a systematic bias in the detection, partially mitigated by the provided
replication package and the fact that the tool has been validated on open source and industrial
projects [23] [28] [1] [24]. Moreover, some threats could occur due to errors in the data
extraction and preparation phases, resulting in errors in the construction of the dataset. However,
we carefully checked every stage of the data preparation and relied on the support of Knime7.

7. Conclusion

   We performed an empirical analysis of two software metrics, Severity and PageRank, on at least
22 versions of each of 10 projects, in order to evaluate the cost-solving and criticality of AS. We
also performed this evaluation to better understand whether both metrics have to be used in the ADI
computation, i.e., whether they provide hints on the criticality of the AS that both have to be
taken into consideration. To conclude, from the analysis of the evolution and correlation of
PageRank and Severity we found that the two metrics tend to be correlated, except for some extreme
cases. It could be useful for developers to analyze the specific cases where AS have high PageRank
and low Severity (and vice-versa), since they could indicate smell instances which require a
tailored prioritization rationale: developers may be interested in identifying cases where the
smell is easy to solve (low Severity) but in an important part of the system (high PageRank), and
choose to refactor these cases first; on the contrary, they could decide not to refactor a smell
that is difficult to solve (high Severity) and in an unimportant (low PageRank) part of the system.
We can assert that such smells are a signal that both PageRank and Severity could be useful to
define different refactoring priorities, from different points of view. In particular, PageRank can
be used to identify parts of code which need continuous inspection, while Severity can be used to
evaluate the cost-solving for the AS removal.
   The smell type presenting the strongest correlation is CD, suggesting that highly critical
components (with high PageRank) attract CDs that are hard to solve (with high Severity). Thus,
developers should pay a lot of attention to the CD smell, also because CD is the most common AS and
in particular those at package level tend to become more critical in terms of PageRank in the
history of the project development. However, we do not exclude the possibility that the two metrics
have a strong correlation because they capture the same aspects of smells. In that case, we could
exploit this information to refine the computation of our ADI and leave out one of the two.
   In any case, we need to conduct a validation of both metrics and of the correlation results,
with expert developers or by comparing the ranking provided by the metrics with information coming
from issue trackers [12]. The intuition behind this is that a component affected by a critical
smell (with high PageRank and high Severity) should also be affected by many issues. In addition to
the validation, in future developments we aim to extend this work by analyzing more projects, also
coming from industry, and verify whether the same results can be confirmed.
   In this paper, we addressed the criticality evaluation of three AS, but the study can also be
extended to other kinds of AS, e.g., Scattered Functionality and Feature Concentration, two smells
which violate the separation of concerns principle. Given that such smells are not based on
dependency issues, we shall define additional criticality metrics for them.

    7 https://www.knime.com/knime-analytics-platform

References

 [1] A. Martini, F. Arcelli Fontana, A. Biaggi, R. Roveda, Identifying and prioritizing
     architectural debt through architectural smells: a case study in a large software company,
     in: Proc. of the European Conf. on Software Architecture (ECSA), Springer, 2018.
 [2] N. A. Ernst, S. Bellomo, I. Ozkaya, R. L. Nord, I. Gorton, Measure it? manage it? ignore it?
     software practitioners and technical debt, in: Proc. of the 2015 10th Joint Meeting on
     Foundations of Software Engineering, ESEC/FSE 2015, 2015.
 [3] S. Vidal, W. Oizumi, A. Garcia, A. Díaz Pace, C. Marcos, Ranking architecturally critical
     agglomerations of code smells, Science of Computer Programming 182 (2019) 64–85.
 [4] D. Taibi, A. Janes, V. Lenarduzzi, How developers perceive smells in source code: A replicated
     study, Information and Software Technology 92 (2017) 223–235.
 [5] F. Pecorelli, F. Palomba, F. Khomh, A. De Lucia, Developer-driven code smell prioritization,
     in: Proceedings of the 17th International Conference on Mining Software Repositories,
     MSR '20, ACM, 2020.
 [6] L. Rizzi, F. A. Fontana, R. Roveda, Support for architectural smell refactoring, in:
     Proceedings of the 2nd International Workshop on Refactoring, IWoR@ASE, 2018, pp. 7–10.
 [7] I. Pigazzini, F. A. Fontana, B. Walter, A study on correlations between architectural smells
     and design patterns, J. Syst. Softw. (2021).
 [8] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in: Seventh
     International World-Wide Web Conference, 1998.
 [9] I. Şora, A pagerank based recommender system for identifying key classes in software systems,
     in: 10th Jubilee International Symposium on Applied Computational Intelligence and
     Informatics, 2015.
[10] F. A. Fontana, I. Pigazzini, C. Raibulet, S. Basciano, R. Roveda, Pagerank and criticality of
     architectural smells, in: Proceedings of the 13th European Conference on Software
     Architecture, ECSA 2019, 2019.
[11] F. A. Fontana, P. Avgeriou, I. Pigazzini, R. Roveda, A study on architectural smells
     prediction, in: 2019 45th Euromicro Conference on Software Engineering and Advanced
     Applications (SEAA), IEEE, 2019.
[12] D. M. Le, D. Link, A. Shahbazian, N. Medvidovic, An empirical study of architectural decay in
     open-source software, in: 2018 IEEE International Conference on Software Architecture (ICSA),
     2018.
[13] F. A. Fontana, V. Lenarduzzi, R. Roveda, D. Taibi, Are architectural smells independent from
     code smells? an empirical study, Journal of Systems and Software 154 (2019) 139–156.
[14] T. Sharma, P. Singh, D. Spinellis, An empirical investigation on the relationship between
     design and architecture smells, Empirical Software Engineering (2020).
[15] S. Herold, An initial study on the association between architectural smells and degradation,
     in: Software Architecture, Springer International Publishing, Cham, 2020, pp. 193–201.
[16] S. A. Vidal, C. Marcos, J. A. Díaz-Pace, An approach to prioritize code smells for
     refactoring, Autom. Softw. Eng. 23 (2016) 501–532.
[17] A. Rani, J. K. Chhabra, Prioritization of smelly classes: A two phase approach (reducing
     refactoring efforts), in: 2017 3rd International Conference on Computational Intelligence
     Communication Technology (CICT), 2017.
[18] N. Sae-Lim, S. Hayashi, M. Saeki, Context-based approach to prioritize code smells for
     refactoring, Journal of Software: Evolution and Process (2017).
[19] F. A. Fontana, M. Zanoni, Code smell severity classification using machine learning
     techniques, Knowl. Based Syst. 128 (2017).
[20] T. Guggulothu, S. A. Moiz, An approach to suggest code smell order for refactoring, in:
     International Conference on Emerging Technologies in Computer Engineering, Springer, 2019,
     pp. 250–260.
[21] A. Oliveira, L. Sousa, W. Oizumi, A. Garcia, On the prioritization of design-relevant smelly
     elements: A mixed-method, multi-project study, in: Proceedings of the XIII Brazilian
     Symposium on Software Components, Architectures, and Reuse, SBCARS '19, Association for
     Computing Machinery, 2019.
[22] R. Terra, L. F. Miranda, M. T. Valente, R. S. Bigonha, Qualitas.class Corpus: A compiled
     version of the Qualitas Corpus, Software Engineering Notes 38 (2013).
[23] F. A. Fontana, I. Pigazzini, R. Roveda, M. Zanoni, Automatic detection of instability
     architectural smells, in: 2016 IEEE International Conference on Software Maintenance and
     Evolution, ICSME 2016, 2016.
[24] F. A. Fontana, F. Locatelli, I. Pigazzini, P. Mereghetti, An architectural smell evaluation in
     an industrial context, ICSEA 2020 (2020) 78.
[25] R. C. Martin, Object oriented design quality metrics: An analysis of dependencies, ROAD 2
     (1995).
[26] G. Suryanarayana, G. Samarthyam, T. Sharma, Refactoring for Software Design Smells, 1 ed.,
     Morgan Kaufmann, 2015.
[27] D. Sas, P. Avgeriou, F. A. Fontana, Investigating instability architectural smells evolution:
     An exploratory case study, in: Int. Conference on Software Maintenance and Evolution, ICSME,
     2019.
[28] F. Arcelli Fontana, I. Pigazzini, R. Roveda, D. A. Tamburri, M. Zanoni, E. D. Nitto, Arcan: A
     tool for architectural smells detection, in: Int'l Conf. Software Architecture (ICSA 2017)
     Workshops, 2017.
[29] M. B. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data,
     Biometrika 55 (1968) 1–17.
[30] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples),
     Biometrika 52 (1965) 591–611.
[31] C. Spearman, The proof and measurement of association between two things, The American
     Journal of Psychology 15 (1904) 72–101.
[32] M. Kendall, J. Gibbons, Rank Correlation Methods, Charles Griffin Book, E. Arnold, 1990.
[33] R. Wang, R. Huang, B. Qu, Network-based analysis of software change propagation, The
     Scientific World Journal 2014 (2014).
[34] R. Yin, Case Study Research: Design and Methods, Applied Social Research Methods, SAGE
     Publications, 2009.
[35] F. Perin, L. Renggli, J. Ressia, Ranking software artifacts, in: 4th Workshop on FAMIX and
     Moose in Reengineering (FAMOOSr 2010), volume 120, Citeseer, 2010.
[36] W.-f. Pan, B. Li, Y.-t. Ma, B. Jiang, Identifying the key packages using weighted pagerank
     algorithm, Acta Electronica Sinica 42 (2014) 2174.

</pre>