<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Elements to Layers using Centrality Measures and Machine Learning Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sanjay B Thakre</string-name>
          <contrib-id contrib-id-type="orcid">0000-0002-9647-6403</contrib-id>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arvind W Kiwelekar</string-name>
          <contrib-id contrib-id-type="orcid">0000-0002-3407-0221</contrib-id>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Engineering, Dr Babasaheb Ambedkar Technological University</institution>
          ,
          <addr-line>Lonere-Raigad, 402103</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The necessity of explicit architecture descriptions for communicating system functionalities and system maintenance activities has been continuously emphasized. This paper presents an approach to extract layering information, a form of architecture description, using centrality measures from Social Network Analysis theory and supervised machine learning algorithms. The layer recovery approach presented in this paper works in two phases. The first phase calculates centrality measures for each program element in an application. The second phase assumes that the application has been implemented around the layered architecture style and maps program elements to appropriate layers. Two techniques for mapping program elements to layers are presented. The first technique maps program elements to layers using pre-defined rules, while the second technique learns the mapping rules from a pre-labelled data set. The paper presents the evaluation of both approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Layered Architecture Style</kwd>
        <kwd>Architecture Descriptions</kwd>
        <kwd>Architecture Recovery</kwd>
        <kwd>Centrality Measures</kwd>
        <kwd>Module Dependency View</kwd>
        <kwd>Supervised Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The value of explicit software architecture has been increasingly recognized for software maintenance and evolution activities [1]. Especially, architecture descriptions in terms of high-level abstractions such as patterns, styles, and views have been found to be a valuable tool to communicate system functionalities effectively [2, 3]. Despite the numerous benefits, a legacy or open-source software system often lacks such architecture descriptions. Moreover, when such architecture descriptions are available, they are not aligned with the latest version [3].</p>
      <p>A lightweight architecture recovery approach that approximately represents a system decomposition may be more convenient than sophisticated architecture recovery techniques in such situations. Such a light approach shall quickly extract the relevant information necessary to build a system decomposition so that it can provide much-needed assistance to software architects dealing with re-engineering or modernization of existing systems, thus increasing their productivity.</p>
      <p>This paper presents an architecture recovery method to extract the layered decomposition of an implemented system. The method uses centrality measures from the theory of Social Network Analysis [4] to analyze the software structure formed by dependency relationships. Three observations drove the rationale behind using centrality measures for architecture extraction: (i) Most of these measures provide a highly intuitive and computationally simple way to analyze interactions when a graph represents the structure of a system. (ii) These measures quantify the structure of a system at multiple levels, i.e., at a particular node level, concerning other nodes in the graph, and at a group of nodes or communities. (iii) These measures support the development of data-driven approaches to architecture recovery. Hence such approaches can learn the rules of architecture recovery from given data.</p>
      <p>The approach recovers layering information in two phases. In the first phase, a centrality score is assigned to each program element. We assume that system functionalities are decomposed among multiple layers, and so in the second phase, a layer is assigned to each program element.</p>
      <p>The paper contributes to the existing knowledge base of the architecture recovery domain in the following ways. (1) It demonstrates the use of centrality measures to recover layering information. (2) It describes a data-driven technique using supervised classification algorithms. (3) It presents an evaluation of supervised classification algorithms to extract layering information. The rest of the paper is organized as follows: Section II defines various centrality measures. Section III describes the central element of the approach. The algorithmic and data-driven approaches to the problem of layer assignment are explained in Section IV. The evaluation of the approach is presented in Section V. Section VI puts our approach in the context of existing approaches. Finally, the paper concludes in Section VII.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Social Network Analysis Measures</title>
      <p>The theory of Social Network Analysis (SNA) provides a
generic framework to analyze the structure of complex
systems. It includes a rich set of measures, models, and
methods to extract the patterns of interactions among
systems’ elements. A complex system is expressed as
a network of nodes and edges to support systems from
diverse application domains.</p>
      <p>A few examples of complex systems that have been analyzed with the help of SNA include communities on social media platforms [5] and neural systems [6]. The techniques from SNA have been applied to control the influence of diseases [7], to understand biological systems [8], and to investigate protein interactions [9]. In these applications of SNA, a complex system is represented as a graph, which is then analyzed through measures such as centrality, scale-free, small world, and community structure [10, 11, 12, 13]. Some of these commonly used SNA measures relevant to our study are described below.</p>
      <p>The theory of SNA provides a range of measures at varying levels. Some are applied at the node level, while others are applied at the network level. The node-level measures are calculated from the nodes which are directly connected to a given node. The centrality measures are the node-level measures quantifying the importance of an individual node in the network. A central node is an influential node having significant potential to communicate and access information. There exist different centrality measures, and they are derived from the connections to a node, the position of a node in the network, and the relative importance of nodes.</p>
      <sec id="sec-3-1">
        <title>2.1. Degree centrality</title>
        <p>This measure determines a central node based on the connections to individual nodes. A node with a higher degree in a network is considered the most influential one. In a directed graph, two different centrality measures exist, in-degree and out-degree, based on the number of incoming and outgoing edges, respectively. The degree centrality C_D(v) of a node v is equal to the number of its connections, deg(v), normalized by the maximum possible degree of the node:</p>
        <p>C_D(v) = deg(v) / (N − 1)</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Closeness centrality</title>
        <p>This measure identifies an influential node in terms of a faster and broader spread of information in a network. The influential nodes are characterized by a smaller inter-node distance, which signifies a faster transfer of information. The closeness centrality is derived from the average distance from a node to all the connected nodes at different depths. However, the distance between disconnected components of the network is infinite, and hence it is excluded. For a central node, the average distance is small, and the closeness is calculated as the inverse of the sum of the distances d(v, i) to all other nodes:</p>
        <p>C_C(v) = 1 / Σ_i d(v, i)</p>
        <p>The normalized closeness, C'_C(v) = (N − 1) / Σ_i d(v, i), lies in the range from 0 to 1, where 0 represents an isolated node and 1 indicates a strongly connected node.</p>
      </sec>
      <sec id="sec-3-2a">
        <title>2.3. Betweenness centrality</title>
        <p>This measure identifies those central nodes which are responsible for connecting two or more components of the network. Removal of such a central node would mean a disconnection of the complete network; hence, these nodes act as a bridge to pass the information [12, 14]. Betweenness centrality C_B(v) is defined as the number of shortest paths passing through a node v:</p>
        <p>C_B(v) = Σ_{s ≠ v ≠ t} σ_st(v) / σ_st</p>
        <p>C'_B(v) = 2 C_B(v) / (N² − 3N + 2)</p>
        <p>where σ_st is the total number of shortest paths from a node s to a node t, and σ_st(v) is the number of those paths that pass through v. The relative betweenness centrality C'_B(v) normalizes C_B(v) by the maximum possible centrality of a node.</p>
      </sec>
      <sec id="sec-3-3">
        <title>2.4. Eigenvector centrality</title>
        <p>The Eigenvector centrality is a relative centrality measure, unlike the previous three measures, which are absolute. The Eigenvector centrality calculation depends on the largest real Eigenvalue of the symmetric adjacency matrix A. The centrality of a node v is proportional to the sum of the centralities of the nodes connected to it [15, 12]:</p>
        <p>x_v = (1/λ) Σ_u A_vu x_u</p>
        <p>In general, it requires the solution of the equation λx = Ax, where A is the adjacency matrix.</p>
      </sec>
    </sec>
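    <p>The centrality measures above can be made concrete with a short, self-contained sketch. The following Python fragment is a hypothetical illustration, not the authors' tooling: it computes normalized in-degree, out-degree, and closeness centrality for a toy three-class dependency network; betweenness and eigenvector centrality follow their formulas analogously.</p>

```python
from collections import deque

def degree_centrality(adj, n):
    """In- and out-degree centrality: number of incoming/outgoing
    edges of a node, normalised by the maximum possible degree (N - 1)."""
    out_deg = {v: len(nbrs) for v, nbrs in adj.items()}
    in_deg = {v: 0 for v in adj}
    for nbrs in adj.values():
        for u in nbrs:
            in_deg[u] += 1
    return ({v: d / (n - 1) for v, d in in_deg.items()},
            {v: d / (n - 1) for v, d in out_deg.items()})

def closeness_centrality(adj):
    """Normalised closeness: (reachable - 1) / sum of shortest-path
    distances, with unreachable nodes excluded as in the paper."""
    scores = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # BFS yields hop distances
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total = sum(dist.values())
        scores[s] = (len(dist) - 1) / total if total else 0.0
    return scores

# A toy dependency network: Gui uses Service, Service uses Dao.
deps = {"Gui": ["Service"], "Service": ["Dao"], "Dao": []}
in_c, out_c = degree_centrality(deps, len(deps))
```

    <p>On this toy network, Dao receives the highest in-degree centrality (it is the most used class) and a closeness of 0, matching the intuition behind the rules in Section 4.</p>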
    <sec id="sec-4">
      <title>3. Approach</title>
      <p>The broad objective of the approach is to extract high-level layering information, a form of architecture description, from implementation artefacts so that software architects can perform analyses specific to the layered architecture style. In this paper, we demonstrate our approach with the help of implementation artefacts available in a Java-based system. The method assumes that a system under study is implemented around the layered architecture style. A software maintenance engineer can verify such assumptions from the earlier documentation of the system when available. As shown in Figure 2, the approach consists of the following two phases.</p>
      <p>1. Dependency Network Builder and Analysis: This phase retrieves the dependencies present among implementation artefacts. For a Java-based system, this phase takes Java or Jar files as input and generates a dependency network. Programming elements such as Classes, Interfaces, and Packages are the nodes of the network, and Java dependency relationships are the edges. The output of this stage is represented as a graph in the Graph Modeling Language (GML). In the second stage, a centrality score is assigned to each node. The centrality score includes the different measures described in Section 2, and they are calculated at the Class and Interface levels. The output of this stage is a data file in CSV format describing the centrality score assigned to each program element.</p>
      <p>2. Layer Assignment: The layer assignment activity aims to assign the most appropriate layer to a program element. Additional style-specific analyses, such as analysis of layer violations and performance modeling, can be supported once the programming elements are assigned to appropriate layers.</p>
      <p>Building a dependency network and calculating centrality scores are straightforward activities to realize when tools such as JDependency (https://sourceforge.net/projects/javadepend/) are available. However, assigning program elements to layers is a trickier issue considering the large number of program elements.</p>
    </sec>
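    <p>The hand-off between the two phases can be sketched in a few lines. The fragment below is an assumption-laden stand-in for the actual JDependency-based tooling: it takes a class-level edge list (rather than Java or Jar files) and emits the kind of per-element centrality CSV that the layer-assignment phase consumes. The column names are illustrative, not the tool's actual schema, and only the two degree measures are shown.</p>

```python
import csv
import io

def centrality_report(edges):
    """One CSV row per program element with its centrality scores,
    mirroring the CSV hand-off described in Section 3 (illustrative
    columns only; not the authors' actual output format)."""
    nodes = sorted({v for edge in edges for v in edge})
    n = len(nodes)
    in_deg = dict.fromkeys(nodes, 0)
    out_deg = dict.fromkeys(nodes, 0)
    for src, dst in edges:        # each edge is a class-level dependency
        out_deg[src] += 1
        in_deg[dst] += 1
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["element", "in_degree", "out_degree"])
    for v in nodes:
        writer.writerow([v, in_deg[v] / (n - 1), out_deg[v] / (n - 1)])
    return buf.getvalue()

report = centrality_report([("Gui", "Service"), ("Service", "Dao")])
```

    <p>In practice the first phase would parse a GML graph emitted by the dependency extractor instead of taking a hard-coded edge list.</p>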
    <sec id="sec-5">
      <title>4. Layer Assignment</title>
      <p>The objective of the layer assignment stage is to identify the most appropriate layer based on the centrality measures. Here, we use the term layer in a loose sense: a layer is a logical coarse-grained unit that decomposes system functionalities and encapsulates program elements according to common system functionalities. It is not the term layer in the strict sense used in the Layered architecture style [16].</p>
      <p>We assume a three-layer decomposition. The decision to decompose all the system responsibilities into three layers is based on the observation that the functionalities of most architectural styles can be cleanly decomposed into three coarse-grained units. For example, architectural styles such as Model-View-Controller (MVC) and Presentation-Abstraction-Control (PAC) [16], and the 3-tier architectural pattern, have three units of system decomposition.</p>
      <p>Two different techniques based on centrality measures are developed to assign program elements to layers. The first technique uses a set of pre-defined rules. The second technique automatically learns the assignment rules from pre-labelled layer assignments using a supervised classification algorithm.</p>
      <sec id="sec-5-1">
        <title>4.1. Rule-Driven Layer Assignment</title>
        <p>This technique uses centrality measures, described in Section 2, to map program elements to the most appropriate layer. The measure of degree centrality from Section 2.1 is further divided into in-degree and out-degree, which count the number of incoming and outgoing edges of a node.</p>
        <p>In total, five measures of centrality are used. Five accessor functions are defined to get the values of the in-degree centrality, out-degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality associated with a specific node.</p>
        <p>The three layers are labelled upper, middle, and lower. Table 1 describes the relative significance of the various centrality measures with reference to the upper, middle, and lower layers. A set of configuration parameters, as shown in Table 2, is defined. These parameters provide flexibility while mapping program elements to a specific layer.</p>
        <p>Algorithm 1 operates on a dependency network in which nodes represent program elements and edges represent dependencies. The objective of this algorithm is to partition the node space into three segments corresponding to the lower, middle, and upper layers. This objective is achieved in two stages.</p>
        <p>First, the algorithm calculates two diferent partitions.</p>
        <p>The first partition, i.e. inPartition, is calculated using the in-degree centrality measure, while the second partition, outPartition, is calculated using the out-degree centrality measure.</p>
        <p>Algorithm 2 refineLabel(inPartition, outPartition, n). Input: inPartition[1:n], outPartition[1:n]: Vector; n: Integer. Output: nodeLabels[1:n]: Vector.</p>
        <sec id="sec-5-1-1">
          <title>Rationale</title>
          <p>Table 1 summarises the rationale behind the mapping rules: (i) Classes with an in-degree value equal to 0 are placed in the upper layer. (ii) Classes with a high out-degree are placed in the top layer because they use services from the layers beneath them. (iii) Classes with a high closeness value are placed in the upper layer because of the large average distance from the top layer to the bottom layer. (iv) Classes with a high betweenness value are placed in the middle layer, as they fall on the path from the top layer to the bottom layer. (v) Classes with a high in-degree value are placed in the bottom layer because they are highly used. (vi) Classes with an out-degree value equal to zero are placed in the bottom layer because they only provide services. (vii) Classes with an eigenvector value equal to 1 are placed in the bottom layer because they are highly reused. (viii) Classes whose in-degree and out-degree values are both equal to 0 are placed in the bottom layer because they are isolated classes.</p>
        </sec>
        <sec id="sec-5-1-5">
          <title>Configuration and Conflict Resolution</title>
          <p>The configuration parameters need to be suitably initialized for the correct functioning of the rule-driven approach discussed in the previous section. The system architect responsible for architecture recovery needs to fine-tune the parameters to get layering at the desired level of abstraction. To overcome this drawback, a data-driven approach is developed to assign labels to the programming elements.</p>
          <p>After the execution of Algorithm 1, each node is labeled with two labels corresponding to layers. The various combinations of labels include (lower, lower), (middle, middle), (top, top), (middle, top), and (middle, lower). Of these, the labels (middle, top) and (middle, lower) are conflicting because two different labels are assigned to the same node. This conflict needs to be resolved.</p>
          <p>Algorithm 2 resolves the conflicting labels and assigns a unique label to each node. The conflicting labels are resolved using the rules described in decision Table 3; the function called in Algorithm 2 applies these rules. The rules in Table 3 resolve the conflicting assignments using the closeness, betweenness, and eigenvector centrality measures, while the primary layer assignment by Algorithm 1 is done with the in-degree and out-degree centrality measures.</p>
          <p>When Algorithm 2 is executed, some of the nodes from the middle layer bubble up to the upper layer, and some nodes fall to the lower layer. Some nodes remain at the middle layer. The vector nodeLabels holds the unique label of each node in the dependency network after resolving all conflicts.</p>
        </sec>
      </sec>
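      <p>The rule-driven mapping of Algorithms 1 and 2 can be restated compactly. The sketch below is a hedged paraphrase of the Table 1 rules, not the authors' implementation: the single threshold hi stands in for the paper's several configuration parameters, and the exact tie-breaking order is an assumption.</p>

```python
def assign_layer(in_deg, out_deg, betweenness, hi=0.5):
    """Hedged restatement of the Table 1 / Algorithm 1-2 rules.
    `hi` stands in for the paper's tunable configuration parameters;
    thresholds and tie-breaking order are assumptions."""
    if in_deg == 0 and out_deg == 0:
        return "lower"                   # isolated classes sink to the bottom
    # Algorithm 1: two partitions, one per degree measure
    in_label = "upper" if in_deg == 0 else ("lower" if in_deg >= hi else "middle")
    out_label = "lower" if out_deg == 0 else ("upper" if out_deg >= hi else "middle")
    if in_label == out_label:
        return in_label                  # agreeing labels need no refinement
    # Algorithm 2 (refineLabel): resolve conflicting labels --
    # bridge-like nodes (high betweenness) stay in the middle layer
    if betweenness >= hi:
        return "middle"
    return "upper" if "upper" in (in_label, out_label) else "lower"
```

      <p>For example, a class that is never depended upon but uses many others (in-degree 0, high out-degree) lands in the upper layer, while a heavily used service class with no outgoing dependencies lands in the lower layer.</p>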
      <sec id="sec-5-2">
        <title>4.2. Supervised Classification based Layered Assignment</title>
        <p>In the data-driven approach, the problem of layer assignment is modeled as a multi-class classification problem with three labels, i.e. lower, middle, and upper, numerically encoded as 1, 2, and 3, respectively. The classification model is trained on a labeled data set. The data set, as shown in Table 4, includes program element identifiers, the values of all the centrality measures, and layering labels as specified by the system architect responsible for architecture recovery. The layering labels can be taken from the previous version of the system under study, or the labels can be guessed by the system architect to explore different alternatives for system decomposition.</p>
        <p>We implement three supervised classification algorithms, namely K-Nearest Neighbour, Support Vector Machine, and Decision Tree. These are machine learning algorithms particularly used for multi-class classification problems. A detailed comparison of these algorithms can be found in [17]. Python's Scikit-Learn [18] library is used to develop the classification models based on these algorithms. Table 4 shows the format of the sample dataset used to train the classification models. The developed models are evaluated against classification metrics such as accuracy, precision, recall, and F1-Score.</p>
        <p>Figure 3: An Architecture of a System Designed to Train the Approach.</p>
      </sec>
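      <p>As a self-contained stand-in for the Scikit-Learn models named above, the following minimal k-nearest-neighbour classifier shows how centrality feature vectors map to layer labels. The feature values and training rows are invented for illustration; the paper's actual feature set spans all five centrality measures.</p>

```python
from collections import Counter

def knn_layer(train, query, k=3):
    """Minimal k-NN over centrality feature vectors (an illustrative
    stand-in for the Scikit-Learn classifiers used in the paper).
    train: list of (feature_vector, layer_label) pairs."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda row: sq_dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: (in-degree centrality, out-degree centrality)
training = [
    ((0.0, 0.9), "upper"), ((0.1, 0.8), "upper"),
    ((0.4, 0.4), "middle"), ((0.5, 0.5), "middle"),
    ((0.9, 0.0), "lower"), ((0.8, 0.1), "lower"),
]
```

      <p>A query vector close to the upper-layer training rows is voted into the upper layer by its nearest neighbours, which is the essence of the data-driven assignment.</p>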
    </sec>
    <sec id="sec-6">
      <title>5. Model Development</title>
      <p>The machine learning model development includes
phases like training, testing and evaluation. This
section describes how these phases are carried out.</p>
      <sec id="sec-6-1">
        <title>5.1. Model Training</title>
        <p>The following software systems are used to train and
build the machine learning models.</p>
        <p>1. Training Architecture System: A small-scale sample architecture system, as shown in Figure 3, has been designed specially to train the approach. It includes 16 classes without the implementation of any functionalities; it contains only dependencies among classes. The classes named SEC, TX, and SERI in the figure represent crosscutting concerns.</p>
        <p>2. HealthWatcher: The HealthWatcher is a web-based application providing healthcare-related services [19]. This application is selected to train the model because existing literature in the public domain confirms that the application follows a client-server and layered architecture style. Initially, the first author manually assigned labels to all the programming elements. Later, the second author checked the labelling of individual elements. We used the following rules to label programming elements: (i) Programming elements that access low-level device functions and data access functions are labelled as lower. (ii) Programming elements accessing presentation functions are labelled as upper. (iii) Programming elements that provide business logic or depend on programming elements defined within the application are labelled as middle.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.3. Model Evaluation</title>
        <p>In total, one hundred fifty-one classes, i.e. data instances, are used to train the model: sixteen classes from the specially designed system and 135 classes from the HealthWatcher system.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.2. Model Testing</title>
        <p>We used ConStore [20], a small-scale Java-based library designed to manage concept networks, to test the model. ConStore is a framework for detailing out the concepts and creating a domain model for a given application. This application is selected to test the model because the second author was involved in recovering its architecture previously and knows the details of the system.</p>
        <p>The performance of classification models is typically evaluated against measures such as accuracy, precision, recall, and F1-Score [21]. These metrics are derived from a confusion matrix, which compares the count of the actual class labels for the observations in a given data set with the class labels predicted by a classification model. Table 5 shows the performance analysis against these metrics. The table compares the performance of the algorithmic-centric approach and the data-driven approach.</p>
        <p>5.3.1. Accuracy Analysis</p>
        <p>The accuracy is the rate of correct classification; the higher the value of accuracy, the better the model. From the accuracy point of view, one can observe from Table 5 that the data-driven approach performs better than the algorithmic-centric approach. The decision-tree-based classifier performs better on all the test cases, with an average accuracy of 74%. This is because the performance of the algorithmic approach depends on the proper tuning of various configuration parameters. The results shown in Table 5 are obtained with the configuration parameter values given in Table 6, which shows the combination of configuration values for the best accuracy obtained during twenty-five iterations. During each iteration, the values of the configuration parameters were incremented by 1 or 0.1 to adjust the model parameters for better accuracy.</p>
        <p>5.3.3. Threats to Validation</p>
        <p>The performance of the models has been evaluated in an academic setting with internal validation only. By internal validation, we mean that the performance of the algorithmic and data-driven techniques has been compared and analyzed. This is because our prime aim is to demonstrate the significance of social network analysis measures in recovering architecture descriptions. The models' performance needs to be further compared against similar approaches developed earlier [22]. Also, the usefulness of the recovered layering information needs to be assessed in the software industry setting.</p>
        <sec id="sec-6-3-1">
          <title>Table 6: Best configuration parameter values</title>
          <p>ConStore: 4, 1, 4, 1, 6, 0.8, 0.6. HealthWatcher: 10, 1, 2, 5, 9, 0.8, 0.5. Test Arch.: 2, 1, 2, 2, 6, 0.6, 0.6.</p>
        </sec>
        <sec id="sec-6-3-2">
          <title>5.3.2. Recall, Precision, F1-Score Analysis</title>
          <p>
Recall indicates the proportion of actual positives that are correctly identified, while precision is the proportion of positive identifications that are correct. High values of both recall and precision are desired, but it is not easy to achieve high values for both simultaneously; hence, the F1-score combines recall and precision into one metric. From the recall, precision, and F1-score point of view, one can observe from Table 5 that the decision-tree-based classifier performs better, with the highest F1-score of 0.86 for the upper-layer classification of the test architecture system. Recalling class labels with high precision for the middle layer is a challenging task for all the models described in this paper. This is because many functionalities are not cleanly encapsulated in the modules of the middle layer, and crosscutting concerns must be mapped to one of the three layers.</p>
        </sec>
      </sec>
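      <p>The metrics discussed in this section can be computed directly from the confusion-matrix counts. The sketch below derives accuracy and per-class precision, recall, and F1-score; the label vectors are invented for illustration, not taken from Table 5.</p>

```python
def classification_scores(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F1 for one class,
    derived from true/false positive and false negative counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical layer labels for four program elements
truth = ["upper", "middle", "lower", "upper"]
pred = ["upper", "upper", "lower", "upper"]
acc, prec, rec, f1 = classification_scores(truth, pred, "upper")
```

      <p>In this invented example, every actual upper-layer element is recalled (recall 1.0), but one middle-layer element is mislabelled upper, which lowers precision and hence the F1-score.</p>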
    </sec>
    <sec id="sec-7">
      <title>6. Related Work</title>
      <p>Recovering architecture descriptions from code has been one of the most widely and continuously explored problems by software architecture researchers. This has resulted in a large number of techniques [23, 24, 25], survey papers [22], and books [26] devoted to the topic. In the context of these earlier approaches, this section provides the rationale behind the implementation decisions taken while developing our approach.</p>
      <p>(i) Include Dependencies vs Symbolic
Dependencies: The recent study reported in [27] has recognized
that the quality of recovered architecture depends on the
type of dependencies analyzed to recover architecture.</p>
      <p>The study analyzes the impact of symbolic dependencies, i.e. dependencies at the program-identifier level, versus include dependencies, i.e. dependencies at the level of importing files or packages. Further, it emphasizes that symbolic dependencies are a more accurate way to recover structural information. The use of include dependencies is error-prone owing to the fact that a programmer may include a package without using it.</p>
      <p>We used include dependencies in our approach because extracting and managing include dependencies is simpler than handling symbolic dependencies. Further, we mitigated the risk of unused packages by excluding these relationships from further analysis. Many programming environments facilitate the removal of unused packages.</p>
      <p>One of our objectives was to develop a data-driven
approach and cleaning data in this way is an established
practice in the field of data engineering.</p>
      <p>(ii) Unsupervised Clustering vs Supervised
classification:</p>
      <p>The techniques of unsupervised clustering
have been adopted widely to extract high-level
architectures through the analysis of dependencies between
implementation artefacts [23]. These approaches use
hierarchical and search-based methods for clustering.</p>
      <p>These approaches usually take substantial search time to find not-so-good architectures [28]. One of the advantages of clustering methods is that unlabelled data sets drive these methods. However, the identified clusters of program elements need to be labelled with appropriate labels.</p>
      <p>Our choice of a supervised classification method is driven by the fact that centrality measures quantify the structural properties of a node and its relation to other nodes. Processing such quantified values efficiently is one of the advantages of many supervised classification methods. Further, assigning program elements layering labels is not an issue if such information is available from the previous version of the software, which may be the case for re-engineering or modernization projects. In the absence of such a labelled data set, the approach presented in the paper can still be adopted in two stages: a tentative layer labelling through the algorithmic approach, followed by labelling through the supervised classification method.</p>
      <p>(iii) Applications of Social Network Analysis (SNA) Measures: The interest in applying the theory of SNA has started growing in recent times. Some of the recent applications of SNA include predicting architectural smells [29] and predicting vulnerable software components [30] by measuring the structural similarity of program elements. Our approach characterizes each program element through its centrality measures, which can be termed a feature representation, using machine learning vocabulary, necessary to build data-driven models.</p>
    </sec>
    <sec id="sec-9">
      <title>7. Conclusion</title>
      <p>The main highlights of the approach presented in the
paper include: (i)The dependency graph formed by
programming elements (i.e. classes in Java) is treated as a
network, and centrality measures are applied to extract
structural properties. (ii) It represents each program
element as a set of values corresponding to different
centrality measures. Thus, each program element is represented
as a feature vector in machine learning terminology. (iii)
The paper treats a layer as a coarsely granular abstraction
encapsulating common system functionalities. Then it
maps a group of programming elements sharing common
structural properties to a layer. (iv) The paper describes
two mapping methods for this purpose, called
algorithm-centric and data-driven. (v) Overall, the data-driven
method illustrated in the paper performs better than
the algorithm-centric method.</p>
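      <p>Highlights (i) and (ii) can be sketched as follows. The class names and dependencies are hypothetical, and only normalized in-/out-degree centrality is shown; the paper's approach uses a richer set of centrality measures:</p>

```python
# A Java-style class dependency graph treated as a network, with simple
# degree centralities collected into a feature vector per program element.
deps = {                       # edge u -> v means "class u depends on class v"
    "LoginView":     ["AuthService"],
    "ReportView":    ["ReportService"],
    "AuthService":   ["UserDao"],
    "ReportService": ["UserDao", "ReportDao"],
    "UserDao":       [],
    "ReportDao":     [],
}

n = len(deps) - 1              # normalization constant for degree centrality
out_deg = {u: len(vs) / n for u, vs in deps.items()}
in_deg = {u: 0.0 for u in deps}
for vs in deps.values():
    for v in vs:
        in_deg[v] += 1.0 / n

# Feature vector per program element: (in-degree, out-degree) centrality.
features = {u: (in_deg[u], out_deg[u]) for u in deps}
print(features["UserDao"])     # called by both services, calls nothing
```

      <p>Elements with similar feature vectors (e.g. the two Dao classes, or the two View classes) share structural roles and are therefore candidates for the same layer.</p>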
      <p>The paper makes several assumptions, such as (i) the
availability of a Java-based system implementation, (ii) a
system decomposed into three layers, and (iii) the
availability of a pre-labelled data set for supervised
classification. These assumptions were made to simplify the
realization and demonstration of the approach; they do not
restrict it, and they can be relaxed.</p>
      <p>[8] J. S. Silva, A. M. Saraiva, A methodology for applying social network analysis metrics to biological interaction networks, in: Advances in Social Networks Analysis and Mining (ASONAM), 2015 IEEE/ACM International Conference on, IEEE, 2015, pp. 1300&#8211;1307.
[9] G. Amitai, A. Shemesh, E. Sitbon, M. Shklar, D. Netanely, I. Venger, S. Pietrokovski, Network analysis of protein structures identifies functional residues, Journal of Molecular Biology 344 (2004) 1135&#8211;1146. doi:10.1016/j.jmb.2004.10.055.
[10] M. E. J. Newman, Random graphs as models of networks, arXiv preprint cond-mat/0202208 (2002).
[11] M. E. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167&#8211;256.
[12] S. P. Borgatti, Centrality and network flow, Social Networks 27 (2005) 55&#8211;71.
[13] L. C. Freeman, Centrality in social networks conceptual clarification, Social Networks 1 (1979) 215&#8211;239.
[14] D. R. White, S. P. Borgatti, Betweenness centrality measures for directed graphs, Social Networks 16 (1994) 335&#8211;346.
[15] P. Bonacich, Some unique properties of eigenvector centrality, Social Networks 29 (2007) 555&#8211;564.
[16] F. Buschmann, K. Henney, D. C. Schmidt, Pattern-oriented software architecture, on patterns and pattern languages, volume 5, John Wiley &amp; Sons, 2007.
[17] C. A. U. Hassan, M. S. Khan, M. A. Shah, Comparison of machine learning algorithms in data classification, in: 2018 24th International Conference on Automation and Computing (ICAC), IEEE, 2018, pp. 1&#8211;6.
[18] J. Hao, T. K. Ho, Machine learning made easy: A review of scikit-learn package in Python programming language, Journal of Educational and Behavioral Statistics 44 (2019) 348&#8211;361.
[19] P. Greenwood, T. Bartolomei, E. Figueiredo, M. Dosea, A. Garcia, N. Cacho, C. Sant&#8217;Anna, S. Soares, P. Borba, U. Kulesza, et al., On the impact of aspectual decompositions on design stability: An empirical study, in: European Conference on Object-Oriented Programming, Springer, 2007, pp. 176&#8211;200.
[20] http://www.cse.iitb.ac.in/constore, 2009.
[21] C. Goutte, E. Gaussier, A probabilistic interpretation of precision, recall and F-score, with implication for evaluation, in: European Conference on Information Retrieval, Springer, 2005, pp. 345&#8211;359.
[22] J. Garcia, I. Ivkovic, N. Medvidovic, A comparative analysis of software architecture recovery techniques, in: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2013, pp. 486&#8211;496.
[23] O. Maqbool, H. Babri, Hierarchical clustering for software architecture recovery, IEEE Transactions on Software Engineering 33 (2007) 759&#8211;780.
[24] A. W. Kiwelekar, R. K. Joshi, An ontological framework for architecture model integration, in: Proceedings of the 4th International Workshop on Twin Peaks of Requirements and Architecture, 2014, pp. 24&#8211;27.
[25] A. W. Kiwelekar, R. K. Joshi, Ontological analysis for generating baseline architectural descriptions, in: European Conference on Software Architecture, Springer, 2010, pp. 417&#8211;424.
[26] A. Isazadeh, H. Izadkhah, I. Elgedawy, Source code modularization: theory and techniques, Springer, 2017.
[27] T. Lutellier, D. Chollak, J. Garcia, L. Tan, D. Rayside, N. Medvidovi&#263;, R. Kroeger, Measuring the impact of code dependencies on software architecture recovery techniques, IEEE Transactions on Software Engineering 44 (2017) 159&#8211;181.
[28] S. Mohammadi, H. Izadkhah, A new algorithm for software clustering considering the knowledge of dependency between artifacts in the source code, Information and Software Technology 105 (2019) 252&#8211;256.
[29] A. Tommasel, Applying social network analysis techniques to architectural smell prediction, in: 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), IEEE, 2019, pp. 254&#8211;261.
[30] V. H. Nguyen, L. M. S. Tran, Predicting vulnerable software components with dependency graphs, in: Proceedings of the 6th International Workshop on Security Measurements and Metrics, 2010, pp. 1&#8211;8.
[31] S. Sarkar, G. Maskeri, S. Ramachandran, Discovery of architectural layers and measurement of layering violations in source code, Journal of Systems and Software 82 (2009) 1891&#8211;1905.
[32] S. Sarkar, V. Kaulgud, Architecture reconstruction from code for business applications - a practical approach, in: 1st India Workshop on Reverse Engineering (IWRE), 2010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Link</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Behnamghader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Moazeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boehm</surname>
          </string-name>
          ,
          <article-title>The value of software architecture recovery for maintenance</article-title>
          ,
          <source>in: Proceedings of the 12th Innovations on Software Engineering Conference (formerly known as India Software Engineering Conference)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pacheco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marín-Raventós</surname>
          </string-name>
          , G. López,
          <article-title>Designing a technical debt visualization tool to improve stakeholder communication in the decision-making process: a case study</article-title>
          ,
          <source>in: International Conference on Research and Practical Issues of Enterprise Information Systems</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shahbazian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Brun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Medvidovic</surname>
          </string-name>
          ,
          <article-title>Recovering architectural design decisions</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Software Architecture (ICSA)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>9509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Carrington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Scott</surname>
          </string-name>
          , S. Wasserman,
          <article-title>Models and methods in social network analysis</article-title>
          , volume
          <volume>28</volume>
          , Cambridge university press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Thakare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Kiwelekar</surname>
          </string-name>
          ,
          <article-title>Skiplpa: An efficient label propagation algorithm for community detection in sparse network</article-title>
          ,
          <source>in: Proceedings of the 9th Annual ACM India Conference</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Barabási</surname>
          </string-name>
          ,
          <article-title>Statistical mechanics of complex networks</article-title>
          ,
          <source>Reviews of modern physics 74</source>
          (
          <year>2002</year>
          )
          <fpage>47</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Watts</surname>
          </string-name>
          ,
          <article-title>Networks, dynamics, and the small-world phenomenon</article-title>
          ,
          <source>American Journal of Sociology 105</source>
          (
          <year>1999</year>
          )
          <fpage>493</fpage>
          -
          <lpage>527</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>