Mapping Program Elements to Layers using Centrality
Measures and Machine Learning Techniques
Sanjay B Thakre, Arvind W Kiwelekar
Department of Computer Engineering, Dr Babasaheb Ambedkar Technological University, Lonere-Raigad, 402103, India


Abstract

The necessity of explicit architecture descriptions for communicating system functionalities and system maintenance activities has been continuously emphasized. This paper presents an approach to extract layering information, a form of architecture description, using centrality measures from Social Network Analysis theory and supervised machine learning algorithms. The layer recovery approach presented in this paper works in two phases. The first phase calculates centrality measures for each program element in an application. The second phase assumes that the application has been implemented around the layered architecture style and maps program elements to appropriate layers. Two techniques for mapping program elements to layers are presented. The first technique maps program elements to layers using pre-defined rules, while the second technique learns the mapping rules from a pre-labelled data set. The paper presents the evaluation of both approaches.

Keywords
Layered Architecture Style, Architecture Descriptions, Architecture Recovery, Centrality Measures, Module Dependency View, Supervised Classification.



1. Introduction

The value of explicit software architecture has been increasingly recognized for software maintenance and evolution activities [1]. Especially, architecture descriptions in terms of high-level abstractions such as patterns, styles and views have been found to be a valuable tool to communicate system functionalities effectively [2, 3]. Despite the numerous benefits, a legacy or open-source software system often lacks such architecture descriptions. Moreover, when such architecture descriptions are available, they are not aligned with the latest version [3].

A lightweight architecture recovery approach that approximately represents a system decomposition may be more convenient than sophisticated architecture recovery techniques in such situations. Such a lightweight approach should quickly extract the relevant information necessary to build a system decomposition, providing much-needed assistance to software architects dealing with re-engineering or modernization of existing systems and thus increasing their productivity.

Intending to design a lightweight approach, this paper presents an architecture recovery method to extract the layered decomposition of an implemented system. The method uses centrality measures from the theory of Social Network Analysis [4] to analyze the software structure formed by dependency relationships. Three observations drove the rationale behind using centrality measures for architecture extraction: (i) Most of these measures provide a highly intuitive and computationally simple way to analyze interactions when a graph represents the structure of a system. (ii) These measures quantify the structure of a system at multiple levels, i.e., at a particular node level, concerning other nodes in the graph, and at a group of nodes or communities. (iii) These measures support the development of data-driven approaches to architecture recovery; hence such approaches can learn the rules of architecture recovery from given data. The approach recovers layering information in two phases. In the first phase, a centrality score is assigned to each program element. We assume that system functionalities are decomposed among multiple layers, and so in the second phase, a layer is assigned to each program element.

The paper contributes to the existing knowledge base of the architecture recovery domain in the following ways. (1) It demonstrates the use of centrality measures to recover layering information. (2) It describes a data-driven approach to mapping program elements to layers using supervised classification algorithms. (3) It presents an evaluation of supervised classification algorithms to extract layering information. The rest of the paper is organized as follows: Section II defines various centrality measures. Section III describes the central element of the approach. The algorithmic and data-driven approaches to the problem of layer assignment are explained in Section IV. The evaluation of the approach is presented in Section V. Section VI puts our approach in the context of existing approaches. Finally, the paper concludes in Section VII.

ECSA2021 Companion Volume, Robert Heinrich, Raffaela Mirandola and Danny Weyns, Växjö, Sweden, 13–17 September, 2021
mail2sbt@gmail.com (S. B. Thakre); awk@dbatu.ac.in (A. W. Kiwelekar)
https://awk-net.group/ (A. W. Kiwelekar)
ORCID: 0000-0002-9647-6403 (S. B. Thakre); 0000-0002-3407-0221 (A. W. Kiwelekar)
© 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
2. Social Network Analysis Measures

The theory of Social Network Analysis (SNA) provides a generic framework to analyze the structure of complex systems. It includes a rich set of measures, models, and methods to extract the patterns of interactions among a system's elements. A complex system is expressed as a network of nodes and edges, which makes the framework applicable to systems from diverse application domains.

A few examples of complex systems that have been analyzed with the help of SNA include communities on social media platforms [5] and neural systems [6]. The techniques from SNA have been applied to control the spread of diseases [7], to understand biological systems [8], and to investigate protein interactions [9]. In these applications of SNA, a complex system is represented as a graph, which is then analyzed through measures such as centrality, scale-freeness, small-world properties, and community structure [10, 11, 12, 13]. Some of the commonly used SNA measures relevant to our study are described below.

SNA provides a range of measures at varying levels. Some are applied at the node level while others are applied at the network level. The node-level measures are calculated from the nodes which are directly connected to a given node. The centrality measures are node-level measures quantifying the importance of an individual node in the network. A central node is an influential node having significant potential to communicate and access information. Different centrality measures exist, and they are derived from the connections to a node, the position of a node in the network, and the relative importance of nodes.

2.1. Degree centrality

This measure determines a central node based on the connections to individual nodes. A node with a higher degree in a network is considered the most influential one. In a directed graph, two different centrality measures exist, in-degree and out-degree, based on the number of incoming and outgoing edges, respectively. The degree centrality, C_D(v), of a node v is equal to the number of its connections, deg(v), and is normalized (NC_D) by the maximum possible degree of the node:

    C_D(v) = deg(v)                                  (1)

    NC_D = C_D(v) / (n − 1) = deg(v) / (n − 1)       (2)

2.2. Closeness centrality

This measure identifies an influential node in terms of a faster and broader spread of information in a network. The influential nodes are characterized by a smaller inter-node distance, which signifies a faster transfer of information. The closeness centrality is derived from the average distance from a node to all the connected nodes at different depths. However, the distance between disconnected components of the network is infinite and hence is excluded. For a central node, the average distance is small; the centrality is calculated as the inverse of the sum of the distances d_vw to all other nodes. The normalized closeness (NC_C) is in the range from 0 to 1, where 0 represents an isolated node and 1 indicates a strongly connected node.

    C_C(v) = 1 / (Σ_w d_vw)                          (3)

    NC_C = (n − 1) / (Σ_w d_vw)                      (4)

2.3. Betweenness centrality

This measure identifies those central nodes which are responsible for connecting two or more components of the network. Removal of such a central node would mean a disconnection of the complete network. Hence, these nodes act as a bridge to pass information [12, 14]. Betweenness centrality, C_B(v), is defined as the number of shortest paths passing through a node v:

    C_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st                (5)

where σ_st is the total number of shortest paths from a node s to a node t, and σ_st(v) is the number of those paths that pass through v. The relative betweenness centrality, C'_B(v), of any node in a graph with respect to the maximum centrality of a node is calculated from C_B(v):

    C'_B(v) = 2 C_B(v) / (n² − 3n + 2)               (6)

2.4. Eigenvector centrality

The Eigenvector centrality is a relative centrality measure, unlike the previous three measures, which are absolute. The Eigenvector centrality calculation depends on the largest real Eigenvalue of the symmetric adjacency matrix. The centrality of a node v is proportional to the sum of the centralities of the nodes connected to it [15, 12]:

    λ v_i = Σ_{j=1}^{n} a_ij v_j                     (7)

In general, it requires the solution of the equation A v = λ v, where A is the adjacency matrix.
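The measures above can be computed directly with a graph library. The following Python sketch (using networkx; the graph and its edges are hypothetical, chosen only for illustration) mirrors Equations (1)–(7):

```python
import networkx as nx

# Hypothetical directed dependency graph, for illustration only.
G = nx.DiGraph([("A", "C"), ("B", "C"), ("C", "D"),
                ("D", "E"), ("D", "F"), ("E", "F")])
n = G.number_of_nodes()

# Equations (1)-(2): degree centrality and its normalized form.
norm_degree = {v: G.degree(v) / (n - 1) for v in G}

# Equations (3)-(6): networkx returns the normalized variants directly.
closeness = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Equation (7): eigenvector centrality, computed on the undirected view
# since the derivation assumes a symmetric adjacency matrix.
eigen = nx.eigenvector_centrality(G.to_undirected(), max_iter=1000)
```

Node C, which both A and B depend on, gets a normalized degree of 3/5 and a strictly positive betweenness, while the pure source nodes A and B have betweenness 0.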
Figure 1: An Example of Class Dependencies and their centrality scores.

                                              PID       in-degree   out-degree      𝐶𝐷 (𝑣)    𝐶𝐵 (v)   𝐶𝐶 (𝑣)        𝜆𝑣𝑖
                                             Class A        0            3            3         0       0.71          0
                                             Class B        0            1            1         0        0.5          0
                                             Class C        1            1            2         2        0.6       0.055
                                             Class D        2            2            4        2.5        1         0.27
                                             Class E        2            3            5        5.5       0.8       0.0055
                                             Class F        3            0            3         0         0           1
                                             Class G        2            0            2         0         0         0.99
                                             Layer 2        1            5            6         0         1          0.5
                                             Layer 1        4            5            9         1         1          0.5
                                             Layer 0        5            0            5         0         0           1



   Figure 1 shows the centrality scores of various pro-
gramming elements calculated by considering the depen-
dencies as shown in the figure. Here, it is to be noted
that centrality scores can be calculated at different granu-
larity levels, i.e., at the object, method, class, package, or
logical layer. In the figure, centrality scores are computed
at class and layer level. Here, we have considered a layer
as a logical encapsulation unit, loosely holding multiple
classes.
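The two granularity levels can be related programmatically: given a class-to-layer mapping, a layer-level dependency graph follows by contracting each layer's classes into a single node. A small sketch (the mapping and edges below are hypothetical, not the ones of Figure 1):

```python
import networkx as nx

# Hypothetical class-level dependencies and class-to-layer mapping.
classes = nx.DiGraph([("A", "D"), ("B", "D"), ("C", "E"),
                      ("D", "F"), ("E", "F")])
layer_of = {"A": "Layer2", "B": "Layer2", "C": "Layer2",
            "D": "Layer1", "E": "Layer1", "F": "Layer0"}

# Contract classes into layers: an edge Layer_i -> Layer_j exists when
# some class in Layer_i depends on some class in Layer_j.
layers = nx.DiGraph()
for u, v in classes.edges():
    if layer_of[u] != layer_of[v]:
        layers.add_edge(layer_of[u], layer_of[v])

# Centrality can now be computed on `layers` exactly as at class level.
layer_in_degree = dict(layers.in_degree())
```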


3. Approach

The broad objective of the approach is to extract high-level layering information, a form of architecture description, from implementation artefacts so that software architects can perform analyses specific to the layered architecture style. In this paper, we demonstrate our approach with the help of implementation artefacts available in a Java-based system. The method assumes that the system under study is implemented around the layered architecture style. A software maintenance engineer can verify such assumptions from the earlier documentation of the system when available. As shown in Figure 2, the approach consists of the following two phases.

Figure 2: Block diagram of a tool implemented in Java to discover layered architecture using centrality.

    1. Dependency Network Builder and Analysis: This phase retrieves the dependencies present among implementation artefacts. For a Java-based system, this phase takes Java or Jar files as input and generates a dependency network. Programming elements such as Classes, Interfaces, and Packages are the nodes of the network, and the Java relationships extends, implements, and imports are the edges in the dependency network. The output of this stage is represented as a graph in Graph Modeling Language (GML). In the second stage, a centrality score is assigned to each node. The centrality score includes the different measures described in Section 2, and they are calculated at the Class and Interface levels. The output of this stage is a data file in CSV format describing the centrality score assigned to each program element.

    2. Layer Assignment: The layer assignment activity aims to assign the most appropriate layer to each program element. Additional style-specific analyses, such as analysis of layer violations and performance modeling, can be supported once the programming elements are assigned to appropriate layers.

Building a dependency network and calculating centrality scores are straightforward activities to realize when tools such as JDependency1 are available. However, assigning program elements to layers is a trickier issue considering the large number of program elements and

1 https://sourceforge.net/projects/javadepend/
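The first phase can be approximated in a few lines. The sketch below (Python; a rough stand-in for the JDependency-based extractor, with a hypothetical regex over import statements and toy sources) builds the dependency network and serializes it to GML:

```python
import re
import networkx as nx

# Rough stand-in for the extraction phase: scan Java sources for import
# statements and add an edge from the importing class to the imported one.
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\s*;", re.MULTILINE)

def build_dependency_network(sources):
    """sources: dict mapping a fully qualified class name to its source text."""
    g = nx.DiGraph()
    for cls, text in sources.items():
        g.add_node(cls)
        for target in IMPORT_RE.findall(text):
            g.add_edge(cls, target)
    return g

sources = {  # toy input, not taken from any real system
    "app.ui.LoginView": "import app.service.AuthService;\nclass LoginView {}",
    "app.service.AuthService": "import app.db.UserDao;\nclass AuthService {}",
}
g = build_dependency_network(sources)
nx.write_gml(g, "dependencies.gml")  # the intermediate format used by the tool
```

A real extractor would also follow extends/implements relationships and resolve wildcard imports; this sketch covers only explicit single-class imports.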
relationships among them are present. We describe two techniques for this purpose in the following section.

Table 1: Relative significance of centrality measures with respect to layers

    Centrality      upper    middle    lower
    in-degree       low      -         high
    out-degree      high     -         low
    between-ness    low      high      low
    close-ness      high     high      low
    eigenvector     low      low       high

Table 2: Configuration parameters

    δ_il, δ_iu    Lower and upper bound for in-degree centrality values.
    δ_ol, δ_ou    Lower and upper bound for out-degree centrality values.
    δ_b           Critical value for between-ness centrality.
    δ_c           Critical value for close-ness centrality.
    δ_e           Critical value for eigenvector centrality.

Algorithm 1: primaryLabel(inDegree, outDegree, n)
Input: inDegree[1:n], outDegree[1:n]: Vector; n: Integer
Output: inPartition[1:n], outPartition[1:n]: Vector

 1: Initialize δ_iu, δ_il, δ_ou and δ_ol
 2: for node in 1 to n do
 3:   if in(node) = 0 and out(node) = 0 then
 4:     inPartition[node] ← lower
 5:     outPartition[node] ← lower
 6:   else
 7:     if in(node) > δ_iu then
 8:       inPartition[node] ← lower
 9:     else
10:       if in(node) < δ_il then
11:         inPartition[node] ← upper
12:       else
13:         inPartition[node] ← middle
14:       end if
15:     end if
16:     if out(node) > δ_ou then
17:       outPartition[node] ← upper
18:     else
19:       if out(node) < δ_ol then
20:         outPartition[node] ← lower
21:       else
22:         outPartition[node] ← middle
23:       end if
24:     end if
25:   end if
26: end for

4. Layer Assignment

The objective of the layer assignment stage is to identify the most appropriate layer based on the centrality measures. Here, we use the term layer in the loose sense that a layer is a logical coarse-grained unit for decomposing system functionalities and encapsulating program elements according to common system functionalities. It is not the term layer in the strict sense used in the Layered architecture style [16].

We assume a three-layer decomposition. The decision to decompose all the system responsibilities into three layers is based on the observation that most architectural styles' functionalities can be cleanly decomposed into three coarse-grained units. For example, the Model-View-Controller (MVC), Presentation-Abstraction-Control (PAC) [16], and 3-tier architectural patterns have three units of system decomposition.

Two different techniques based on centrality measures are developed to assign program elements to layers. The first technique uses a set of pre-defined rules. The second technique automatically learns the assignment rules from pre-labelled layer assignments using a supervised classification algorithm.

4.1. Rule-Driven Layer Assignment

This technique uses the centrality measures described in Section 2 to map program elements to the most appropriate layer. The measure of degree centrality from Section 2.1 is further divided into in-degree and out-degree, which count the number of incoming and outgoing edges of a node. In total, five measures of centrality are used. Five accessor functions, namely in, out, between, closeness and eigen, are defined to get the values of in-degree, out-degree, betweenness, closeness and eigenvector centrality associated with a specific node, respectively.

The three layers are labelled upper, middle and lower. Table 1 describes the relative significance of the various centrality measures with respect to the upper, middle and lower layers. A set of configuration parameters, as shown in Table 2, is defined. These parameters provide flexibility while mapping program elements to a specific layer.

Algorithm 1 operates on a dependency network in which nodes represent program elements and edges represent dependencies. The objective of this algorithm is to partition the node space into three segments corresponding to the lower, middle and upper layers. This objective is achieved in two stages.
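Algorithm 1 translates almost line-for-line into Python. In the sketch below the threshold values are hypothetical placeholders; in practice they correspond to the tuned configuration parameters of Table 2:

```python
# Transcription of Algorithm 1; threshold values are illustrative only.
def primary_label(in_deg, out_deg, d_il=1, d_iu=3, d_ol=1, d_ou=3):
    """Return (inPartition, outPartition) dicts mapping node -> label."""
    in_part, out_part = {}, {}
    for node in in_deg:
        if in_deg[node] == 0 and out_deg[node] == 0:
            in_part[node] = out_part[node] = "lower"  # isolated node
            continue
        # High in-degree -> heavily used -> lower; low -> upper (Table 1).
        if in_deg[node] > d_iu:
            in_part[node] = "lower"
        elif in_deg[node] < d_il:
            in_part[node] = "upper"
        else:
            in_part[node] = "middle"
        # High out-degree -> consumes many services -> upper; low -> lower.
        if out_deg[node] > d_ou:
            out_part[node] = "upper"
        elif out_deg[node] < d_ol:
            out_part[node] = "lower"
        else:
            out_part[node] = "middle"
    return in_part, out_part

in_part, out_part = primary_label({"A": 0, "F": 5}, {"A": 5, "F": 0})
# A (no incoming, many outgoing) -> (upper, upper); F -> (lower, lower)
```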
First, the algorithm calculates two different partitions. The first partition, inPartition, is calculated using the in-degree centrality measure, while the second partition, outPartition, is calculated using the out-degree centrality measure.

Algorithm 2: refineLabel(inPartition, outPartition, n)
Input: inPartition[1:n], outPartition[1:n]: Vector; n: Integer
Output: nodeLabels[1:n]: Vector

 1: Initialize δ_b, δ_c and δ_e
 2: for node in 1 to n do
 3:   if inPartition[node] = outPartition[node] then
 4:     nodeLabels[node] ← outPartition[node]
 5:   else
 6:     nodeLabels[node] ← upDown(inPartition[node], outPartition[node])
 7:   end if
 8: end for

After the execution of Algorithm 1, each node is labeled with two labels corresponding to layers. The combinations of labels include (lower, lower), (middle, middle), (upper, upper), (middle, upper), and (middle, lower). Of these, the labels (middle, upper) and (middle, lower) are conflicting because two different labels are assigned to a node. This conflict needs to be resolved.

Algorithm 2 resolves the conflicting labels and assigns a unique label to each node. The conflicting labels are resolved by using the rules described in the decision Table 3. The function upDown called in Algorithm 2 uses these rules. The rules in Table 3 resolve the conflicting assignments using the close-ness, between-ness, and eigenvector centrality measures, while the primary layer assignment by Algorithm 1 is done with the in-degree and out-degree centrality measures. When Algorithm 2 is executed, some of the nodes from the middle layer bubble up to the upper layer, and some nodes fall to the lower layer; some nodes remain at the middle layer. The vector nodeLabels holds the unique labelling of each node in the dependency network after resolving all conflicts.

Table 3: Decision table used to refine layering

    Layer    Measure     Significance    Rationale
    upper    in          0               Classes with in-degree value equal to 0 are placed in the upper layer.
    upper    out         high            Classes with high out-degree are placed in the top layer because they use services from the layers beneath them.
    upper    closeness   high            Classes with high closeness value are placed in the upper layer because of the large average distance from the top layer to the bottom layer.
    middle   between     high            Classes with high betweenness value are placed in the middle layer as they fall on the path from the top layer to the bottom layer.
    lower    in          high            Classes with high in-degree value are placed in the bottom layer because they are highly used.
    lower    out         0               Classes with out-degree value equal to zero are placed in the bottom layer because they only provide services.
    lower    eigen       1               Classes with eigenvector value equal to 1 are placed in the bottom layer because they are highly reused.
    lower    in, out     -               Classes with in-degree and out-degree values equal to 0 are placed in the bottom layer because they are isolated classes.

4.2. Supervised Classification based Layer Assignment

The configuration parameters need to be suitably initialized for the correct functioning of the rule-driven approach discussed in the previous section. The system architect responsible for architecture recovery needs to fine-tune the parameters to get layering at the desired level of abstraction. To overcome this drawback, a data-driven approach is developed to assign labels to the programming elements.

In the data-driven approach, the problem of layer assignment is modeled as a multi-class classification problem with three labels, lower, middle and upper, numerically encoded as 1, 2, and 3 respectively.
                                                                 The classification model is trained on the labeled data-
elements.
Table 4
Sample observations from the Datasets used for Supervised Learning

   Id   Label                            In-Degree   Out-Degree   Closeness   Betweenness   Eigenvector   Layer
                                  HealthWatcher
   1    ComplaintRecord                  1           10           1.714       19            0.0056        2
   2    ObjectAlreadyInsertedException   37          0            0           0             0.347         1
   3    ObjectNotFoundException          53          0            0           0             0.943         1
   4    ObjectNotValidException          41          0            0           0             0.883         1
   5    RepositoryException              60          0            0           0             1             1
                                  ConStore
   1    Cache                            2           1            1           0             0.0162        2
   2    CacheObject                      4           0            0           0             0.053         2
   3    LRUCache                         0           2            1           0             0             2
   4    MRUCache                         1           2            1           7             0.0246        2
   5    ItemQuery                        1           20           0.412       47.166        0.0388        2


set. The data set, as shown in Table 4, includes program element identifiers, the values of all the centrality measures, and layering labels as specified by the system architect responsible for architecture recovery. The layering labels can be taken from a previous version of the system under study, or guessed by the system architect to explore different alternatives for system decomposition.
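As an illustration, the feature vectors of Table 4 can be computed directly from a class-dependency graph. The sketch below uses networkx on a tiny made-up dependency graph (the class names are placeholders, not the paper's actual data); note that the eigenvector measure is computed here on the undirected view so that power iteration converges on this small acyclic example, a detail the paper does not pin down:

```python
import networkx as nx

# Toy class-dependency graph standing in for a parsed application;
# an edge A -> B means "class A depends on class B".
deps = nx.DiGraph([
    ("ComplaintRecord", "RepositoryException"),
    ("ComplaintRecord", "ObjectNotFoundException"),
    ("HealthUnit", "RepositoryException"),
])

closeness = nx.closeness_centrality(deps)
betweenness = nx.betweenness_centrality(deps, normalized=False)
# Undirected view chosen only so the power iteration converges on this
# acyclic toy graph (an assumption for the sketch, not the paper's choice).
eigenvector = nx.eigenvector_centrality(deps.to_undirected())

# One feature row per program element, mirroring the columns of Table 4.
dataset = [
    {
        "label": node,
        "in_degree": deps.in_degree(node),
        "out_degree": deps.out_degree(node),
        "closeness": closeness[node],
        "betweenness": betweenness[node],
        "eigenvector": eigenvector[node],
    }
    for node in deps.nodes
]

for row in dataset:
    print(row)
```

A heavily depended-upon class such as `RepositoryException` ends up with a high in-degree and zero out-degree, exactly the profile that the rule-based approach maps to the lower layer.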
   We implement three supervised classification algorithms, namely K-Nearest Neighbour, Support Vector Machine, and Decision Tree. These machine learning algorithms are particularly suited to multi-class classification problems. A detailed comparison of these algorithms can be found in [17]. Python's Scikit-Learn [18] library is used to develop classification models based on these algorithms. Table 4 shows the format of the sample dataset used to train the classification models. The developed models are evaluated against classification metrics such as accuracy, precision, recall, and F1-score.

Figure 3: An architecture of a system designed to train the approach
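For concreteness, the training step can be sketched with Scikit-Learn as follows. The feature matrix is a small synthetic stand-in for the 151-instance data set (the rows and labels are invented for illustration, not taken from Table 4):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Each row: [in-degree, out-degree, closeness, betweenness, eigenvector];
# labels: 1 = lower, 2 = middle, 3 = upper (invented illustrative values).
X = [
    [37, 0, 0.0, 0.0, 0.35],    # heavily depended-upon class
    [60, 0, 0.0, 0.0, 1.00],
    [1, 10, 1.71, 19.0, 0.01],  # mediating class
    [1, 20, 0.41, 47.2, 0.04],
    [0, 6, 0.90, 0.0, 0.02],    # pure consumer of services
    [0, 8, 0.80, 2.0, 0.03],
]
y = [1, 1, 2, 2, 3, 3]

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svm": SVC(),
    "tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # Classify an unseen class with a large in-degree and no outgoing deps.
    print(name, model.predict([[50, 0, 0.0, 0.0, 0.9]])[0])
```

On this toy data both the tree and the nearest-neighbour model place the heavily used class in the lower layer, which matches the intuition behind the rule-based assignment.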



5. Model Development

The machine learning model development includes phases like training, testing and evaluation. This section describes how these phases are carried out.

5.1. Model Training

The following software systems are used to train and build the machine learning models.

    1. Training Architecture System: A small-scale sample architecture system, as shown in Figure 3, has been designed specially to train the approach. It includes 16 classes without the implementation of any functionality; it contains only the dependencies among classes shown in Figure 3. The classes named SEC, TX, and SERI in the figure represent crosscutting concerns.

    2. HealthWatcher: HealthWatcher is a web-based application providing healthcare-related services [19]. This application is selected to train the model because existing literature in the public domain confirms that the application follows a client-server and layered architecture style. Initially, the first author manually assigned labels to all the programming elements. Later, the second author checked the labelling of individual elements. We used the following rules to label programming elements. These are: (i) Programming elements that access low-level device functions and data access functions are labelled as lower. (ii) Programming elements accessing pre-
Table 5
                        Accuracy and Confusion Matrix for Data-Driven and Algorithmic Approach

           SVM                            Decision Tree           KNN classifier                 Rule based
    HealthWatcher(Size: 135 classes or interfaces) Confusion Matrix
   Layer     lower mid-        up-      lower mid-       up-    lower mid-          up-       lower   mid-      up-
                      dle      per               dle     per             dle        per               dle       per
   lower     47       1        9        49       4       4      41       8          8         28      16        13
   middle    20       5        12       15       20      2      7        28         2         6       30        1
   upper     5        1        35       7        0       34     6        6          29        3       9         29
      Accuracy = 0.64                    Accuracy = 0.76         Accuracy = 0.72              Accuracy = 0.63
           Recall (R), Precision (P), F1-Score (F1) Evaluation
             R        P        F-1      R        P       F-1    R        P          F-1       R       P         F-1
   lower     0.65     0.82     0.73     0.69     0.86    0.77   0.76     0.72       0.74      0.76    0.49      0.60
   middle    0.71     0.14     0.23     0.83     0.54    0.66   0.67     0.76       0.71      0.55    0.81      0.65
   upper     0.62     0.85     0.72     0.85     0.83    0.84   0.74     0.71       0.72      0.66    0.66      0.66
       Training Architecture System (Size = 16 Classes) Confusion Matrix
   lower     5        2        0        4        3       0      5        2          0         5       2       0
   middle    1        4        0        0        5       0      1        4          0         1       4       0
   upper     1        0        3        0        1       3      4        0          0         1       0       3
       Accuracy = 0.75                    Accuracy = 0.75         Accuracy = 0.56               Accuracy = 0.75
           Recall (R), Precision (P), F1-Score (F1) Evaluation
             R        P        F-1      R        P       F-1    R        P          F-1       R       P         F-1
   lower     0.71     0.71     0.71     1.00     0.57    0.73   0.50     0.71       0.59      0.71    0.71      0.71
   middle    0.67     0.80     0.73     0.56     1.00    0.71   0.67     0.80       0.73      0.67    0.80      0.73
   upper     1.00     0.75     0.86     1.00     0.75    0.86   0.00     0.00       0.00      1.00    0.75      0.86
       Constore (Size: 66 classes or interfaces) Confusion Matrix
    lower     43       0        0        43       0       0      40       3          0         27      16      0
   middle    14       0        1        13       2       0      12       3          0         9       5       1
   upper     6        0        2        6        0       2      7        1          0         4       2       2
       Accuracy = 0.68                    Accuracy = 0.71         Accuracy = 0.65               Accuracy = 0.52
           Recall (R), Precision (P), F1-Score (F1) Evaluation
               R        P       F-1      R       P        F-1      R        P       F-1       R       P         F-1
   lower       0.68     1.00    0.81     0.69    1.00     0.82     0.68     0.93    0.78      0.68    0.63      0.65
   middle      0.00     0.00    0.00     1.00    0.13     0.24     0.43     0.20    0.27      0.22    0.33      0.26
    upper       0.67     0.25    0.36     1.00    0.25     0.40     0.00     0.00    0.00      0.67    0.25      0.36




sentation functions are labelled as upper. (iii) Programming elements that provide business logic or depend on programming elements defined within the application are labelled as middle.

   In total, one hundred fifty-one classes, i.e. data instances, are used to train the model: sixteen classes from the specially designed system and 135 classes from the HealthWatcher system.

5.2. Model Testing

We used ConStore [20], a small-scale Java-based library designed to manage concept networks, to test the model. This application was selected to test the model because the second author was involved in recovering its architecture previously and knows the details of the system. ConStore is a framework for detailing out the concepts and creating a domain model for a given application.

5.3. Model Evaluation

The performance of classification models is typically evaluated against measures such as accuracy, precision, recall, and F1-score [21]. These metrics are derived from a confusion matrix, which compares the actual class labels of the observations in a given data set with the class labels predicted by a classification model. Table 5 shows the performance analysis against these metrics. The table compares the performance of the algorithmic-centric approach and the data-driven approach.
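The entries of Table 5 follow mechanically from a confusion matrix. A minimal sketch with Scikit-Learn's metrics module is shown below, using invented true and predicted layer labels rather than the paper's data:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Invented ground-truth vs. predicted layers (1 = lower, 2 = middle, 3 = upper).
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 1, 3, 3, 1]

# Rows are true layers, columns are predicted layers, as in Table 5.
print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3]))
print("accuracy:", accuracy_score(y_true, y_pred))

# Per-layer precision, recall and F1, matching the R/P/F1 blocks of Table 5.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1, 2, 3])
for layer, pi, ri, fi in zip(("lower", "middle", "upper"), p, r, f1):
    print(f"{layer}: R={ri:.2f} P={pi:.2f} F1={fi:.2f}")
```

With these labels the accuracy is 6/9, and the lower layer illustrates the recall/precision tension discussed below: two of its three true members are recovered, but it also attracts misclassified middle- and upper-layer elements.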
5.3.1. Accuracy Analysis

Accuracy is the rate of correct classification: the higher the accuracy, the better the model. From the accuracy point of view, one can observe from Table 5 that the data-driven approach performs better than the algorithmic-centric approach. The decision-tree-based classifier performs best on all the test cases, with an average accuracy of 74%. This is because the performance of the algorithmic approach depends on the proper tuning of various configuration parameters. The results shown in Table 5 are obtained with the configuration parameter values listed in Table 6, which shows the combination of configuration values that gave the best accuracy over twenty-five iterations. During each iteration, the values of the configuration parameters were incremented by 1, or by 0.1 for 𝛿𝑐 and 𝛿𝑒.

Table 6
Configuration parameters and their values

            ConStore    HealthWatcher    Test Arch.
     𝛿𝑖𝑙        4             10               2
     𝛿𝑖𝑢        1              1               1
     𝛿𝑜𝑙        4              2               2
     𝛿𝑜𝑢        1              5               2
     𝛿𝑏         6              9               6
     𝛿𝑐        0.8            0.8             0.6
     𝛿𝑒        0.6            0.5             0.6

   The machine learning models automatically learn and adjust the model parameters for better accuracy. In the case of the algorithmic approach, configuration parameter tuning is an iterative process that requires trying different combinations.

5.3.2. Recall, Precision, F1-Score Analysis

Recall indicates the proportion of correctly identified true positives, while precision is the proportion of correct positive identifications. High values of both recall and precision are desired, but it is difficult to achieve high values of both simultaneously. Hence, the F1-score combines recall and precision into one metric. From the recall, precision, and F1-score point of view, one can observe from Table 5 that the decision-tree-based classifier performs better, with the highest F1-score of 0.86 for the upper-layer classification of the test architecture system. Recalling class labels with high precision for the middle layer is a challenging task for all the models described in this paper. This is because many functionalities are not cleanly encapsulated in the modules at the middle layer, and because crosscutting concerns must be mapped to one of the three layers.

5.3.3. Threats to Validation

The performance of the models has been evaluated in an academic setting with internal validation only. By internal validation, we mean that the performance of the algorithmic and data-driven techniques has been compared and analyzed; our prime aim is to demonstrate the significance of social network analysis measures in recovering architecture descriptions. The models' performance needs to be further compared against similar approaches developed earlier [22]. Also, the usefulness of the recovered layering information needs to be assessed in a software industry setting.

6. Related Work

Recovering architecture descriptions from code has been one of the most widely and continuously explored problems by software architecture researchers. This has resulted in a large number of techniques [23, 24, 25], survey papers [22] and books [26] devoted to the topic. In the context of these earlier approaches, this section provides the rationale behind the implementation decisions taken while developing our approach.
   (i) Include Dependencies vs Symbolic Dependencies: The recent study reported in [27] recognized that the quality of a recovered architecture depends on the type of dependencies analyzed to recover it. The study analyzes the impact of symbolic dependencies, i.e. dependencies at the program identifier level, versus include dependencies, i.e. dependencies at the level of importing files or including packages. Further, it emphasizes that symbolic dependencies are a more accurate way to recover structural information. The use of include dependencies is error-prone owing to the fact that a programmer may include a package without using it.
   We used include dependencies in our approach because extracting and managing include dependencies is simple compared to symbolic dependencies. Further, we mitigated the risk of unused packages by excluding these relationships from further analysis. Many programming environments facilitate the removal of unused packages. One of our objectives was to develop a data-driven approach, and cleaning data in this way is an established practice in the field of data engineering.
   (ii) Unsupervised Clustering vs Supervised Classification: The techniques of unsupervised clustering have been widely adopted to extract high-level architectures through the analysis of dependencies between implementation artefacts [23]. These approaches use hierarchical and search-based methods for clustering, and they usually take substantial search time to find not-so-good architectures [28]. One of the advantages of clustering methods is that unlabelled data sets drive these methods. But the identified clusters of
program elements need to be labelled with appropriate labels.
   Our choice of a supervised classification method is driven by the fact that centrality measures quantify the structural properties of a node and its relations to other nodes. Processing such quantified values efficiently is one of the advantages of many supervised classification methods. Further, assigning layering labels to program elements is not an issue if such information is available from a previous version of the software, which may be the case for re-engineering or modernization projects. In the absence of such a labelled data set, the approach presented in the paper can still be adopted in two stages: first, a tentative layer labelling is done through the algorithmic approach, followed by labelling through the supervised classification method.
   (iii) Applications of Social Network Analysis (SNA) Measures: Interest in applying the theory of SNA has been growing in recent times. Recent applications of SNA include predicting architectural smells [29] and predicting vulnerable software components [30] by measuring the structural similarity of program elements. Our approach characterizes each program element through its centrality measures, which, in machine learning vocabulary, can be termed the feature representation necessary to build data-driven models.

7. Conclusion

The main highlights of the approach presented in the paper include: (i) The dependency graph formed by programming elements (i.e. classes in Java) is treated as a network, and centrality measures are applied to extract structural properties. (ii) Each program element is represented as a set of values corresponding to the different centrality measures; thus, each program element is represented as a feature vector in machine learning terminology. (iii) The paper treats a layer as a coarsely granular abstraction encapsulating common system functionalities, and maps a group of programming elements sharing common structural properties to a layer. (iv) The paper describes two mapping methods for this purpose, called algorithmic-centric and data-driven. (v) Overall, the data-driven method illustrated in the paper performs better than the algorithmic-centric method.
   The paper makes several assumptions: (i) availability of a Java-based system implementation, (ii) decomposition of the system into three layers, and (iii) availability of a pre-labelled data set for supervised classification. These assumptions were made to simplify the realization and demonstration of the approach; they do not restrict it. The assumptions can be relaxed, and the approach is flexible enough to extend.
   The layering information extracted by the approach can be viewed as one way of decomposing a system. It is not the single ground-truth architecture, which is often difficult to agree upon and laborious to discover [22]. Further, the quality of the extracted architecture descriptions, i.e. the clustering of program elements into layers, needs to be assessed for properties such as minimal layering violations [31, 32], the satisfaction of a particular quality attribute [26], or any other project-specific criteria.
   We described the working of the approach by assuming a three-layer decomposition. The work presented in this paper can be extended to more than three layers. The algorithm-centric technique needs to be adapted by redesigning rules for the additional layers. Also, the supervised classification method can be adjusted by relabelling program elements with the number of layers considered.
   Exploring the impact of fusing structural properties with semantic features, such as the dominant concern addressed by a programming element, would be an exciting exercise for future exploration.

References

 [1] D. Link, P. Behnamghader, R. Moazeni, B. Boehm, The value of software architecture recovery for maintenance, in: Proceedings of the 12th Innovations on Software Engineering Conference (formerly known as India Software Engineering Conference), 2019, pp. 1–10.
 [2] A. Pacheco, G. Marín-Raventós, G. López, Designing a technical debt visualization tool to improve stakeholder communication in the decision-making process: a case study, in: International Conference on Research and Practical Issues of Enterprise Information Systems, Springer, 2018, pp. 15–26.
 [3] A. Shahbazian, Y. K. Lee, D. Le, Y. Brun, N. Medvidovic, Recovering architectural design decisions, in: 2018 IEEE International Conference on Software Architecture (ICSA), IEEE, 2018, pp. 95–9509.
 [4] P. J. Carrington, J. Scott, S. Wasserman, Models and methods in social network analysis, volume 28, Cambridge University Press, 2005.
 [5] S. B. Thakare, A. W. Kiwelekar, Skiplpa: An efficient label propagation algorithm for community detection in sparse network, in: Proceedings of the 9th Annual ACM India Conference, 2016, pp. 97–106.
 [6] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of Modern Physics 74 (2002) 47.
 [7] D. J. Watts, Networks, dynamics, and the small-world phenomenon, American Journal of Sociology 105 (1999) 493–527.
 [8] J. S. Silva, A. M. Saraiva, A methodology for applying social network analysis metrics to biological interaction networks, in: 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2015, pp. 1300–1307.
 [9] G. Amitai, A. Shemesh, E. Sitbon, M. Shklar, D. Netanely, I. Venger, S. Pietrokovski, Network analysis of protein structures identifies functional residues, Journal of Molecular Biology 344 (2004) 1135–1146. doi:10.1016/j.jmb.2004.10.055.
[10] M. E. J. Newman, Random graphs as models of networks, arXiv preprint cond-mat/0202208 (2002).
[11] M. E. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167–256.
[12] S. P. Borgatti, Centrality and network flow, Social Networks 27 (2005) 55–71.
[13] L. C. Freeman, Centrality in social networks conceptual clarification, Social Networks 1 (1979) 215–239.
[14] D. R. White, S. P. Borgatti, Betweenness centrality measures for directed graphs, Social Networks 16 (1994) 335–346.
[15] P. Bonacich, Some unique properties of eigenvector centrality, Social Networks 29 (2007) 555–564.
[16] F. Buschmann, K. Henney, D. C. Schmidt, Pattern-oriented software architecture, on patterns and pattern languages, volume 5, John Wiley & Sons, 2007.
[17] C. A. U. Hassan, M. S. Khan, M. A. Shah, Comparison of machine learning algorithms in data classification, in: 2018 24th International Conference on Automation and Computing (ICAC), IEEE, 2018, pp. 1–6.
[18] J. Hao, T. K. Ho, Machine learning made easy: A review of scikit-learn package in Python programming language, Journal of Educational and Behavioral Statistics 44 (2019) 348–361.
[19] P. Greenwood, T. Bartolomei, E. Figueiredo, M. Dosea, A. Garcia, N. Cacho, C. Sant'Anna, S. Soares, P. Borba, U. Kulesza, et al., On the impact of aspectual decompositions on design stability: An empirical study, in: European Conference on Object-Oriented Programming, Springer, 2007, pp. 176–200.
[20] ConStore, http://www.cse.iitb.ac.in/constore, 2009.
[21] C. Goutte, E. Gaussier, A probabilistic interpretation of precision, recall and F-score, with implication for evaluation, in: European Conference on Information Retrieval, Springer, 2005, pp. 345–359.
[22] J. Garcia, I. Ivkovic, N. Medvidovic, A comparative analysis of software architecture recovery techniques, in: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2013, pp. 486–496.
[23] O. Maqbool, H. Babri, Hierarchical clustering for software architecture recovery, IEEE Transactions on Software Engineering 33 (2007) 759–780.
[24] A. W. Kiwelekar, R. K. Joshi, An ontological framework for architecture model integration, in: Proceedings of the 4th International Workshop on Twin Peaks of Requirements and Architecture, 2014, pp. 24–27.
[25] A. W. Kiwelekar, R. K. Joshi, Ontological analysis for generating baseline architectural descriptions, in: European Conference on Software Architecture, Springer, 2010, pp. 417–424.
[26] A. Isazadeh, H. Izadkhah, I. Elgedawy, Source code modularization: theory and techniques, Springer, 2017.
[27] T. Lutellier, D. Chollak, J. Garcia, L. Tan, D. Rayside, N. Medvidović, R. Kroeger, Measuring the impact of code dependencies on software architecture recovery techniques, IEEE Transactions on Software Engineering 44 (2017) 159–181.
[28] S. Mohammadi, H. Izadkhah, A new algorithm for software clustering considering the knowledge of dependency between artifacts in the source code, Information and Software Technology 105 (2019) 252–256.
[29] A. Tommasel, Applying social network analysis techniques to architectural smell prediction, in: 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), IEEE, 2019, pp. 254–261.
[30] V. H. Nguyen, L. M. S. Tran, Predicting vulnerable software components with dependency graphs, in: Proceedings of the 6th International Workshop on Security Measurements and Metrics, 2010, pp. 1–8.
[31] S. Sarkar, G. Maskeri, S. Ramachandran, Discovery of architectural layers and measurement of layering violations in source code, Journal of Systems and Software 82 (2009) 1891–1905.
[32] S. Sarkar, V. Kaulgud, Architecture reconstruction from code for business applications: a practical approach, in: 1st India Workshop on Reverse Engineering (IWRE), 2010.