Mapping Program Elements to Layers using Centrality Measures and Machine Learning Techniques

Sanjay B Thakre, Arvind W Kiwelekar
Department of Computer Engineering, Dr Babasaheb Ambedkar Technological University, Lonere-Raigad, 402103, India

ECSA 2021 Companion Volume, Robert Heinrich, Raffaela Mirandola and Danny Weyns, Växjö, Sweden, 13–17 September, 2021.
Email: mail2sbt@gmail.com (S. B. Thakre); awk@dbatu.ac.in (A. W. Kiwelekar). Web: https://awk-net.group/ (A. W. Kiwelekar).
ORCID: 0000-0002-9647-6403 (S. B. Thakre); 0000-0002-3407-0221 (A. W. Kiwelekar).
© 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

The necessity of explicit architecture descriptions for communicating system functionalities and supporting system maintenance activities has been continuously emphasized. This paper presents an approach to extract layering information, a form of architecture description, using centrality measures from Social Network Analysis theory and supervised machine learning algorithms. The layer recovery approach presented in this paper works in two phases. The first phase calculates centrality measures for each program element in an application. The second phase assumes that the application has been implemented around the layered architecture style and maps program elements to appropriate layers. Two techniques for mapping program elements to layers are presented. The first technique maps program elements to layers using pre-defined rules, while the second learns the mapping rules from a pre-labelled data set. The paper presents an evaluation of both techniques.

Keywords: Layered Architecture Style, Architecture Descriptions, Architecture Recovery, Centrality Measures, Module Dependency View, Supervised Classification.

1. Introduction

The value of explicit software architecture has been increasingly recognized for software maintenance and evolution activities [1]. In particular, architecture descriptions in terms of high-level abstractions such as patterns, styles, and views have been found to be a valuable tool for communicating system functionalities effectively [2, 3]. Despite the numerous benefits, a legacy or open-source software system often lacks such architecture descriptions. Moreover, when such architecture descriptions are available, they are often not aligned with the latest version of the system [3].

In such situations, a lightweight architecture recovery approach that approximately represents a system decomposition may be more convenient than sophisticated architecture recovery techniques. Such a lightweight approach should quickly extract the information necessary to build a system decomposition, so that it can provide much-needed assistance to software architects dealing with re-engineering or modernization of existing systems, thus increasing their productivity.

Intending to design a lightweight approach, this paper presents an architecture recovery method to extract the layered decomposition of an implemented system. The method uses centrality measures from the theory of Social Network Analysis [4] to analyze the software structure formed by dependency relationships. Three observations drove the rationale behind using centrality measures for architecture extraction: (i) most of these measures provide a highly intuitive and computationally simple way to analyze interactions when a graph represents the structure of a system; (ii) these measures quantify the structure of a system at multiple levels, i.e., at the level of a particular node, with respect to other nodes in the graph, and at the level of a group of nodes or communities; (iii) these measures support the development of data-driven approaches to architecture recovery, and hence such approaches can learn the rules of architecture recovery from given data.

The approach recovers layering information in two phases. In the first phase, a centrality score is assigned to each program element. We assume that system functionalities are decomposed among multiple layers, and so in the second phase, a layer is assigned to each program element.

The paper contributes to the existing knowledge base of the architecture recovery domain in the following ways.
(1) It demonstrates the use of centrality measures to recover layering information. (2) It describes a data-driven approach to mapping program elements to layers using supervised classification algorithms. (3) It presents an evaluation of supervised classification algorithms to extract layering information.

The rest of the paper is organized as follows: Section 2 defines various centrality measures. Section 3 describes the central element of the approach. The algorithmic and data-driven approaches to the problem of layer assignment are explained in Section 4. The evaluation of the approach is presented in Section 5. Section 6 puts our approach in the context of existing approaches. Finally, the paper concludes in Section 7.

2. Social Network Analysis Measures

The theory of Social Network Analysis (SNA) provides a generic framework for analyzing the structure of complex systems. It includes a rich set of measures, models, and methods to extract the patterns of interactions among a system's elements. To support systems from diverse application domains, a complex system is expressed as a network of nodes and edges.

A few examples of complex systems that have been analyzed with the help of SNA include communities on social media platforms [5] and neural systems [6]. Techniques from SNA have been applied to control the spread of diseases [7], to understand biological systems [8], and to investigate protein interactions [9]. In these applications, a complex system is represented as a graph, which is then analyzed through measures such as centrality, scale-freeness, small-world structure, and community structure [10, 11, 12, 13]. The commonly used SNA measures relevant to our study are described below.

SNA provides a range of measures at varying levels: some are applied at the node level while others are applied at the network level. Node-level measures are calculated from the nodes directly connected to a given node. The centrality measures are node-level measures quantifying the importance of an individual node in the network. A central node is an influential node having significant potential to communicate and access information. Different centrality measures exist; they are derived from the connections to a node, the position of a node in the network, and the relative importance of nodes.

2.1. Degree centrality

This measure determines a central node based on the connections to individual nodes. A node with a higher degree in a network is considered the most influential one. In a directed graph, two different measures exist, in-degree and out-degree, based on the number of incoming and outgoing edges, respectively. The degree centrality $C_D(v)$ of a node $v$ is equal to the number of its connections, $\deg(v)$, normalized ($NC_D$) by the maximum possible degree of a node:

    $C_D(v) = \deg(v)$    (1)

    $NC_D = \frac{C_D(v)}{n-1} = \frac{\deg(v)}{n-1}$    (2)
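As a quick worked check of Equations (1) and (2), consider an illustrative node $v$ with three connections in a network of $n = 7$ nodes:

    $C_D(v) = \deg(v) = 3, \qquad NC_D = \frac{3}{7-1} = 0.5$

The normalization bounds the score by the maximum possible degree ($n-1$), keeping scores comparable across networks of different sizes.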
2.2. Closeness centrality

This measure identifies an influential node in terms of a faster and broader spread of information in a network. Influential nodes are characterized by a smaller inter-node distance, which signifies a faster transfer of information. Closeness centrality is derived from the average distance from a node to all the connected nodes at different depths. The distance between disconnected components of the network is infinite and hence is excluded. For a central node, the average distance is small, and the centrality is calculated as the inverse of the sum of the distances $d_{vw}$ to all other nodes. The normalized closeness $NC_C$ lies in the range from 0 to 1, where 0 represents an isolated node and 1 indicates a strongly connected node.

    $C_C(v) = \frac{1}{\sum_w d_{vw}}$    (3)

    $NC_C = \frac{n-1}{\sum_w d_{vw}}$    (4)

2.3. Betweenness centrality

This measure identifies those central nodes which are responsible for connecting two or more components of the network. Removal of such a central node would disconnect the network; hence these nodes act as bridges that pass information [12, 14]. Betweenness centrality $C_B(v)$ is defined as the number of shortest paths passing through a node $v$:

    $C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$    (5)

where $\sigma_{st}$ is the total number of shortest paths from a node $s$ to a node $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$. The relative betweenness centrality $C'_B(v)$ of a node, with respect to the maximum attainable centrality, is calculated from $C_B(v)$:

    $C'_B(v) = \frac{2\,C_B(v)}{n^2 - 3n + 2}$    (6)

2.4. Eigenvector centrality

Eigenvector centrality is a relative centrality measure, unlike the previous three measures, which are absolute. Its calculation depends on the largest real eigenvalue of the symmetric adjacency matrix. The centrality of a node $v$ is proportional to the sum of the centralities of the nodes connected to it [15, 12]:

    $\lambda v_i = \sum_{j=1}^{n} a_{ij} v_j$    (7)

In general, it requires the solution of the equation $Av = \lambda v$, where $A$ is the adjacency matrix.

Figure 1: An example of class dependencies and their centrality scores.

PID        in-degree  out-degree  C_D(v)  C_B(v)  C_C(v)  λv_i
Class A        0          3          3      0      0.71    0
Class B        0          1          1      0      0.5     0
Class C        1          1          2      2      0.6     0.055
Class D        2          2          4      2.5    1       0.27
Class E        2          3          5      5.5    0.8     0.0055
Class F        3          0          3      0      0       1
Class G        2          0          2      0      0       0.99
Layer 2        1          5          6      0      1       0.5
Layer 1        4          5          9      1      1       0.5
Layer 0        5          0          5      0      0       1

Figure 1 shows the centrality scores of various programming elements calculated from the dependencies shown in the figure. Note that centrality scores can be calculated at different granularity levels, i.e., at the object, method, class, package, or logical-layer level. In the figure, centrality scores are computed at the class and layer levels. Here, we consider a layer to be a logical encapsulation unit loosely holding multiple classes.
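Scores of this kind can be computed with an off-the-shelf graph library. The minimal Python sketch below (assuming the networkx package) computes the five measures of this section for a small, illustrative class-dependency graph; the edge list is not the exact graph of Figure 1, and networkx's normalization conventions differ slightly from Equations (1)-(7), so absolute values will differ.

    # Sketch: the five centrality measures of Section 2 via networkx.
    # The edge list is illustrative; an edge (u, v) means "u depends on v".
    import networkx as nx

    G = nx.DiGraph([
        ("A", "D"), ("A", "E"), ("A", "F"),
        ("B", "E"), ("C", "D"),
        ("D", "F"), ("D", "G"),
        ("E", "C"), ("E", "F"), ("E", "G"),
    ])

    in_deg = dict(G.in_degree())             # incoming edges per node
    out_deg = dict(G.out_degree())           # outgoing edges per node
    betweenness = nx.betweenness_centrality(G, normalized=True)
    closeness = nx.closeness_centrality(G)   # incoming-distance convention
    # Eigenvector centrality on the symmetric (undirected) adjacency matrix,
    # matching the definition in Section 2.4.
    eigen = nx.eigenvector_centrality(G.to_undirected(), max_iter=1000)

    for v in sorted(G.nodes):
        print(v, in_deg[v], out_deg[v],
              round(betweenness[v], 3), round(closeness[v], 3),
              round(eigen[v], 3))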
3. Approach

The broad objective of the approach is to extract high-level layering information, a form of architecture description, from implementation artefacts so that software architects can perform analyses specific to the layered architecture style. In this paper, we demonstrate our approach with the help of implementation artefacts available in a Java-based system. The method assumes that the system under study is implemented around the layered architecture style; a software maintenance engineer can verify such assumptions from earlier documentation of the system, when available. As shown in Figure 2, the approach consists of the following two phases.

Figure 2: Block diagram of a tool implemented in Java to discover layered architecture using centrality.

1. Dependency Network Builder and Analysis: This phase retrieves the dependencies present among implementation artefacts. For a Java-based system, this phase takes Java or Jar files as input and generates a dependency network. Programming elements such as Classes, Interfaces, and Packages are the nodes of the network, and Java relationships such as extends, implements, and imports are the edges. The output of this stage is represented as a graph in Graph Modeling Language (GML). In the second stage, a centrality score is assigned to each node. The centrality score includes the different measures described in Section 2, calculated at the Class and Interface levels. The output of this stage is a data file in CSV format describing the centrality score assigned to each program element.

2. Layer Assignment: The layer assignment activity aims to assign the most appropriate layer to each program element. Additional style-specific analyses, such as analysis of layer violations and performance modeling, can be supported once the programming elements are assigned to appropriate layers.

Building a dependency network and calculating centrality scores are straightforward activities when tools such as JDependency (https://sourceforge.net/projects/javadepend/) are available. However, assigning program elements to layers is a trickier issue considering the large number of program elements and the relationships among them. We describe two techniques for this purpose in the following section.
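As a concrete stand-in for phase 1, the Python sketch below assumes the extends/implements/imports triples have already been extracted from the Java sources (the tool itself relies on JDependency for this step); it builds the dependency network, writes the stage-one GML graph, and emits the stage-two per-element centrality CSV. The triples, class names, and file names are illustrative placeholders.

    # Sketch of phase 1: dependency triples -> GML graph -> centrality CSV.
    import csv
    import networkx as nx

    dependencies = [                       # (source, target, relation); illustrative
        ("ui.LoginView", "service.AuthService", "imports"),
        ("ui.LoginView", "dao.UserDao", "imports"),
        ("service.AuthService", "dao.UserDao", "imports"),
        ("dao.UserDaoImpl", "dao.UserDao", "implements"),
        ("dao.UserDaoImpl", "dao.AbstractDao", "extends"),
    ]

    G = nx.DiGraph()
    for src, dst, rel in dependencies:
        G.add_edge(src, dst, relation=rel)

    nx.write_gml(G, "dependency_network.gml")    # stage-1 output: GML graph

    scores = {
        "in": dict(G.in_degree()),
        "out": dict(G.out_degree()),
        "between": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "eigen": nx.eigenvector_centrality(G.to_undirected(), max_iter=1000),
    }

    with open("centrality_scores.csv", "w", newline="") as f:   # stage-2 output
        writer = csv.writer(f)
        writer.writerow(["element"] + list(scores))
        for v in G.nodes:
            writer.writerow([v] + [scores[m][v] for m in scores])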
4. Layer Assignment

The objective of the layer assignment stage is to identify the most appropriate layer for each program element based on its centrality measures. Here, we use the term layer in the loose sense that a layer is a logical coarse-grained unit that decomposes system functionalities and encapsulates program elements according to common system functionalities. It is not the term layer in the strict sense used in the Layers architecture style [16].

We assume a decomposition into three layers. The decision to decompose all system responsibilities into three layers is based on the observation that the functionalities of most architectural styles can be cleanly decomposed into three coarse-grained units. For example, the Model-View-Controller (MVC) and Presentation-Abstraction-Control (PAC) styles [16], as well as the 3-tier architectural pattern, all have three units of system decomposition.

Two different techniques based on centrality measures are developed to assign program elements to layers. The first technique uses a set of pre-defined rules. The second technique automatically learns the assignment rules from pre-labelled layer assignments using a supervised classification algorithm.

4.1. Rule-Driven Layer Assignment

This technique uses the centrality measures described in Section 2 to map program elements to the most appropriate layer. The degree centrality of Section 2.1 is further divided into in-degree and out-degree, which count the incoming and outgoing edges of a node. In total, five centrality measures are used. Five accessor functions, namely in, out, between, closeness, and eigen, are defined to retrieve the in-degree, out-degree, betweenness, closeness, and eigenvector centrality values associated with a specific node, respectively.

The three layers are labelled upper, middle, and lower. Table 1 describes the relative significance of the various centrality measures with respect to the upper, middle, and lower layers. A set of configuration parameters, shown in Table 2, is defined; these parameters provide flexibility while mapping program elements to a specific layer.

Table 1: Relative significance of centrality measures with respect to layers

Centrality    upper  middle  lower
in-degree     low    -       high
out-degree    high   -       low
between-ness  low    high    low
close-ness    high   high    low
eigenvector   low    low     high

Table 2: Configuration parameters

δ_il, δ_iu : lower and upper bound for in-degree centrality values
δ_ol, δ_ou : lower and upper bound for out-degree centrality values
δ_b : critical value for between-ness centrality
δ_c : critical value for close-ness centrality
δ_e : critical value for eigenvector centrality

Algorithm 1 operates on a dependency network in which nodes represent program elements and edges represent dependencies. Its objective is to partition the node space into three segments corresponding to the lower, middle, and upper layers. This objective is achieved in two stages. First, the algorithm calculates two different partitions: the first partition, inPartition, is calculated using the in-degree centrality measure, while the second partition, outPartition, is calculated using the out-degree centrality measure.

Algorithm 1: primaryLabel(inDegree, outDegree, n)
Input: inDegree[1:n], outDegree[1:n]: Vector; n: Integer
Output: inPartition[1:n], outPartition[1:n]: Vector
 1: Initialize δ_iu, δ_il, δ_ou and δ_ol
 2: for node in 1 to n do
 3:   if in(node) = 0 and out(node) = 0 then
 4:     inPartition[node] ← lower
 5:     outPartition[node] ← lower
 6:   else
 7:     if in(node) > δ_il then
 8:       inPartition[node] ← lower
 9:     else
10:       if in(node) < δ_iu then
11:         inPartition[node] ← upper
12:       else
13:         inPartition[node] ← middle
14:       end if
15:     end if
16:     if out(node) > δ_ou then
17:       outPartition[node] ← upper
18:     else
19:       if out(node) < δ_ol then
20:         outPartition[node] ← lower
21:       else
22:         outPartition[node] ← middle
23:       end if
24:     end if
25:   end if
26: end for
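For readers who prefer running code, below is a minimal Python transcription of Algorithm 1 over dictionaries of in- and out-degrees. The default thresholds follow the ConStore column of Table 6 and would normally be tuned per system.

    # Python transcription of Algorithm 1 (primaryLabel). in_deg and out_deg
    # map each program element to its in-/out-degree; defaults follow the
    # ConStore column of Table 6.
    def primary_label(in_deg, out_deg,
                      delta_il=4, delta_iu=1, delta_ol=4, delta_ou=1):
        in_part, out_part = {}, {}
        for node in in_deg:
            if in_deg[node] == 0 and out_deg[node] == 0:
                in_part[node] = out_part[node] = "lower"   # isolated element
                continue
            # In-degree partition: heavily used elements sink to the lower layer.
            if in_deg[node] > delta_il:
                in_part[node] = "lower"
            elif in_deg[node] < delta_iu:
                in_part[node] = "upper"
            else:
                in_part[node] = "middle"
            # Out-degree partition: elements using many services rise upward.
            if out_deg[node] > delta_ou:
                out_part[node] = "upper"
            elif out_deg[node] < delta_ol:
                out_part[node] = "lower"
            else:
                out_part[node] = "middle"
        return in_part, out_part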
After the execution of Algorithm 1, each node is labeled with two labels corresponding to layers. The observed combinations of labels include (lower, lower), (middle, middle), (upper, upper), (middle, upper), and (middle, lower). Among these, labellings such as (middle, upper) and (middle, lower) are conflicting because two different labels are assigned to the same node. This conflict needs to be resolved.

Algorithm 2 resolves the conflicting labels and assigns a unique label to each node. The conflicting labels are resolved using the rules described in the decision Table 3; the function upDown called in Algorithm 2 applies these rules. The rules in Table 3 resolve the conflicting assignments using the closeness, betweenness, and eigenvector centrality measures, while the primary layer assignment by Algorithm 1 is done with the in-degree and out-degree centrality measures. When Algorithm 2 is executed, some of the nodes from the middle layer bubble up to the upper layer, some fall to the lower layer, and some remain in the middle layer. The vector nodeLabels holds the unique labelling of each node in the dependency network after all conflicts are resolved.

Algorithm 2: refineLabel(inPartition, outPartition, n)
Input: inPartition[1:n], outPartition[1:n]: Vector; n: Integer
Output: nodeLabels[1:n]: Vector
1: Initialize δ_b, δ_c and δ_e
2: for node in 1 to n do
3:   if inPartition[node] = outPartition[node] then
4:     nodeLabels[node] ← outPartition[node]
5:   else
6:     nodeLabels[node] ← upDown(inPartition[node], outPartition[node])
7:   end if
8: end for

Table 3: Decision table used to refine layering

Layer   Measure    Significance  Rationale
upper   in         0             Classes with in-degree equal to 0 are placed in the upper layer.
upper   out        high          Classes with high out-degree are placed in the top layer because they use services from the layers beneath them.
upper   closeness  high          Classes with high closeness are placed in the upper layer because of the large average distance from the top layer to the bottom layer.
middle  between    high          Classes with high betweenness are placed in the middle layer as they fall on the path from the top layer to the bottom layer.
lower   in         high          Classes with high in-degree are placed in the bottom layer because they are highly used.
lower   out        0             Classes with out-degree equal to zero are placed in the bottom layer because they only provide services.
lower   eigen      1             Classes with eigenvector value equal to 1 are placed in the bottom layer because they are highly reused.
lower   in, out    -             Classes with in-degree and out-degree both equal to 0 are placed in the bottom layer because they are isolated classes.
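The paper specifies upDown only through the decision rules of Table 3. The sketch below is therefore one plausible realization, not the authors' exact implementation: it consults the eigenvector, closeness, and betweenness scores directly (rather than the two conflicting labels), and the precedence among the rules is our assumption; the delta thresholds mirror Table 2.

    # One plausible realization of Algorithm 2 and its upDown helper.
    def up_down(node, between, closeness, eigen,
                delta_b=6, delta_c=0.8, delta_e=0.6):
        if eigen[node] >= delta_e:        # Table 3: highly reused -> lower layer
            return "lower"
        if closeness[node] >= delta_c:    # Table 3: high closeness -> upper layer
            return "upper"
        if between[node] >= delta_b:      # Table 3: bridge nodes -> middle layer
            return "middle"
        return "middle"                   # default for unresolved conflicts

    def refine_label(in_part, out_part, between, closeness, eigen):
        node_labels = {}
        for node in in_part:
            if in_part[node] == out_part[node]:
                node_labels[node] = out_part[node]   # agreeing labels kept as-is
            else:
                node_labels[node] = up_down(node, between, closeness, eigen)
        return node_labels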
4.2. Supervised Classification based Layer Assignment

The configuration parameters need to be suitably initialized for the correct functioning of the rule-driven approach discussed in the previous section: the system architect responsible for architecture recovery needs to fine-tune the parameters to get layering at the desired level of abstraction. To overcome this drawback, a data-driven approach is developed to assign labels to the programming elements.

In the data-driven approach, the problem of layer assignment is modeled as a multi-class classification problem with three labels, lower, middle, and upper, numerically encoded as 1, 2, and 3, respectively. The classification model is trained on a labeled data set. The data set, as shown in Table 4, includes program element identifiers, the values of all the centrality measures, and layering labels as specified by the system architect responsible for architecture recovery. The layering labels can be taken from a previous version of the system under study, or they can be labels guessed by the system architect to explore different alternatives for system decomposition.

Table 4: Sample observations from the data sets used for supervised learning

Id  Label                           In-Degree  Out-Degree  Closeness  Betweenness  Eigenvector  Layer
HealthWatcher
1   ComplaintRecord                  1          10          1.714      19           0.0056       2
2   ObjectAlreadyInsertedException  37           0          0           0           0.347        1
3   ObjectNotFoundException         53           0          0           0           0.943        1
4   ObjectNotValidException         41           0          0           0           0.883        1
5   RepositoryException             60           0          0           0           1            1
ConStore
1   Cache                            2           1          1           0           0.0162       2
2   CacheObject                      4           0          0           0           0.053        2
3   LRUCache                         0           2          1           0           0            2
4   MRUCache                         1           2          1           7           0.0246       2
5   ItemQuery                        1          20          0.412      47.166       0.0388       2

We implement three supervised classification algorithms, namely K-Nearest Neighbour, Support Vector Machine, and Decision Tree. These machine learning algorithms are particularly suited to multi-class classification problems; a detailed comparison of these algorithms can be found in [17]. Python's Scikit-Learn [18] library is used to develop the classification models based on these algorithms. Table 4 shows the format of the sample data set used to train the classification models. The developed models are evaluated against classification metrics such as accuracy, precision, recall, and F1-score.
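A minimal end-to-end sketch of this training setup with scikit-learn is shown below. The CSV column names and file names are assumptions about the phase-1 output format, the train/test split mirrors Section 5 (train on the labeled systems, test on ConStore), and the classifiers run with default hyper-parameters because the paper does not report them.

    # Sketch of the Section 4.2 setup: load labeled centrality features,
    # train the three classifiers, report per-layer metrics.
    import csv
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import classification_report

    FEATURES = ("in", "out", "between", "closeness", "eigen")  # assumed columns

    def load_labeled_csv(path):
        # One row per program element: five centrality features + layer label.
        X, y = [], []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                X.append([float(row[c]) for c in FEATURES])
                y.append(int(row["layer"]))   # 1 = lower, 2 = middle, 3 = upper
        return X, y

    X_train, y_train = load_labeled_csv("healthwatcher_labeled.csv")  # assumed file
    X_test, y_test = load_labeled_csv("constore_labeled.csv")         # assumed file

    for name, model in {"KNN": KNeighborsClassifier(),
                        "SVM": SVC(),
                        "DecisionTree": DecisionTreeClassifier(random_state=0)}.items():
        model.fit(X_train, y_train)
        print(name)
        print(classification_report(y_test, model.predict(X_test), zero_division=0))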
5. Model Development

The machine learning model development includes phases such as training, testing, and evaluation. This section describes how these phases are carried out.

5.1. Model Training

The following software systems are used to train and build the machine learning models.

1. Training Architecture System: A small-scale sample architecture system, as shown in Figure 3, has been designed specially to train the approach. It includes 16 classes without the implementation of any functionalities; it contains only dependencies among classes. The classes named SEC, TX, and SERI in the figure represent crosscutting concerns.

Figure 3: Architecture of a system designed to train the approach.

2. HealthWatcher: HealthWatcher is a web-based application providing healthcare-related services [19]. This application is selected to train the model because existing literature in the public domain confirms that the application follows a client-server and layered architecture style. Initially, the first author manually assigned labels to all the programming elements; later, the second author checked the labelling of individual elements. We used the following rules to label programming elements: (i) programming elements that access low-level device functions and data access functions are labelled lower; (ii) programming elements accessing presentation functions are labelled upper; (iii) programming elements that provide business logic or depend on programming elements defined within the application are labelled middle.

In total, one hundred fifty-one classes, i.e. data instances, are used to train the model: sixteen classes from the specially designed system and 135 classes from the HealthWatcher system.

5.2. Model Testing

We used ConStore [20], a small-scale Java-based library designed to manage concept networks, to test the model. This application is selected to test the model because the second author was previously involved in recovering its architecture and knows the details of the system. ConStore is a framework for detailing out concepts and creating a domain model for a given application.

5.3. Model Evaluation

The performance of classification models is typically evaluated against measures such as accuracy, precision, recall, and F1-score [21]. These metrics are derived from a confusion matrix, which compares the actual class labels of the observations in a given data set with the class labels predicted by a classification model. Table 5 shows the performance analysis against these metrics; the table compares the performance of the algorithm-centric approach and the data-driven approach.

Table 5: Accuracy and confusion matrix for the data-driven and algorithmic approaches
(Confusion matrix rows: actual lower/middle/upper; columns: predicted lower/middle/upper)

HealthWatcher (size: 135 classes or interfaces)

            SVM           Decision Tree  KNN           Rule-based
lower       47  1  9      49  4  4       41  8  8      28 16 13
middle      20  5 12      15 20  2        7 28  2       6 30  1
upper        5  1 35       7  0 34        6  6 29       3  9 29
Accuracy    0.64          0.76           0.72          0.63

Recall (R), Precision (P), F1-score (F1)
            R    P    F1     R    P    F1     R    P    F1     R    P    F1
lower       0.65 0.82 0.73   0.69 0.86 0.77   0.76 0.72 0.74   0.76 0.49 0.60
middle      0.71 0.14 0.23   0.83 0.54 0.66   0.67 0.76 0.71   0.55 0.81 0.65
upper       0.62 0.85 0.72   0.85 0.83 0.84   0.74 0.71 0.72   0.66 0.66 0.66

Training Architecture System (size: 16 classes)

            SVM           Decision Tree  KNN           Rule-based
lower        5  2  0       4  3  0        5  2  0       5  2  0
middle       1  4  0       0  5  0        1  4  0       1  4  0
upper        1  0  3       0  1  3        4  0  0       1  0  3
Accuracy    0.75          0.75           0.56          0.75

Recall (R), Precision (P), F1-score (F1)
            R    P    F1     R    P    F1     R    P    F1     R    P    F1
lower       0.71 0.71 0.71   1.00 0.57 0.73   0.50 0.71 0.59   0.71 0.71 0.71
middle      0.67 0.80 0.73   0.56 1.00 0.71   0.67 0.80 0.73   0.67 0.80 0.73
upper       1.00 0.75 0.86   1.00 0.75 0.86   0.00 0.00 0.00   1.00 0.75 0.86

ConStore (size: 66 classes or interfaces)

            SVM           Decision Tree  KNN           Rule-based
lower       43  0  0      43  0  0       40  3  0      27 16  0
middle      14  0  1      13  2  0       12  3  0       9  5  1
upper        6  0  2       6  0  2        7  1  0       4  2  2
Accuracy    0.68          0.71           0.65          0.52

Recall (R), Precision (P), F1-score (F1)
            R    P    F1     R    P    F1     R    P    F1     R    P    F1
lower       0.68 1.00 0.81   0.69 1.00 0.82   0.68 0.93 0.78   0.68 0.63 0.65
middle      0.00 0.00 0.00   1.00 0.13 0.24   0.43 0.20 0.27   0.22 0.33 0.26
upper       0.67 0.25 0.36   1.00 0.25 0.40   0.00 0.00 0.00   0.67 0.25 0.36
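All four metrics of Table 5 can be derived from a single pair of label vectors. The sketch below shows the derivation with scikit-learn; the two label vectors are illustrative stand-ins, not data from the paper.

    # Sketch: deriving Table 5 style metrics from actual vs. predicted labels.
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_recall_fscore_support)

    y_true = [1, 1, 1, 2, 2, 3, 3, 3]   # actual layers (1 lower, 2 middle, 3 upper)
    y_pred = [1, 1, 2, 2, 3, 3, 3, 1]   # layers predicted by some model

    print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3]))
    print("accuracy:", accuracy_score(y_true, y_pred))
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=[1, 2, 3], zero_division=0)
    for layer, ri, pi, fi in zip(("lower", "middle", "upper"), r, p, f1):
        print(f"{layer}: R={ri:.2f} P={pi:.2f} F1={fi:.2f}")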
5.3.1. Accuracy Analysis

Accuracy is the rate of correct classification; the higher the accuracy, the better the model. From the accuracy point of view, one can observe from Table 5 that the data-driven approach performs better than the algorithm-centric approach. The decision-tree-based classifier performs best on all the test cases, with an average accuracy of 74%. This is because the performance of the algorithmic approach depends on the proper tuning of the various configuration parameters. The results shown in Table 5 are obtained with the configuration parameter values given in Table 6, which shows the combination of configuration values yielding the best accuracy over twenty-five iterations. During each iteration, the values of the configuration parameters were incremented by 1, or by 0.1 for δ_c and δ_e.

Table 6: Configuration parameters and their values

         ConStore  HealthWatcher  Test Arch.
δ_il     4         10             2
δ_iu     1         1              1
δ_ol     4         2              2
δ_ou     1         5              2
δ_b      6         9              6
δ_c      0.8       0.8            0.6
δ_e      0.6       0.5            0.6

The machine learning models automatically learn and adjust their model parameters for better accuracy. In the case of the algorithmic approach, configuration parameter tuning is an iterative process, and one needs to try different combinations.
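The tuning loop described above can be sketched as follows. predict_fn is expected to wrap Algorithms 1 and 2 and return one label per program element; stepping all parameters in lockstep is a simplification of the combinations the paper actually explored.

    # Sketch of the twenty-five-iteration parameter sweep: each threshold is
    # incremented by 1 (or by 0.1 for d_c and d_e) and the best-scoring
    # combination is kept.
    from sklearn.metrics import accuracy_score

    def tune_parameters(predict_fn, y_true, iterations=25):
        params = {"d_il": 1, "d_iu": 1, "d_ol": 1, "d_ou": 1,
                  "d_b": 1, "d_c": 0.1, "d_e": 0.1}
        best_acc, best_params = 0.0, dict(params)
        for _ in range(iterations):
            acc = accuracy_score(y_true, predict_fn(**params))
            if acc > best_acc:
                best_acc, best_params = acc, dict(params)
            for key in params:
                params[key] += 0.1 if key in ("d_c", "d_e") else 1
        return best_acc, best_params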
5.3.2. Recall, Precision, F1-Score Analysis

Recall indicates the proportion of correctly identified true positives, while precision is the proportion of correct positive identifications. High values of both recall and precision are desired, but it is difficult to achieve both simultaneously; hence, the F1-score combines recall and precision into one metric. From the recall, precision, and F1-score point of view, one can observe from Table 5 that the decision-tree-based classifier performs better, with the highest F1-score of 0.86 for the upper-layer classification of the test architecture system. Recalling class labels with high precision for the middle layer is a challenging task for all the models described in this paper. This is because many functionalities are not cleanly encapsulated in the modules of the middle layer, and crosscutting concerns must be mapped to one of the three layers.

5.3.3. Threats to Validity

The performance of the models has been evaluated in an academic setting with internal validation only. By internal validation, we mean that the performance of the algorithmic and data-driven techniques has been compared and analyzed against each other. This is because our prime aim is to demonstrate the significance of social network analysis measures in recovering architecture descriptions. The models' performance needs to be further compared against similar approaches developed earlier [22]. Also, the usefulness of the recovered layering information needs to be assessed in a software industry setting.

6. Related Work

Recovering architecture descriptions from code has been a widely and continuously explored problem among software architecture researchers. This has resulted in a large number of techniques [23, 24, 25], survey papers [22], and books [26] devoted to the topic. In the context of these earlier approaches, this section provides the rationale behind the implementation decisions taken while developing our approach.

(i) Include dependencies vs. symbolic dependencies: The recent study reported in [27] recognized that the quality of a recovered architecture depends on the type of dependencies analyzed to recover it. The study analyzes the impact of symbolic dependencies, i.e., dependencies at the program identifier level, versus include dependencies, i.e., dependencies at the level of importing files or including packages. Further, it emphasizes that symbolic dependencies are a more accurate way to recover structural information; the use of include dependencies is error-prone owing to the fact that a programmer may include a package without using it. We used include dependencies in our approach because extracting and managing them is simple compared to symbolic dependencies. Further, we mitigated the risk of unused packages by excluding such relationships from further analysis; many programming environments facilitate the removal of unused packages. One of our objectives was to develop a data-driven approach, and cleaning data in this way is an established practice in the field of data engineering.

(ii) Unsupervised clustering vs. supervised classification: Techniques of unsupervised clustering have been widely adopted to extract high-level architectures through the analysis of dependencies between implementation artefacts [23]. These approaches use hierarchical and search-based methods for clustering, and they usually take substantial search time to find architectures that are often not very good [28]. One advantage of clustering methods is that they are driven by unlabelled data sets; however, the identified clusters of program elements still need to be labelled appropriately. Our choice of a supervised classification method is driven by the fact that centrality measures quantify the structural properties of a node and its relation to other nodes, and processing such quantified values efficiently is one of the strengths of many supervised classification methods. Further, assigning layering labels to program elements is not an issue if such information is available from a previous version of the software, which may be the case for re-engineering or modernization projects. In the absence of such a labelled data set, the approach presented in the paper can still be adopted in two stages: in the first stage, a tentative layer labelling is done through the algorithmic approach, followed by labelling through the supervised classification method.

(iii) Applications of Social Network Analysis (SNA) measures: Interest in applying the theory of SNA has grown in recent times. Recent applications of SNA include predicting architectural smells [29] and predicting vulnerable software components [30] by measuring the structural similarity of program elements.
Our approach characterizes each program element through its centrality measures, which can be termed, in machine learning vocabulary, the feature representation necessary to build data-driven models.

7. Conclusion

The main highlights of the approach presented in the paper are: (i) the dependency graph formed by programming elements (i.e., classes in Java) is treated as a network, and centrality measures are applied to extract structural properties; (ii) each program element is represented as a set of values corresponding to the different centrality measures, i.e., as a feature vector in machine learning terminology; (iii) the paper treats a layer as a coarse-grained abstraction encapsulating common system functionalities and maps groups of programming elements sharing common structural properties to layers; (iv) the paper describes two mapping methods for this purpose, called algorithm-centric and data-driven; (v) overall, the data-driven method illustrated in the paper performs better than the algorithm-centric method.

The paper makes several assumptions, such as (i) the availability of a Java-based system implementation, (ii) a system decomposed into three layers, and (iii) the availability of a pre-labelled data set for supervised classification. These assumptions were made to simplify the realization and demonstration of the approach; they do not restrict it. The assumptions can be relaxed, and the approach is flexible enough to extend.

The layering information extracted by the approach can be viewed as one way of decomposing a system. It is not the single ground-truth architecture, which is often difficult to agree upon and laborious to discover [22]. Further, the quality of the extracted architecture descriptions, i.e., the clustering of program elements into layers, needs to be assessed for properties such as minimal layering violations [31, 32], satisfaction of a particular quality attribute [26], or any other project-specific criteria.

We described the working of the approach by assuming a three-layer decomposition. The work presented in this paper can be extended to more than three layers: the algorithm-centric technique needs to be adapted by redesigning the rules for additional layers, and the supervised classification method can be adjusted by relabelling program elements with the number of layers considered.

Exploring the impact of fusing structural properties with semantic features, such as the dominant concern addressed by a programming element, would be an exciting exercise for future exploration.

References

[1] D. Link, P. Behnamghader, R. Moazeni, B. Boehm, The value of software architecture recovery for maintenance, in: Proceedings of the 12th Innovations on Software Engineering Conference (formerly known as India Software Engineering Conference), 2019, pp. 1–10.
[2] A. Pacheco, G. Marín-Raventós, G. López, Designing a technical debt visualization tool to improve stakeholder communication in the decision-making process: a case study, in: International Conference on Research and Practical Issues of Enterprise Information Systems, Springer, 2018, pp. 15–26.
[3] A. Shahbazian, Y. K. Lee, D. Le, Y. Brun, N. Medvidovic, Recovering architectural design decisions, in: 2018 IEEE International Conference on Software Architecture (ICSA), IEEE, 2018, pp. 95–9509.
[4] P. J. Carrington, J. Scott, S. Wasserman, Models and methods in social network analysis, volume 28, Cambridge University Press, 2005.
[5] S. B. Thakare, A. W. Kiwelekar, SkipLPA: An efficient label propagation algorithm for community detection in sparse network, in: Proceedings of the 9th Annual ACM India Conference, 2016, pp. 97–106.
[6] R. Albert, A.-L. Barabási, Statistical mechanics of complex networks, Reviews of Modern Physics 74 (2002) 47.
[7] D. J. Watts, Networks, dynamics, and the small-world phenomenon, American Journal of Sociology 105 (1999) 493–527.
[8] J. S. Silva, A. M. Saraiva, A methodology for applying social network analysis metrics to biological interaction networks, in: 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, 2015, pp. 1300–1307.
[9] G. Amitai, A. Shemesh, E. Sitbon, M. Shklar, D. Netanely, I. Venger, S. Pietrokovski, Network analysis of protein structures identifies functional residues, Journal of Molecular Biology 344 (2004) 1135–1146. doi:10.1016/j.jmb.2004.10.055.
[10] M. E. J. Newman, Random graphs as models of networks, arXiv preprint cond-mat/0202208 (2002).
[11] M. E. Newman, The structure and function of complex networks, SIAM Review 45 (2003) 167–256.
[12] S. P. Borgatti, Centrality and network flow, Social Networks 27 (2005) 55–71.
[13] L. C. Freeman, Centrality in social networks: conceptual clarification, Social Networks 1 (1979) 215–239.
[14] D. R. White, S. P. Borgatti, Betweenness centrality measures for directed graphs, Social Networks 16 (1994) 335–346.
[15] P. Bonacich, Some unique properties of eigenvector centrality, Social Networks 29 (2007) 555–564.
[16] F. Buschmann, K. Henney, D. C. Schmidt, Pattern-Oriented Software Architecture: On Patterns and Pattern Languages, volume 5, John Wiley & Sons, 2007.
[17] C. A. U. Hassan, M. S. Khan, M. A. Shah, Comparison of machine learning algorithms in data classification, in: 2018 24th International Conference on Automation and Computing (ICAC), IEEE, 2018, pp. 1–6.
[18] J. Hao, T. K. Ho, Machine learning made easy: A review of scikit-learn package in Python programming language, Journal of Educational and Behavioral Statistics 44 (2019) 348–361.
[19] P. Greenwood, T. Bartolomei, E. Figueiredo, M. Dosea, A. Garcia, N. Cacho, C. Sant'Anna, S. Soares, P. Borba, U. Kulesza, et al., On the impact of aspectual decompositions on design stability: An empirical study, in: European Conference on Object-Oriented Programming, Springer, 2007, pp. 176–200.
[20] ConStore, http://www.cse.iitb.ac.in/constore, 2009.
[21] C. Goutte, E. Gaussier, A probabilistic interpretation of precision, recall and F-score, with implication for evaluation, in: European Conference on Information Retrieval, Springer, 2005, pp. 345–359.
[22] J. Garcia, I. Ivkovic, N. Medvidovic, A comparative analysis of software architecture recovery techniques, in: 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2013, pp. 486–496.
[23] O. Maqbool, H. Babri, Hierarchical clustering for software architecture recovery, IEEE Transactions on Software Engineering 33 (2007) 759–780.
[24] A. W. Kiwelekar, R. K. Joshi, An ontological framework for architecture model integration, in: Proceedings of the 4th International Workshop on Twin Peaks of Requirements and Architecture, 2014, pp. 24–27.
[25] A. W. Kiwelekar, R. K. Joshi, Ontological analysis for generating baseline architectural descriptions, in: European Conference on Software Architecture, Springer, 2010, pp. 417–424.
[26] A. Isazadeh, H. Izadkhah, I. Elgedawy, Source Code Modularization: Theory and Techniques, Springer, 2017.
[27] T. Lutellier, D. Chollak, J. Garcia, L. Tan, D. Rayside, N. Medvidović, R. Kroeger, Measuring the impact of code dependencies on software architecture recovery techniques, IEEE Transactions on Software Engineering 44 (2017) 159–181.
[28] S. Mohammadi, H. Izadkhah, A new algorithm for software clustering considering the knowledge of dependency between artifacts in the source code, Information and Software Technology 105 (2019) 252–256.
[29] A. Tommasel, Applying social network analysis techniques to architectural smell prediction, in: 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), IEEE, 2019, pp. 254–261.
[30] V. H. Nguyen, L. M. S. Tran, Predicting vulnerable software components with dependency graphs, in: Proceedings of the 6th International Workshop on Security Measurements and Metrics, 2010, pp. 1–8.
[31] S. Sarkar, G. Maskeri, S. Ramachandran, Discovery of architectural layers and measurement of layering violations in source code, Journal of Systems and Software 82 (2009) 1891–1905.
[32] S. Sarkar, V. Kaulgud, Architecture reconstruction from code for business applications: a practical approach, in: 1st India Workshop on Reverse Engineering (IWRE), 2010.