=Paper= {{Paper |id=Vol-1742/MRT16_paper_3 |storemode=property |title=DeSARM: A Decentralized Mechanism for Discovering Software Architecture Models at Runtime in Distributed Systems |pdfUrl=https://ceur-ws.org/Vol-1742/MRT16_paper_3.pdf |volume=Vol-1742 |authors=Jason Porter,Daniel Menasce,Hassan Gomaa |dblpUrl=https://dblp.org/rec/conf/models/PorterMG16 }} ==DeSARM: A Decentralized Mechanism for Discovering Software Architecture Models at Runtime in Distributed Systems== https://ceur-ws.org/Vol-1742/MRT16_paper_3.pdf

DeSARM: A Decentralized Mechanism for
Discovering Software Architecture Models at
Runtime in Distributed Systems
Jason Porter Daniel A. Menascé Hassan Gomaa
Department of Computer Science Department of Computer Science Department of Computer Science
George Mason University George Mason University George Mason University
Fairfax, Virginia 22030 Fairfax, Virginia 22030 Fairfax, Virginia 22030
Email: jporte10@gmu.edu Email: menasce@gmu.edu Email: hgomaa@gmu.edu

Abstract—Runtime models play a critical role in modern self-configuration, self-optimization, self-healing and self-
self-adaptive systems. Hence, runtime architectural models are protection, and are also called self-* or autonomic sys-
needed when making adaptation decisions in architecture-based tems [17]. In architecture-based self-adaptive systems, com-
self-adaptive systems. However, when these systems are dis-
tributed and highly dynamic, there is an added need to discover ponents dynamically change in order to continuously adhere
the systems software architecture model at runtime. Current to architectural properties and system goals. As a result, these
methods of runtime architecture discovery take a centralized systems require runtime architectural models when making
approach, in which the process is carried out from a single adaptation decisions [11]. However, when these systems are
location. These methods are inadequate for large distributed distributed and highly dynamic there is an added need to
systems because they do not scale up well and have a single
point of failure. Also, systems of such size consist of nodes discover the system’s architectural model at runtime.
that are typically highly dynamic in nature. Existing approaches Current methods of runtime architecture discovery take a
to architecture discovery are not capable of addressing these centralized approach to constructing architectures dynamically.
concerns. This paper describes DeSARM (Decentralized Software These methods are inadequate for large distributed systems
Architecture discoveRy Mechanism), a completely decentralized because they do not scale up well and have a single point of
and automated approach for runtime discovery of software archi-
tecture models of distributed systems based on gossiping and mes- failure. Also, systems of such size are often dynamic in nature
sage tracing. DeSARM is able to identify at runtime important due to changes in the runtime environment, where nodes join
architectural characteristics such as components and connectors, and leave the system unpredictably, and are subject to fail-
in addition to synchronous and asynchronous communication ures. Existing approaches do not address the aforementioned
patterns. Furthermore, through its use of gossiping, it exhibits the concerns.
properties of scalability, global consistency among participating
nodes, and resiliency to failures. The paper discusses DeSARM’s We describe a method for runtime software architecture
architecture and detailed design, and demonstrates its properties discovery in a completely decentralized manner. This is ac-
through experimentation. complished by keeping message logs at each node and dis-
seminating among the nodes identified component interaction
I. I NTRODUCTION patterns (i.e., synchronous, asynchronous, single or multiple
Software architecture—the high level structure of a software destination) derived from the logs through selective gossip
system including a collection of components, connectors and exchanges. With selective gossiping, information dissemina-
constraints—plays an increasingly critical role in the design tion takes place only between nodes that are part of the
and development of any large complex software system. These same application. This results in the containment of network
artifacts (i.e., components, connectors, and constraints) are traffic, which in effect reduces communication overhead. All
needed to reason about the system and act as a bridge references to DeSARM’s gossiping method heretofore refer
between requirements and implementation as well as provide to selective gossiping. Through its use of gossiping, our
a blueprint for system construction and composition. The mechanism addresses the limitations of the current approaches
architecture helps in the understanding of complex systems, and exhibits the following properties: (1) resiliency to failures,
supports reuse at both the component and architectural level, where the gossiping protocol always converges irrespective
indicates the major components to be developed together with of node failures, (2) global consistency among participating
their relationships and constraints, exposes changeability of nodes, where all nodes converge to the same architectural
the system, and allows for verification and validation of the view, and (3) scalability, where the gossiping protocol can
target system at a high level [32]. accomodate systems of increasing size [29]. Accordingly, the
Software architectures have been very influential in scope of our work is the derivation of architectural models
self-adaptive systems, i.e., systems that are capable of at runtime to be used in decentralized decision making for
architecture-based adaptation in large distributed systems. This from which interaction patterns are identified using pattern
effort is part of a larger project, RASS: Resilient Autonomic detection in order to define architectural elements. Yuan and
Software Systems [27], aimed at developing a highly-resilient Malek [35] take a dynamic approach to discovering the ar-
and highly-dependable system, which is capable of making chitectural model of a distributed application by generating a
adaptation decisions in a distributed fashion without cen- component interaction model using data mining. Huang et al.
tralized control. To achieve this goal, each node requires a [15] use the reflective ability of the component framework to
complete model of the system at runtime. discover an up-to-date architecture from the running system.
Thus, the main contribution of this paper is DeSARM: In contrast to the aforementioned, DeSARM is based on a de-
Decentralized Software Architecture discoveRy Mechanism, centralized approach to architecture discovery. This approach
a completely decentralized and automated method and sys- is effective for large distributed systems which pose scalability,
tem for runtime discovery of software architecture models single point of failure and dynamicity concerns, which are not
of distributed systems using gossiping [6][18] and message addressed by the existing methods.
tracing. The information dissemination and relatively fast Also related to our work is architecture-based software
convergence capability of the gossiping protocol aid each node dynamic adaptation, which addresses the dynamic software
in deriving a complete model of the architecture. We also reconfiguration of the architecture model and corresponding
demonstrate experimentally DeSARM properties of resiliency, implementation for the purpose of runtime adaptation and
global consistency, and scalability. It is important to note that evolution (see e.g., [10], [21], [23], [25], [34]).
these properties relate to the architecture discovery process
and not to application failure recovery or adaptation, which III. A SSUMPTIONS
are outside the scope of this paper. This paper makes the following assumptions:
The rest of this paper is organized as follows. Section II 1) If a component fails, it is restarted on the same node
discusses related work. Section III provides some basic as- if the node is still running. This is common in modern
sumptions. Section IV describes the architecture of DeSARM. component-based systems whereby components can be
Section V presents the results of our experiments with De- restarted when they are detected to have failed.
SARM. Section VI provides a discussion on important is- 2) If a node fails, all the components running on that node
sues related to the architecture discovery mechanism. Finally, fail. This assumption is obvious and does not require
Section VII presents some concluding remarks and discusses further comments.
possible future work. 3) If a node cannot be restarted after a failure, its compo-
nents can be moved to other node(s) using an existing
II. R ELATED W ORK
component recovery mechanism not within the scope of
In this paper we use architecture discovery instead of recov- this work. This assumption is based on existing work on
ery as is often found in the literature. Architecture recovery a runtime application recovery mechanism with which
refers to cases where the architecture was known but may have DeSARM is being integrated. See Section VII.
been lost due to erosion or improper documentation [3], [15], 4) Software components can communicate either through
[33]. In contrast, architecture discovery refers to situations in a connection-less transport protocol such as UDP or a
which the architecture was not previously known. This is due connection-oriented protocol such as TCP. Both types
to the fact that the architecture may not have previously existed of communication are common in distributed systems
or may have changed at runtime due to the dynamic nature of and DeSARM supports either.
the system as is the case with dynamic software adaptation. 5) The software architecture is not known because it may
Software architecture discovery approaches can be classi- dynamically change due to churn and failures. As ex-
fied into dynamic, static, and hybrid, i.e., combining both plained in the introduction, this is the focus of this work.
dynamic and static analysis. Some examples of static analysis 6) The software deployment on physical nodes is not
approaches include [24][4][19] and examples of hybrid ap- known. The set of nodes on which the software system is
proaches are [28][30]. However, as the objective of this paper deployed is not assumed to be known and is discovered
is discovering architectural models at runtime, we shall focus by DeSARM.
only on dynamic approaches as follows.
Israr et al. [16] describe SAMEtech, a dynamic approach for IV. S OFTWARE A RCHITECTURE D ISCOVERY M ETHOD
automating the discovery of architecture models and layered This section discusses DeSARM’s architecture. We first
performance models from message trace information. Disco- describe the structure of a node running DeSARM. Then, we
Tect [31] uses a set of pattern recognizers and knowledge give an overview of how gossiping and message tracing are
of the architectural style being implemented to map low- incorporated into the discovery process. We then delve into
level system events into high-level architecturally meaningful the details of the architecture model discovery method.
events. Bojic and Velasevic [3] use test cases that cover rele- A node in a distributed system consists of three layers
vant use-cases, and concept analysis to group system entities according to Fig. 1: application layer, DeSARM layer, and
together that implement similar functionality. Vasconcelos et communication middleware. The application layer consists
al. [33] use specified use-cases to generate execution traces of the distributed application components that communicate
Fig. 2. Details of the DeSARM Layer.

components are aggregated to form an aggregate mes-
sage log (AML) at each node. The AML contains the
interactions between all components and their respective
destination components. This is further merged with the
AMLs from incoming gossip messages received from
other nodes (see below).
• Gossip-based dissemination: This forms the core of our
architecture discovery method and enables the distribu-
tion of AMLs from each node throughout the system.
• Peer node selection: This is achieved through the main-
tenance of a component/node database which is derived
by identifying component IDs and their related node IDs
from incoming and outgoing messages. This ensures that
only nodes running components that are part of the same
Fig. 1. Layered Structure at a Node of a Distributed System. application are selected for dissemination.
• Control: The DeSARM controller manages the execution
of the gossip process by maintaining the timing between
over the network via messaging events. Each component has consecutive rounds.
two logs: message sent log (MSL) and message received log • Architecture discovery: This is used to derive the archi-

(MRL) as shown in Fig. 2. The DeSARM layer forms a tectural model of the system based on the message traces.
wrapper around the communication middleware and provides Finally, the communication middleware provides network ac-
the same interface to the application layer components as the cess allowing the sending and receiving of component-level
communication middleware. DeSARM consists of a number and gossiping messages between nodes.
of modules each providing different functionality (see Fig.2):
Gossip is an epidemic protocol, which due to its simplicity,
• Message logging: All incoming messages are logged be- robustness and flexibility makes it ideal for reliable dissem-
fore being passed to the application layer components and ination of data in large-scale distributed systems. In gossip-
all outgoing messages are logged before being passed to based dissemination, data spreads exponentially fast and takes
the communication middleware. All messages are logged O (logN) rounds to reach all N nodes [18]. The essence
to stable storage. of this approach, which lies at the core of all gossip-based
• Message log aggregation: The MSLs and MRLs of all dissemination approaches, was first introduced in the seminal
!!!!"##$%&#'()'*&+,*'-) &&&&&!"##$%&*'.'$/'*&+,*'-)& • Destination Type: single destination (SD) or multiple
&&&&0%''*&12&&&&&&&&&&&&&&&&&&&&&& &&&&&0%''*&32& destination (MD)
!!!!)"&'/'*4&!"56'&7($+#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!*'.'%5"(&"8&*'97'#+&8*"6&1& • Message Type (mt): request reply (RQ), no-reply re-
!!!!!""!#$%$&'!$(&)*+,$!-*.'+$.! quested (NR), a reply to previous request (RP)
!!!!!!!!!#"$"%&'&()*&&+,-."
• Transaction ID
!!!!!""!-./&$$0!'/!$(&)*+,$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!""!.$&$12$!3$##*,$!
!!!!!!!!!%&/0,#1"234-."""""""""""""""""""""""""""""""""""""""""""""""""""""""""234*"$"+&(&56&7+89,*-." • Message Unique ID (mid)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"! • Return ID: equals 0 if mt 6= RP, otherwise equals “mid”
! ! ! !!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!
!!!!""!4*1'!5/.!.$#-/+#$!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!""!-./&$$0!'/!$(&)*+,$! of original request message
!!!!!!!!234#"$"+&(&56&7+89,#-.""""""""""""""""""""""""""""""""""""""""%&/0,*1"234-." • Source Component: component sending the message
!!!""!3$.,$!678#!'/!,$'!+$4!*,,.$,*'$0!678!!!!!!!!!!!!!""3$.,$!678#!'/!,$'!+$4!*,,.$,*'$0!678!
!!!!!!!"234"$"9&+:&48:%,2341"234#-.&&&&&&&&&&&&&&&&&&&&&&&&&&&234"$"9&+:&48:%,2341"234*-." • Destination Component: component receiving the mes-
!!!!'()&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&'()&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& sage
• Source Node ID: ID of sending node
Fig. 3. Design of the Gossip-based Dissemination Framework • Destination Node ID: ID of receiving node

The MRL and MSL for each component are scanned and
paper by Demers et al. [5] and involves the dissemination of the interaction patterns for these messages are identified. The
data by allowing randomly chosen pairs of nodes to exchange following types of messages are considered: Reply requests
new information. After the exchange, the two nodes forming (RQ), No-reply requests (NR), and Replies (RP).
a pair should have the same information effectively reducing These message types allows us to identify synchronous
the entropy of the system [29]. vs asynchronous interactions. In the former case, since reply
The main elements of the gossip-based dissemination frame- messages are guaranteed to have a request, then the original
work are: (1) peer selection, where a peer (node) selects request reply message and its associated reply message are
another peer uniformly at random from the set of available treated as a single synchronous interaction (SY). If the original
peers, (2) data exchanged, which involves the exchange of message was sent as a unicast (SD) then the tuple (source,
data between peers and is specific to the use of the gossip destination, SY, SD) is added to the AML. Otherwise, if the
mechanism, and (3) data processing, which details how each original message was sent as a multicast (MD), then the tuple
peer handles the information received from other peers and (source, destination, SY, MD) is added to the AML. No-reply
is also specific to the use of the gossip mechanism [18]. requested messages on the other hand are treated as asyn-
DeSARM’s gossip-based dissemination is depicted in Fig. 3 chronous interactions (AS) and added to the AML as (source,
and consists of: destination, AS, SD) if the message was sent as a unicast, or
• Peer selection: A peer P periodically chooses another (source, destination, AS, MD) if the message was multicasted.
peer Q uniformly at random from the set of peers that The AML is treated as a set so only unique tuple entries are
participate in the same application, i.e., DeSARM uses allowed for each component interaction, irrespective of the
selective gossiping. frequency of such interactions in the message logs because
• Data exchanged: The AML is sent from one peer to a software architecture does not consider how many times
another. a certain type of interaction occurred between components.
• Data processing: The received AML is merged with the Further details on the algorithms implemented by DeSARM
local AML at each node to produce an updated AML. can be found in [26].
Once a node detects that there are no further updates to the After each round of gossiping, the updated AML is used to
gossiped AMLs, it assumes that convergence was reached and incrementally discover the architecture, which is represented
a complete software architecture can be derived. as a labeled directed graph. The vertices of this graph cor-
respond to unique component ids and the edges correspond
to unique component interactions. Edges are labeled with the
interaction patterns (SY or AS) and destination types (SD or
MD). Further details on this process can also be found in [26].
To depict how DeSARM works, we use an example ar-
chitecture of a distributed emergency monitoring system (see
Fig. 4). The architecture consists of five types of components
with three instances each of the Monitoring Sensor and Re-
moteSystem Proxy components, and a single instance of the
other components. This example assumes that each component
is assigned to a single node. When referring to components we
mean component instances unless otherwise mentioned. The
Fig. 4. Emergency Monitoring System.
communication patterns within the system are:
Each log entry in the MSL or MRL has the following fields: • Operator Presentation sends synchronous messages with
• Timestamp reply to Alarm Service and Monitoring Data Service.
A. Experiment I
!"#$%"&$#'( /*-"%*(
3.*&4%"&(
)*#+"&( )0+%*-( We use the application described in Fig. 4, whose archi-
1&*+*#%45"#(
,"-."#*#%( 1&"20(
6);():(
6);():(
6);(!:( tecture is known, and show that DeSARM converges exactly
6);():(
to that known architecture. Table I shows the mapping be-
6);():( )<;():( 6);(!:( tween nodes, components, and physical machines for this
)<;():( !"#$%"&$#'( architecture. As discussed above, the known and discovered
674&-(
)*&8$9*(
:4%4(( architectures are represented as graphs; we compare the simi-
)*&8$9*(
larity between the two (the known and current version of the
discovered architecture) over time. For that, we use the graph
Fig. 5. Discovered Architecture Model as Labeled Directed Graph
comparison algorithms proposed in [20][36][9] and a graph
similarity metric that ranges from 0 to 1, where 0 indicates
• Alarm Service and Monitoring Data Service send asyn- no similarity and 1 indicates that the two graphs are identical.
chronous multicast messages to Operator Presentation. The use of graph comparison using the similarity measure is
• Monitoring Sensor and Remote System Proxy send asyn-
only for convergence checking during the experiments and is
chronous unicast messages to Alarm Service and Moni- not part of DeSARM’s implementation. We plot the evolution
toring Data Service. of the similarity metric over time to display the convergence
speed of the discovery mechanism.
The discovered architecture model corresponding to Fig. 4 is
the graph shown in Fig.5. This shows the five component types
TABLE I
of the system and their respective communication patterns as M APPING OF NODES TO COMPONENTS AND PHYSICAL MACHINES .
edge labels, which would have been derived by the architecture
discovery process. Node Software Component Machine
Node 1 Monitoring Sensor Component (MSC) Machine 1
V. D E SARM I MPLEMENTATION AND E XPERIMENTS Node 2 Monitoring Sensor Component 2 (MSC2) Machine 1
Node 3 Monitoring Sensor Component 3 (MSC3) Machine 1
DeSARM was implemented in Java, and was developed by Node 4 Remote System Proxy (RSP) Machine 1
extending an open source implementation of gossip [14] that Node 5 Operator Presentation (OP) Machine 2
manages a list of nodes on a network for cluster membership. Node 6 Alarm Service (ArmS) Machine 2
Node 7 Remote System Proxy 2 (RSP2) Machine 1
We extended the open source gosssip implementation by Node 8 Monitoring Data Service (MDS) Machine 2
adding the core functionality of DeSARM, as described in Node 9 Remote System Proxy 3 (RSP3) Machine 2
Section IV above, as well as incorporated push-pull based
gossip (discussed further below). We emulated a distributed Figure 6 shows how the architecture converges over time
system by implementing each node of the distributed system to the known architecture at each of the nodes shown in
on a different virtual machine (VM) and spread the VMs Table I under a no-failure case. Different nodes converge at
over physical machines connected over a network. The VMs different rates but at time 80 sec all nodes have converged
communicate over TCP/IP so they can be located anywhere on to the correct software architecture. Nodes 6 and 9 are the
the network. The DeSARM implementation is heavily multi- first to converge and node 7 is the last. Note that in our
threaded with different functions of DeSARM implemented as implementation of gossiping, each time a node i sends a gossip
different threads. Some examples of threads include sending message to node j, node j replies with a gossip message.
and receiving of gossip messages, message log aggregation, This way, two nodes will exchange AMLs more often, leading
architecture discovery, component/node database maintenance, to faster convergence. Because of the random nature of peer
and sending and receiving of component messages. All the selection in the gossip protocol, some nodes may gossip more
communication between nodes uses Java sockets. DeSARM often with others, leading to different convergence rates among
modules normally communicate using UDP, however if gossip nodes. Additionally, the convergence rate is affected by the
exchanges become too large, resulting in message fragmenta- communication pattern among components.
tion in multiple packets, TCP is used to guarantee message Figure 7 shows the evolution of the architecture similarity
integrity. when components fail. As the figure shows, failures start to
Our experiments demonstrate the operation of DeSARM occur after the first 80 sec, before convergence was achieved.
and assess its convergence and the number of messages In fact, the value of the similarity metric was equal to 0.9375
exchanged by the DeSARM middleware. Two sets of exper- (i.e., < 1) at all nodes at t = 80 sec. The failure probability of
iments were completed. In the first experiment, two types each component, while processing, is set at 20% (a relatively
of tests were conducted. In the first test, there were no high value) and the average component down time is set at
component/node failures. In the second test, we added random 180 seconds. Thus, at approximately t = 260 sec, the failed
failures with subsequent recovery for each of the components. components will start to recover from the failure and resume
This second test reveals the impact of failures on the con- sending messages. DeSARM automatically resumes its mes-
vergence of DeSARM to the final architecture. In the second sage collection and gossiping of newly updated AMLs when
experiment, the scalability of the mechanism is examined. components start to recover. When that happens, convergence
$"!#
TABLE III
N UMBER OF GOSSIP MESSAGES SENT PER NODE .
!",#
!"#$%&'#&("')*%+%,-"%&.)/01)2-%,("'34)

!"+# Sent Received Sent Received
!"*# (no failures) (no failures) (failures) (failures)
!")# 1 24 14 61 36
!"(# 2 22 15 59 52
)

3 20 22 63 50
!"'#
4 22 14 60 55
!"&# 5 21 19 39 54
!"%# 6 22 21 60 68
!"$# 7 25 11 65 69
!"!#
8 22 32 50 59
$!# %!# &!# '!# (!# )!# *!# +!# 9 21 21 63 59
5%+')/3'#4) Avg. 22 19 58 56
-./0#$# -./0#%# -./0#&# -./0#'# -./0#(#
-./0#)# -./0#*# -./0#+# -./0#,#

an event log (not part of DeSARM) during the experiments.
Fig. 6. Architecture similarity at the 9 nodes as a function of time with no Entries in these logs are timestamped in nanoseconds and cor-
failures. respond to events such as sending application-level messages,
sending/receiving DeSARM gossip messages, and computing
$"!# architecture similarity metrics. At the end of the experiment,
!",# the event logs of all nodes were sort-merged offline to produce
!"#$%&'#&("')*%+%,-"%&.)/0%&$)1-%,("'23)

!"+# a single log. Figure 8 shows a few excerpts of this log. The first
!"*# two entries of this log show application-level messages sent
!")# by component MSC at Node 1 to components ArmS (at Node
!"(# 6) and MDS (at Node 8). The next two entries correspond
!"'# to similar messages sent by MSC2 at Node 2, to components
!"&# ArmS and MDS. Later in time, DeSARM at Node 1 sends a
!"%# gossip message to Node 8 with Node 1’s current view of the
!"$# AML, namely [(MSC,ArmS,AS,SD), (MSC,MDS,AS,SD)].
!"!# This view only reflects the messages that component MSC
!# (!# $!!# $(!# %!!# %(!# &!!# &(!# '!!# at Node 1 sent to nodes 6 and 8. Later in time, De-
4%+')/2'#3)
-./0#$# -./0#%# -./0#&# -./0#'# -./0#(# SARM at Node 8 receives the following gossip message from
-./0#)# -./0#*# -./0#+# -./0#,# Node 4: [(RSP,ArmS,AS,SD), (RSP, MDS,AS, SD), (MDS,
OP, AS, MD), (MSC3,MDS,AS,SD), (MSC2,MDS,AS,SD),
Fig. 7. Architecture similarity at the 9 nodes as a function of time with (ArmS,OP,AS,MD), (MSC3, ArmS, AS, SD), (OP, ArmS, SY,
component failures. SD), (MSC2, ArmS, AS, SD)]. As a result, Node 8’s similarity
metric becomes 0.6875. Later in time, Node 8 receives a gossip
message from Node 9 with an AML that reflects Node 9’s
is achieved as illustrated in Table II, which shows the times at current view of the architecture. This AML is aggregated with
which nodes 1-9 converge after failure recovery. As shown in Node 8’s AML resulting in an AML that reflects the entire
Table II, node convergence times are spread between t = 340 software architecture. When the similarity metric for Node 8
sec and t = 400 sec. is next computed, it shows a value of 1, indicating convergence
Figures 6 and 7 are representative of similar results we at Node 8.
obtained in other experiments.
B. Experiment II
TABLE II To test the scalability of the approach we tested DeSARM
C ONVERGENCE TIME OF NODES 1-9 UNDER COMPONENT FAILURES .
on Argo [2], a high performance computing cluster operated
1 2 3 4 5 6 7 8 9 by the Office of Research Computing at George Mason
390 380 390 360 340 340 400 340 380 University. For this purpose, we put together a synthetic
application with 30 components, each one of them residing
Table III shows the number of gossip messages sent and on a different node of the research cluster. Some components
received per node until convergence is achieved in the case have a synchronous communication interface only, sending
when failures do not occur and in the case when failures occur. and receiving only synchronous messages, some have an
As the table shows, the average number of sent and received asynchronous communication interface only, sending and re-
gossip messages when no failures occur is almost 1/3 of the ceiving only asynchronous messages, while others comprise
corresponding number when there are failures. both synchronous and asynchronous interfaces, sending syn-
For illustration and debugging purposes, each node collected chronous messages and receiving asynchronous messages or
9004AC3S1.(4-;-4(
!"#$%&!#'"#&%&#()*+(,-./(01,/2343567(6-,,38-(/1(956*(
6-,,38-,(
!"#$%&!#'%#:"&:()*+(,-./(01,/2-;-./7(6-,,38-(/1()<*(
!"#$%#%=&:!#:'!()*+=(,-./(01,/2343567(6-,,38-(/1(956*(
!"#$%#%=&:%=>>#()*+=(,-./(01,/2-;-./7(6-,,38-(/1()<*(
!"#$#''#$>!$%!=(?@-(,A6A435A/B(6-/5AC(3/(D1E-(!(A,F(>G!=&(
HHHHH(
!"#">'!#$=>:%=:(D1E-(!(,-./(81,,A0(6-,,38-(/1F(D1E-(%(IA/@(9)JF(K2)*+L956*L9*L*<7L(
2)*+L)<*L9*L*<7M(
(
T1,,A0(6-,,38-(
( N516(.1E-(:(
HHHHH(
!"#":%'!=&":'$'(D1E-(%(5-C-A;-E(81,,A0(6-,,38-(N516F(D1E-(:(IA/@(9)J(K2O*PL956*L9*L*<7L(
- : IA/@ 9)J K2O*P 956* 9* *<
2O*PL)<*L9*L*<7L(2)<*LQPL9*L)<7L(2)*+'L)<*L9*L*<7L(2)*+=L)<*L9*L*<7L(2956*LQPL9*L)<7L(
2)*+'L956*L9*L*<7L(2QPL956*L*RL*<7L(2)*+=L956*L9*L*<7M(
HHHHH( T1,,A0(6-,,38-(
!"#"&':%=!:&'&$(?@-(,A6A435A/B(6-/5AC(3/(D1E-(%(A,F(>G$%"&( N516(.1E-(#((
(
HHHHH(
!"#"&'&:%$'#>>'(D1E-(%(5-C-A;-E(81,,A0(5-,01.,-(6-,,38-(N516F(D1E-(#(IA/@(9)J(
K2O*P'L956*L9*L*<7L(2O*P'L)<*L9*L*<7L(2)*+'L956*L9*L*<7L(2)*+'L)<*L9*L*<7L(2)*+=L956*L9*L*<7L(
2)*+=L)<*L9*L*<7L(2)*+L956*L9*L*<7L(2)*+L)<*L9*L*<7L(2O*P=L956*L9*L*<7L(2O*P=L)<*L9*L*<7L(
2956*LQPL9*L)<7L(2QPL956*L*RL*<7L(2O*PL956*L9*L*<7M(
HHHHH( D1E-(%(C1.;-58-E(
!"#"$':%:#%'">$(?@-(,A6A435A/B(6-/5AC(3/(D1E-(%(A,F(!G>(

Fig. 8. Excerpts of event trace. Components: Monitoring Sensor (MSC, MSC2, and MSC3), Remote System Proxy (RSP, RSP2, and RSP3), Operator
Presentation (OP), Alarm Service (ArmS), and Monitoring DataService (MDS).

vice versa. All communication between components take place The first issue we address is whether DeSARM’s selec-
at a random time interval to emulate local processing between tive gossiping mechanism always converges, i.e., if there
message exchanges. is the possibility of different nodes converging to different
Each component communicates with only two other com- discovered architectural models. To address this issue and to
ponents so that the initial AML at each node is a very small ensure convergence, we employed anti-entropy based gossip
fraction of the complete AML. Therefore, the value of the in DeSARM. In anti-entropy gossip, the gossiping protocol
similarity metric is very small to begin with. It starts at zero in runs indefinitely albeit at specified regular intervals [29]; this
some cases, when components on a node send a synchronous means that all nodes continuously make gossip exchanges. Be-
message to another component and have to wait for the reply. cause gossiping never stops, updates will reach all nodes and
This way the DeSARM instance at each node would require the system will always eventually converge. However, such
more information to derive the complete architectural model a mechanism can result in unnecessary message exchanges
than in the previous experiment. especially if no architectural changes have taken place within
The convergence time at the 30 nodes varied significantly the system. This problem is exacerbated in large distributed
due to the randomness in message exchange. The node that systems. To address this issue we adjusted the protocol to
took the longest time to converge converged in 260 sec (i.e., gradually decrease the gossip rate δ when no new updates
4.3 minutes) as shown in Table IV. This table shows the to the AML have been detected over a period of time. Then,
progression of the similarity metric over time. The table also when an update is received, the original gossip rate is restored
shows that the rate of convergence, roughly defined as the to speedup convergence.
increase in convergence over time is slower at the beginning
and faster at the end. For example, after 58% of the time, the The second issue is the relationship between gossip conver-
similarity metric has only achieved the value of 0.27. At 85% gence speed and the gossip rate δ. A higher value of δ implies
of the time, the similarity metrics achieved 0.73. While the faster convergence at the cost of higher message overhead.
slowest node to converge took over four minutes, the fastest Other factors such as system size, component communication
took 30 seconds. patterns, and gossip mode (see below) also affect convergence
speed. For example, convergence speed is generally affected
VI. D ISCUSSION by the communication pattern among components, including
We discuss here some additional issues of interest, some of how often they communicate and the number of components
which are being addressed in ongoing work. they communicate with. Two modes of gossiping can be used:
TABLE IV
S LOWEST CONVEGENCE RATE IN THE 30- NODE EXPERIMENT.

Time (sec) 0-110 120 130 140 150 160-170 180 190-200 210-220 230-250 260
Similarity Metric 0.00 0.10 0.13 0.20 0.27 0.30 0.47 0.57 0.73 0.97 1.00

push gossip (where a peer sends a message but does not to failures. These properties were demonstrated through vari-
receive one in return) vs. push-pull gossip (where a peer sends ous experiments, with and without component failures. These
a message and receives one in return). Our implementation experiments assessed the rate of convergence of the DeSARM
of DeSARM uses push-pull gossip, which accelerates conver- nodes towards the software architecture being discovered.
gence. These experiments showed that DeSARM is resilient and
The third issue is the implication of dynamic architectural is able to discover the architecture even in the presence of
changes during gossiping and how it affects convergence. We failures, albeit at a lower pace than the one when no failures
are currenty addressing this issue through the integration of occur. DeSARM was implemented in Java using a multi-
DeSARM with an application recovery layer that runs above threaded architecture.
DeSARM and provides DeSARM with updates to events that In future work we plan to examine DeSARM’s capability
cause architectural changes. Examples of such events include of discovering architectures that change over time as in the
component failures as well as component removal due to adap- case of dynamic software adaptation [13], [12]. This work will
tation. When such events occur, this layer informs DeSARM examine different recovery and adaptation models as well as
of the affected component(s). Due to the decentralized nature dynamic components, i.e., components that change their com-
of DeSARM, this update can be received at any node. Upon munication patterns at runtime based on their state or phase
receipt, the node will remove from its AML all entries related of execution, and the impact on DeSARM. Also of interest,
to the removed component(s) that contain the component(s) is addressing the mechanism’s capacity for discovering the
as a sender or receiver and use gossip to propogate these full behavioral architecture of a system such as identifying
deletions to other nodes within the system. Note that archi- component interaction protocols.
tectural changes that result in new components being added In the current experiments, we used a fixed architecture to
are handled automatically by DeSARM as described here. illustrate how DeSARM converges with and without failures
The fourth issue is the overhead of the gossip mechanism and how it operates equally well with two or thirty physical
in relation to both network traffic as well as latency in the machines. In the future, we will run experiments in which a va-
sending and receiving of component messages. DeSARM has riety of software architectures are generated following certain
very little network overhead for two reasons: the size of the distributions for parameters such as number of components
AMLs and its use of selective gossiping. As the AMLs contain and communication pattern types.
only unique tuple entries of component interactions, the size As mentioned previously, DeSARM is being integrated
of the AMLs are kept small. This in effect reduces the overall with an existing application recovery mechanism (ARM) that
bandwidth needed for gossip exchanges. Also, as mentioned allows for the runtime recovery of component(s) due to node
earlier, selective gossiping allows for the containment of failure [1]. Through the peer sampling capability of gossip
network traffic to only those nodes that are part of the same [18], DeSARM is able to identify suspected node failures and
application which further reduces the communication overhead alert the ARM. Once the node failure has been verified, the
of the mechanism. With respect to the latency of component ARM proceeds to recover the failed component(s) to a new
messaging, DeSARM can keep it at a minimum because, as node and re-instantiates them based on the current architecture
previously mentioned, all major functionality is separated into provided by DeSARM. As a result of the decentralized nature
different threads. This allows for all other processing to take of DeSARM, component recovery can be initiated from any
place in the background thus reducing any major impact the node as each maintains a complete model of the architecture.
gossip mechanism will have handling component messages. This integration will allow for the development of applications
that are resilient to failures.
VII. C ONCLUDING R EMARKS ACKNOWLEDGEMENTS
This paper described DeSARM, a completely decentralized This work was partially supported by the AFOSR grant
and automated approach for software architecture discovery FA9550-16-1-0030 and the Office of Research Computing at
of distributed systems based on gossiping and message trac- George Mason University.
ing. Through message tracing, DeSARM is able to identify
important architectural characteristics such as components
and connectors in addition to synchronous and asynchronous
communication patterns. Furthermore, through the use of
gossiping, DeSARM exhibits the properties of scalability,
global consistency among participating nodes, and resiliency
R EFERENCES [18] Kermarrec, Anne-Marie, and Maarten Van Steen. ”Gossiping in dis-
tributed systems.” ACM SIGOPS Operating Systems Review 41.5
[1] Albassam, Emad, et al. “Model-based Recovery Connectors for Self-
(2007): 2-7.
Adaptation and Self-Healing.” Proc. 11th Intl. Joint Conf. Software
[19] Koschke, Rainer. “Architecture reconstruction.” Software Engineering.
Technologies (ICSOFT 2016), July 24-26, 2016, Lisbon, Portugal.
Springer Berlin Heidelberg, 2009. pp. 140–173.
[2] http://orc.gmu.edu/research-computing/argo-cluster/
[20] Koutra, Danai, et al. Algorithms for graph similarity and subgraph
[3] Bojic, Dragan, and Dusan Velasevic. “A use-case driven method of ar-
matching. Technical Report Carnegie-Mellon-University, 2011.
chitecture recovery for program understanding and reuse reengineering.”
[21] Kramer, Jeff, and Jeff Magee. “Self-managed systems: an architectural
IEEE Conf. Software Maintenance and Reengineerig, 2000, pp. 23–32.
challenge.” Future of Software Engineering, 2007. FOSE’07. IEEE,
[4] Corazza, Anna, Sergio Di Martino, and Giuseppe Scanniello. “A proba-
2007.
bilistic based approach towards software system clustering.” IEEE 14th
[22] Menascé, Daniel A., John Ewing, Hassan Gomaa, Sam Malek, and
Conf. Software Maintenance and Reengineering, 2010, pp. 88-96.
Joao P. Sousa, “A framework for utility-based service-oriented design in
[5] Demers, Alan, et al. “Epidemic algorithms for replicated database
SASSY,” First Joint WOSP/SIPEW Intl. Conf. Performance Engineering,
maintenance.” Proc.Sixth Annual ACM Symp. Principles of distributed
San Jose, CA, USA, January 28-30, 2010.
computing. ACM, 1987.
[23] Menascé, Daniel A., Hassan Gomaa, Sam Malek, and Joao P. Sousa.
[6] Dulman, Stefan, and Eric Pauwels. “Self-Stabilized Fast Gossiping
”SASSY: A framework for self-architecting service-oriented systems.”
Algorithms.” ACM Tr. Autonomous and Adaptive Systems (TAAS), 10.4
Software, IEEE 28.6 (2011): pp. 78–85.
(2015): 29.
[24] Naseem, Rashid, Onaiza Maqbool, and Siraj Muhammad. “Improved
[7] Esfahani, Naeem, Eric Yuan, Kyle R, Canavera and Sam Malek. “In-
similarity measures for software clustering.” , 2011 15th IEEE European
ferring Software Component Interaction Dependencies for Adaptation
Conf. Software Maintenance and Reengineering (CSMR), 2011.
Support.””ACM Tr. Autonomous and Adaptive Systems (TAAS), 10.4 [25] Oreizy, Peyman, et al. “An architecture-based approach to self-adaptive
(2016): 26. software.” IEEE Intelligent systems 14.3 (1999): pp. 54–62.
[8] Ewing, John, and Daniel A. Menascé, “A Meta-controller method [26] Porter, Jason, Daniel A. Menascé, and Hassan Gomaa, “Decentralized
for improving run-time self-architecting in SOA systems,” Proc. 5th Software Architecture Discovery in Distributed Systems.” Technical
ACM/SPEC Intl. Conf. Performance Engineering (ICPE 2014), Dublin, Report GMU-CS-TR-2016-2, Dept. of Computer Science, George Ma-
Ireland, March 23-26, 2014. son University, 2016. (http://cs.gmu.edu/ tr-admin/papers/GMU-CS-TR-
[9] Foggia, Pasquale, Carlo Sansone, and Mario Vento. “A performance 2016-2.pdf).
comparison of five algorithms for graph isomorphism.” Proc. 3rd IAPR [27] RASS Project, http://cs.gmu.edu/˜menasce/rass/
TC-15 Workshop on Graph-based Representations in Pattern Recogni- [28] Riva, Claudio, and Jordi Vidal Rodriguez. “Combining static and dy-
tion. 2001. namic views for architecture reconstruction.” Proc. 6th IEEE European
[10] Garlan, David, et al. “Rainbow: Architecture-based self-adaptation with Conf. Software Maintenance and Reengineering, 2002.
reusable infrastructure.” Computer 37.10 (2004): pp. 46–54. [29] Riviere, Etienne, and Spyros Voulgaris. “Gossip-based networking for
[11] Garlan, David, and Bradley Schmerl. “Using architectural models at internet-scale distributed systems.” E-Technologies: Transformation in a
runtime: Research challenges.” European Workshop on Software Archi- Connected World. Springer Berlin Heidelberg, 2011. pp. 253–284.
tecture. Springer Berlin Heidelberg, 2004. [30] Sartipi, Kamran, and Nima Dezhkam. “An Amalgamated Dynamic and
[12] Gomaa, Hassan, and Koji Hashimoto, 2012. “Dynamic self-adaptation Static Architecture Reconstruction Framework to Control Component
for distributed service-oriented transactions,” Proc. 7th Intl. Symp. Interactions 259.” 14th IEEE Working Conf. Reverse Engineering, 2007.
Softw. Eng. for Adaptive and Self-Managing Systems, SEAMS 12. IEEE [31] Schmerl, Bradley, et al. “Discovering architectures from running sys-
Press, Piscataway, NJ, USA, pp. 11-20. tems.” IEEE Tr. Software Engineering, 32.7 (2006): pp. 454–466.
[13] Gomaa, Hassan, Koki Hashimoto, Minseong Kim, Sam Malek, and [32] Taylor, Richard N., Nenad Medvidovic, and Eric M. Dashofy. Software
Daniel A. Menascé, 2010. “Software adaptation patterns for service- architecture: foundations, theory, and practice. Wiley Publishing, 2009.
oriented architectures,” Proc. 2010 ACM Symp. Applied Computing, [33] Vasconcelos, Aline, and Cludia Werner. “Software architecture recovery
New York, NY, USA, pp. 462469. doi:10.1145/1774088.1774185. based on dynamic analysis.” Brazilian Symp. on Softw. Engineering.
[14] https://code.google.com/archive/p/java-gossip/ 2004.
[15] Huang, Gang, Hong Mei, and Fu-Qing Yang. “Runtime recovery and [34] Weyns, Danny, and Tanvir Ahmad. “Claims and evidence for
manipulation of software architecture of component-based systems.” architecture-based self-adaptation: A systematic literature review.” Soft-
Automated Software Engineering, 13.2 (2006): pp. 257–281. ware Architecture. Springer Berlin Heidelberg, 2013. pp. 249–265.
[16] Israr, Tauseef, Murray Woodside, and Greg Franks. “Interaction tree [35] Yuan, Eric, and Sam Malek. “Mining Software Component Interactions
algorithms to extract effective architecture and layered performance to Detect Security Threats at the Architectural Level.” Proc. 13th
models from traces.” J. Systems and Software 80.4 (2007): pp. 474– Working IEEE/IFIP Conf. Software Architecture. 2016.
492. [36] Zager, Laura A., and George C. Verghese. “Graph similarity scoring and
[17] Kephart, Jeffrey O., and David M. Chess. “The vision of autonomic matching.” Applied mathematics letters 21.1 (2008): pp. 86–94.
computing.” IEEE Computer 36.1 (2003): pp. 41–50.