<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Reconstructing Architectural Models of Software Tools by Runtime Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ian Peake</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Olaf Blech</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lasith Fernando</string-name>
          <email>lasith.fernandog@rmit.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RMIT University</institution>
          ,
          <addr-line>Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a method and initial results on reverse engineering the architecture of monolithic software systems. Our approach is based on analysis of system binaries resulting in a series of models, which are successively refined into a component structure. Our approach comprises the following steps: 1) instrumentation of existing binaries for dynamically generating execution traces at runtime and connected analysis, 2) static inspection of binaries, 3) interpretation using domain knowledge, and 4) identifying component boundaries using software clustering. We motivate a generic method which covers a large class of software systems, and evaluate our method on concrete software tools for industrial automation systems development, focusing on Intel x86 and Microsoft Windows-compatible applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Related Work Published work on architecture reconstruction and related reverse
engineering tasks focussing on derivation of component candidates and inter-dependencies
is covered in existing surveys and overview papers [
        <xref ref-type="bibr" rid="ref15 ref3 ref5">5, 15, 3</xref>
        ]. Two main directions are 1)
based on analysis of source code and 2) based on the analyses or execution of system
binaries. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] a taxonomy of reverse engineering techniques by classifying according to
the artefacts used, and whether analysis is static (based on syntactic analysis of source
or executable) or dynamic (based on running, observing and/or animating the system
itself) is presented. We focus on runtime analysis for architecture derivation (also called
dynamic analysis) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. DiscoTect [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ] is a framework that observes running systems
to reconstruct their architectures. A key feature of DiscoTect is its flexibility to cope
with a range of high architectural styles and a range of possible realizations in
implementations. DiscoTect uses a language: DiscoSTEP to define mappings interpreting
low level system events as more abstract architectural operations, which are formally
defined as coloured Petri Nets. The authors note that such mappings must be provided
by experts with correct domain knowledge. Reconstructing software architecture from
execution traces requires the analysis of the execution traces and the identification of
potential components. Combining potential component candidates into disjunct sets
denoting suggestions for aggregation of components is known as clustering and is an
important step for gaining suggestions on the original and potential future architectures.
The field of clustering for software components has been studied by several authors
including [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] featuring a proposition, [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] featuring the analysis of source code for
component detection, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] studying clustering in the context of software evolution. In
this work we are using the Pin tool [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for binary instrumentation and tracing hints
about architecture. Other well known tools comprise the more heavy weight [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] tool
which does not have native Windows support, but offers a wider range of
instrumentation possibilities potentially resulting in slower code.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>Our method for collecting runtime based architecture information has these steps:1)
We instrument an existing tool such that dynamically loaded libraries and control flow
events are tracked and collated as execution traces. These traces contain information
(e.g. available methods) for dynamically loaded libraries, as well as the order of method
calls. Instrumentation is done on a binary level. 2) The instrumented tool is run and
execution traces are generated. A user interacts with the tool (e.g. editing, simulating,
compiling). All dynamically loaded libraries and method calls (traced by memory address)
and time of invocation are traced. The generated execution traces are further processed
and abstracted. This involves the resolution of traced memory addresses to primitives
such as methods, objects or executables. Calls between methods denote a graph. Here,
each primitive corresponds to a node and the number of distinct caller/callee
combinations in the execution trace is annoted as a weight on a directed edge. We cluster
primitives into candidate components using the LIMBO algorithm (see below). This
gives first candidates for a component architecture. Final clustering is based on
interactions between methods, existing dll structure, analysis of names and knowledge about
reference architectures. 3) The generated data is interpreted by using information from
binaries and the domain, to derive information about the underlying architecture of the
tool. Manual binary inspection and domain knowledge are used to complete
reconstruction.Several tasks are carried out for the runtime-based analysis part of our method:
Usage Scenarios for Runtime Based Evaluation We evaluate our tools with the help of
usage scenarios. These are sequences of user interactions with tools. The component
interactions are then extracted from the generated execution trace in order to gain hints
on architectural details. The idea is to invoke the distinct components of a tool by user
interaction. For example a user may trigger a compilation at a certain time and the
execution trace may show the loading of distinct libraries and the invocation of the
desired methods. We can also compare interaction sequences in order to see if different
tools have a similar way of interacting e.g. with a compilation component.
Evaluating Execution Traces, Clustering and Component Candidate Identification We
aim to generate a graphical view showing a few high-level components and their
interactions. In Windows and similar environments components are most often associated with
distinct executables and libraries (or possibly packages), and their inter-dependencies
(associated with dynamic linking, import or transfer of control between their
respective methods). However a high-level view at such a level is often inappropriate because
the view has either too many or too few executables. We therefore tried to identify
high level component candidates by clustering groups of other programming
languagespecific notions such as classes/object or method. We will use the term primitives to
refer to low-level component categories selected for clustering.</p>
      <p>Software clustering is a long-standing and commonly used method for imposing
abstract, high level structure on an over-detailed view of primitives and their
relationships. For software, a set of low-level components is typically clustered on the basis of
properties such as which other components they call, authorship, or location in source
directories. As shown, clustering may be thought of as partitioning a collection of
objects based on the similarity of their properties. Typically clustering is based on static
analysis, here, we are using clustering based on the dynamic call structure between
primitives observed at runtime.</p>
      <p>
        Figures 1 (a) and (b) depict the clustering for a usage scenario in the open source
tool PLCEdit [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Both take the dynamic call structure between primitives into account
and are generated from the same execution trace file. Primitives of each cluster are
listed in nodes (boxes). The number of calls between primitives are provided as labels
on the edges. The main call direction is given first. Calls in the opposite direction are in
parentheses. The figures exemplify that based on the same execution trace files there are
different possible ways to depict abstract system structure. Arguably, good component
structures are selected based on domain specific knowledge.
      </p>
      <p>
        We use the LIMBO clustering algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. LIMBO is based on a generic method
called Agglomerative Information Bottleneck (AIB). It has been used for the analysis
of large systems across scientific disciplines. LIMBO and the underlying AIB method
are generic in the sense that they operate fundamentally on a set of objects O, a set
of attributes A and relation R O A with non-negative real number weighting
w : O A ! R+ [ ?. In our approach we represent primitives as follows. Each
primitive is modelled both by an object in O and an attribute in A. The weighting w
reflects the number of different ways an object o in O calls a different object o0 in A.
R and w are constructed from the execution traces in an application-dependent way.
LIMBO uses a generic information-theoretic approach as its basis for clustering. First,
weights are modified via a suitable weighting transformation such as TF.IDF, which
transforms weights according their significance (the more rarely held an attribute A is
overall by all objects, and the more frequently by some given object O, the more
sig: NVT QBtColorDemo, : QBtConfigOthers,
: QBtConfigTextViewer, : NVT Editor,
: NVT QBtConfigDialog, : QBtConfigDif Proces ,
: NVT QBiConfigWidget
      </p>
      <p>16 (0)
: QBtSyntax, : QBtBrowser,
: QBtSeparator, : NVT QBtBrowser,
: QBtSet ings, : QBtWorkspace,
: QBtConfig, : QBtOperator,
: QBtNumeration, : QVector&lt;QBtRange&gt;,
2 (0) : QBtShared, : QVector&lt;QBtMarkerInfo&gt;,
: QBtEvent, : NVT QBtIndicator,
: Dif Dialog, : QVector&lt;QBtDif Info&gt;,
: QBtEventsControl er, : QBtDif Proces ,
: NVT QBtWorkspace, : BtStringTo l,
: QBtDif Info, : QBtIndicator,
: NVT QBtOperator
2 (0)
10 (2)
40 (12)
326 (0)
20 (0)
6 (4)</p>
      <p>1 (0)
: QBtSeparator, : NVT QBtBrowser,
: QBtSet ings, : QBtWorkspace,
: QBtConfig, : QBtOperator,
: QBtNumeration, : QVector&lt;QBtRange&gt;,
: QBtShared, : QVector&lt;QBtMarkerInfo&gt;,
: QBtEvent, : NVT QBtIndicator,
: Dif Dialog, : QVector&lt;QBtDif Info&gt;,
: QBtEventsControl er, : QBtDif Proces ,
: NVT QBtWorkspace, : BtStringTool,
: QBtDif Info, : QBtIndicator,
: NVT QBtOperator
20 (7)
: NVT FindDialog, : Ui_HelpWidget,
: NVT BatchDialog, : Ui_FncDialog,
: NVT UpdateInfo, : Ui_BatchDialog,
: Ui_Ses ionDialog, : Ui_FindReplaceDialog,
: Ui_FBDialog, : NVT HelpWidget,
: NVT FncDialog, : Ui_POUInfoDialog,
: NVT FBDialog, : NVT POUInfoDialog,
: NVT Ses ionManager, : Ui_UpdateDialog
3 (0)
attributes by interpreting the primitive call
about
graph
as
directed
or undirected—That
is, whether two
primitives
which call each
other have
the same value in
both
directions (sum
of the number
of ways they can call each
other)
or possibly-distinct values. In
our work
the call
graph is
interpreted as undirected.</p>
      <p>objects
closest</p>
      <sec id="sec-2-1">
        <title>Additional</title>
      </sec>
      <sec id="sec-2-2">
        <title>Static</title>
        <p>and Domain</p>
      </sec>
      <sec id="sec-2-3">
        <title>Specific Information</title>
        <p>Binaries
like .dll
files
can
encapsulate
multiple components and
provide
hints on development
history. Names and size
of components
can indicate
usage. Binaries
can contain
method names
and
plain
that
hint on component
functionality. A major source
of knowledge in
our
reverse
text
engineering method is
the PLC development domain. For example
we know what
types
of components
to
expect.</p>
        <p>We started with
the following expected components: Source
and
target
code
storage
manage
the
modeling,
storage and exchange
of
source and
target specification models
and code by using a
file
system
or a database. Compilers,</p>
      </sec>
      <sec id="sec-2-4">
        <title>Analyzers and Simulators parse</title>
        <p>specification models and perform operations on them,
like generating
target
code,
interacting
with
a</p>
        <p>GUI component
behaviour
or
properties.</p>
        <p>Editors
manage
the editing
of
models
in
by
order
to</p>
        <p>visualize
the user.</p>
      </sec>
      <sec id="sec-2-5">
        <title>License</title>
        <p>Management and
other
miscellaneous functionality
can be realized
as a separate
component
e.g. that may interact with
a third
party
license
server. The GUI provides a user
interface. It does not have to be realized as a separate component inside the tool, since
existing GUI frameworks can be used.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Analysis and Evaluation</title>
      <p>
        Instrumentation of binaries is done by using the Intel Pin tool [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We instrument the
binaries of our analyzed tools to extract: (i) A list of the loaded binaries and the names of
the methods (called routines) inside these binaries, if available, including their memory
addresses. (ii) A list of control flow operations that occurred during the execution of the
tool, and in particular the source and destination addresses.
      </p>
      <p>
        Case Study Tools We have used our method for analysing the architecture of a mix of
proprietary and open source tools. Tools are designed for performing at least some of
the following operations for the development of PLC software: 1) Editing PLC
specification models, 2) Saving and loading of PLC specification models, 3) Analysing and
compiling PLC specification models, 4) Simulating PLC specification models. We
initially expected that this functionality is provided by distinct software components as
described in the previous section. Open source systems considered included PLCEdit [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
Beremiz [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and MATIEC [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Of these, PLCEdit and MATIEC (also a Beremiz
component) were immediately suitable, consisting of plain binary executables. Beremiz is
written in Python and thus the Pin-based method is not immediately suitable.
Example Runs An Example usage scenario (Section 2) consists of the steps: Start tool;
Create new project file; Add ladder diagram; Invoke editor; Add coil (lamp) and contact
(switch) to ladder diagram, add connections; Save project; Compile and check project;
Close project; Start simulation of saved project; Close tool.
      </p>
      <p>Evaluation and Improvements The method was applied to different PLC development
tools. Execution of the usage scenarios was done manually, while the processing of the
execution traces was done automatically to generate models – one single model for each
usage scenario and number of desired components – comprising component candidates
and their interactions. Clustering based on dynamically linked libraries and
executables did not always provide the right granularity, since several major components are
typically encapsulated in a main executable. Determining the begin and end of entities
like methods or classes in the binaries as a basis for clustering was sometimes
possible. In some cases e.g. due to the use of different programming languages, additional
information on the location of entities for the basis of clustering was provided by the
tool developers and used by us. For example for some applications while symbol table
information is not available in the executable, a “.map” file provides similar
information for debugging purposes. There are a number of possible reasons why a clustering
may not reflect a system’s true architecture, for example there may be insufficient data
in the run time call graph, or architectural anti-patterns may be present. It may be
desirable to associate each component with a meaningful name or feature. As discussed
in clustering literature, this depends on understanding what abstractions (e.g. aspects)
are semantically common to all objects of a component, or the principle abstraction of
the component, which can be difficult. Currently all attributes are structural, derived
from the Pin call graph, however the graph is created by exercising tools using just a
few use case scenarios. There is scope to assign so-called non structural attributes, that
is, properties other than call relationships on which clustering could be based, possibly
based on manual assessment and with input from domain experts. These could pertain
to specific features or aspects such as GUI or safety. For example if several objects are
clearly GUI components, an additional GUI attribute could be assigned to those objects
and taken into account during clustering. There are existing aspect mining approaches
in the literature which may be applicable or adaptable to this purpose.</p>
      <p>The LIMBO algorithm was implemented by us in few hundred lines of Python. This
is supported by additional scripts which process output from our pin plugin.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beremiz</surname>
            <given-names>IDE</given-names>
          </string-name>
          .
          <source>Version 1.1 RC3</source>
          .
          <article-title>Downloaded from beremiz</article-title>
          .
          <source>org (July</source>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Periklis</given-names>
            <surname>Andritsos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vassilios</given-names>
            <surname>Tzerpos</surname>
          </string-name>
          .
          <article-title>Information-Theoretic Software Clustering</article-title>
          .
          <source>In IEEE Trans. on Software Eng.</source>
          ,
          <volume>31</volume>
          (
          <issue>2</issue>
          ):
          <fpage>150</fpage>
          -
          <lpage>165</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Gerardo</given-names>
            <surname>Canfora</surname>
          </string-name>
          , Massimiliano Di Penta, Luigi Cerulo:
          <article-title>Achievements and challenges in software reverse engineering</article-title>
          .
          <source>Commun. ACM</source>
          <volume>54</volume>
          (
          <issue>4</issue>
          ):
          <fpage>142</fpage>
          -
          <lpage>151</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chi-Keung</surname>
            <given-names>Luk</given-names>
          </string-name>
          , Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi,
          <string-name>
            <given-names>Kim</given-names>
            <surname>Hazelwood</surname>
          </string-name>
          .
          <article-title>Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. Programming Language Design and Implementation (PLDI</article-title>
          ), Chicago, IL (
          <year>June 2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Elliot</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chikofsky</surname>
          </string-name>
          ,
          <string-name>
            <surname>James H. Cross</surname>
            <given-names>II</given-names>
          </string-name>
          :
          <article-title>Reverse Engineering and Design Recovery:A Taxonomy</article-title>
          .
          <source>IEEE Software</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>13</fpage>
          -
          <lpage>17</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. CoDeSys - industrial IEC 61131
          <article-title>-3 PLC programming: www</article-title>
          .codesys.com
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Bas</given-names>
            <surname>Cornelissen</surname>
          </string-name>
          , Andy Zaidman, Arie van Deursen,
          <string-name>
            <surname>Leon Moonen</surname>
          </string-name>
          , Rainer Koschke:
          <article-title>A Systematic Survey of Program Comprehension through Dynamic Analysis</article-title>
          .
          <source>IEEE Trans. Software Eng</source>
          .
          <volume>35</volume>
          (
          <issue>5</issue>
          ):
          <fpage>684</fpage>
          -
          <lpage>702</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <source>4DIAC IDE. Version 1</source>
          .3: fordiac.
          <source>org (Accessed July</source>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Rainer</given-names>
            <surname>Koschke</surname>
          </string-name>
          .
          <article-title>Atomic architectural component recovery for program understanding and evolution</article-title>
          .
          <source>Software Maintenance</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Chung-Horng Lung</surname>
          </string-name>
          .
          <source>Software Architecture Recovery and Restructuring through Clustering Techniques. 3rd International Software Architecture Workshop</source>
          (ISAW):
          <fpage>101</fpage>
          -
          <lpage>104</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>MATIEC compiler. Source from bitbucket</article-title>
          .org/mjsousa/matiec (July
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Brian S. Mitchell, Spiros Mancoridis.
          <article-title>Comparing the decompositions produced by software clustering algorithms using similarity measurements</article-title>
          .
          <source>Software Maintenance</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Nicholas Nethercote and
          <string-name>
            <given-names>Julian</given-names>
            <surname>Seward</surname>
          </string-name>
          .
          <article-title>Valgrind: a framework for heavyweight dynamic binary instrumentation</article-title>
          .
          <source>ACM SIGPLAN conference on Programming language design and implementation</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. PLCEdit Editor.
          <source>Version 2.1.1. Downloaded from www.plcedit.org (July</source>
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Damien</surname>
            <given-names>Pollet</given-names>
          </string-name>
          , Ste´phane Ducasse, Lo¨ıc Poyet, Ilham Alloui,
          <string-name>
            <surname>Sorana</surname>
            <given-names>Cˆımpan</given-names>
          </string-name>
          , He´rve Verjus:
          <article-title>Towards A Process-Oriented Software Architecture Reconstruction Taxonomy</article-title>
          .
          <source>Conference on Software Maintenance and Reengineering:</source>
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Hong</surname>
            <given-names>Yan</given-names>
          </string-name>
          , David Garlan,
          <string-name>
            <surname>Bradley R. Schmerl</surname>
          </string-name>
          , Jonathan Aldrich, Rick Kazman.
          <article-title>DiscoTect: A System for Discovering Architectures from Running Systems</article-title>
          . International Conference on Software Engineering:
          <fpage>470</fpage>
          -
          <lpage>479</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Bradley</surname>
            <given-names>Schmerl</given-names>
          </string-name>
          , Jonathan Aldrich, David Garlan,
          <string-name>
            <given-names>Rick</given-names>
            <surname>Kazman</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hong</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <article-title>Discovering Architectures from Running Systems</article-title>
          .
          <source>IEEE Trans. Software Eng</source>
          .
          <volume>32</volume>
          (
          <issue>7</issue>
          ):
          <fpage>454</fpage>
          -
          <lpage>466</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>