Towards Reconstructing Architectural Models of Software Tools by Runtime Analysis Ian Peake, Jan Olaf Blech, Lasith Fernando RMIT University, Melbourne, Australia {ian.peake, janolaf.blech,lasith.fernando}@rmit.edu.au Abstract. We present a method and initial results on reverse engineering the ar- chitecture of monolithic software systems. Our approach is based on analysis of system binaries resulting in a series of models, which are successively refined into a component structure. Our approach comprises the following steps: 1) in- strumentation of existing binaries for dynamically generating execution traces at runtime and connected analysis, 2) static inspection of binaries, 3) interpre- tation using domain knowledge, and 4) identifying component boundaries using software clustering. We motivate a generic method which covers a large class of software systems, and evaluate our method on concrete software tools for in- dustrial automation systems development, focusing on Intel x86 and Microsoft Windows-compatible applications. 1 Introduction We present an architectural reverse engineering approach. Instead of solely analysing binaries statically, we perform analysis at runtime thereby taking into account runtime dependencies between entities. This detailed dependency information denotes an ab- stract system model. Clustering is used to identify candidates for a component-based software architecture view suitable for human understanding. Additional information from binary inspection and domain specific knowledge is used to select an architectural system model. Furthermore, in the experimental part of this paper, we apply our method to tools for distributed industrial automation programmable logic control (PLC) spec- ification and configuration. Such tools typically support specifications compliant with the IEC 61131–3 or IEC 61499 standards (e.g. CoDeSys [6] and 4DIAC [8], respec- tively). Restriction to a particular domain gives us information for calibrating our anal- ysis. In this paper our novel contributions are as follows: 1) A suggested architecture reverse engineering method based on runtime instrumentation, automated clustering, hand-inspection of binaries, and domain knowledge. 2) Tailoring this method for Intel x86, in particular Microsoft Windows. 3) A case-study applying our method to PLC specification and configuration tools. Related Work Published work on architecture reconstruction and related reverse engi- neering tasks focussing on derivation of component candidates and inter-dependencies is covered in existing surveys and overview papers [5, 15, 3]. Two main directions are 1) based on analysis of source code and 2) based on the analyses or execution of system bi- naries. In [3] a taxonomy of reverse engineering techniques by classifying according to the artefacts used, and whether analysis is static (based on syntactic analysis of source or executable) or dynamic (based on running, observing and/or animating the system itself) is presented. We focus on runtime analysis for architecture derivation (also called dynamic analysis) [7]. DiscoTect [16, 17] is a framework that observes running systems to reconstruct their architectures. A key feature of DiscoTect is its flexibility to cope with a range of high architectural styles and a range of possible realizations in im- plementations. DiscoTect uses a language: DiscoSTEP to define mappings interpreting low level system events as more abstract architectural operations, which are formally defined as coloured Petri Nets. The authors note that such mappings must be provided by experts with correct domain knowledge. Reconstructing software architecture from execution traces requires the analysis of the execution traces and the identification of potential components. Combining potential component candidates into disjunct sets de- noting suggestions for aggregation of components is known as clustering and is an im- portant step for gaining suggestions on the original and potential future architectures. The field of clustering for software components has been studied by several authors including [10] featuring a proposition, [12] featuring the analysis of source code for component detection, [9] studying clustering in the context of software evolution. In this work we are using the Pin tool [4] for binary instrumentation and tracing hints about architecture. Other well known tools comprise the more heavy weight [13] tool which does not have native Windows support, but offers a wider range of instrumenta- tion possibilities potentially resulting in slower code. 2 Our Approach Our method for collecting runtime based architecture information has these steps:1) We instrument an existing tool such that dynamically loaded libraries and control flow events are tracked and collated as execution traces. These traces contain information (e.g. available methods) for dynamically loaded libraries, as well as the order of method calls. Instrumentation is done on a binary level. 2) The instrumented tool is run and exe- cution traces are generated. A user interacts with the tool (e.g. editing, simulating, com- piling). All dynamically loaded libraries and method calls (traced by memory address) and time of invocation are traced. The generated execution traces are further processed and abstracted. This involves the resolution of traced memory addresses to primitives such as methods, objects or executables. Calls between methods denote a graph. Here, each primitive corresponds to a node and the number of distinct caller/callee combi- nations in the execution trace is annoted as a weight on a directed edge. We cluster primitives into candidate components using the LIMBO algorithm (see below). This gives first candidates for a component architecture. Final clustering is based on interac- tions between methods, existing dll structure, analysis of names and knowledge about reference architectures. 3) The generated data is interpreted by using information from binaries and the domain, to derive information about the underlying architecture of the tool. Manual binary inspection and domain knowledge are used to complete reconstruc- tion.Several tasks are carried out for the runtime-based analysis part of our method: Usage Scenarios for Runtime Based Evaluation We evaluate our tools with the help of usage scenarios. These are sequences of user interactions with tools. The component interactions are then extracted from the generated execution trace in order to gain hints on architectural details. The idea is to invoke the distinct components of a tool by user interaction. For example a user may trigger a compilation at a certain time and the execution trace may show the loading of distinct libraries and the invocation of the desired methods. We can also compare interaction sequences in order to see if different tools have a similar way of interacting e.g. with a compilation component. Evaluating Execution Traces, Clustering and Component Candidate Identification We aim to generate a graphical view showing a few high-level components and their interac- tions. In Windows and similar environments components are most often associated with distinct executables and libraries (or possibly packages), and their inter-dependencies (associated with dynamic linking, import or transfer of control between their respec- tive methods). However a high-level view at such a level is often inappropriate because the view has either too many or too few executables. We therefore tried to identify high level component candidates by clustering groups of other programming language- specific notions such as classes/object or method. We will use the term primitives to refer to low-level component categories selected for clustering. Software clustering is a long-standing and commonly used method for imposing abstract, high level structure on an over-detailed view of primitives and their relation- ships. For software, a set of low-level components is typically clustered on the basis of properties such as which other components they call, authorship, or location in source directories. As shown, clustering may be thought of as partitioning a collection of ob- jects based on the similarity of their properties. Typically clustering is based on static analysis, here, we are using clustering based on the dynamic call structure between primitives observed at runtime. Figures 1 (a) and (b) depict the clustering for a usage scenario in the open source tool PLCEdit [14]. Both take the dynamic call structure between primitives into account and are generated from the same execution trace file. Primitives of each cluster are listed in nodes (boxes). The number of calls between primitives are provided as labels on the edges. The main call direction is given first. Calls in the opposite direction are in parentheses. The figures exemplify that based on the same execution trace files there are different possible ways to depict abstract system structure. Arguably, good component structures are selected based on domain specific knowledge. We use the LIMBO clustering algorithm [2]. LIMBO is based on a generic method called Agglomerative Information Bottleneck (AIB). It has been used for the analysis of large systems across scientific disciplines. LIMBO and the underlying AIB method are generic in the sense that they operate fundamentally on a set of objects O, a set of attributes A and relation R ⊆ O × A with non-negative real number weighting w : O × A → R+ ∪ ⊥. In our approach we represent primitives as follows. Each primitive is modelled both by an object in O and an attribute in A. The weighting w reflects the number of different ways an object o in O calls a different object o0 in A. R and w are constructed from the execution traces in an application-dependent way. LIMBO uses a generic information-theoretic approach as its basis for clustering. First, weights are modified via a suitable weighting transformation such as TF.IDF, which transforms weights according their significance (the more rarely held an attribute A is overall by all objects, and the more frequently by some given object O, the more sig- ::QBtSyntax, ::QBtBrowser, ::QBtSeparator, ::NVT QBtBrowser, ::NVT QBtColorDemo, ::QBtConfigOthers, ::QBtSettings, ::QBtWorkspace, ::NVT FindDialog, ::Ui_HelpWidget, ::QBtConfigTextViewer, ::NVT Editor, ::QBtConfig, ::QBtOperator, ::NVT BatchDialog, ::Ui_FncDialog, ::NVT QBtConfigDialog, ::QBtConfigDiffProcess, ::QBtNumeration, ::QVector, ::NVT UpdateInfo, ::Ui_BatchDialog, ::NVT QBiConfigWidget ::QBtShared, ::QVector, ::Ui_SessionDialog, ::Ui_FindReplaceDialog, ::QBtEvent, ::NVT QBtIndicator, ::Ui_FBDialog, ::NVT HelpWidget, 16 (0) ::DiffDialog, ::QVector, ::NVT FncDialog, ::Ui_POUInfoDialog, ::QBtEventsController, ::QBtDiffProcess, ::NVT FBDialog, ::NVT POUInfoDialog, ::QBtSyntax, ::QBtBrowser, ::QBtSeparator, ::NVT QBtBrowser, ::NVT QBtWorkspace, ::BtStringTool, ::NVT SessionManager, ::Ui_UpdateDialog ::QBtSettings, ::QBtWorkspace, ::QBtDiffInfo, ::QBtIndicator, ::QBtConfig, ::QBtOperator, ::NVT QBtOperator ::QBtNumeration, ::QVector, 2 (0) ::QBtShared, ::QVector, 40 (12) 3 (0) 16 (16) ::QBtEvent, ::NVT QBtIndicator, ::DiffDialog, ::QVector, ::QBtEventsController, ::QBtDiffProcess, ::HelpWidget, ::PrintPrepare, ::NVT QBtWorkspace, ::BtStringTool, ::QList, ::Prototype, ::QBtDiffInfo, ::QBtIndicator, ::TabWidget, ::ImportExport, ::NVT QBtOperator ::FindDialog, ::PrefDialog, ::SessionManager, ::POUInfoDialog, 2 (0) 10 (2) ::PageCtrl, ::Ui_PrefDialog, ::MainWindow, ::BatchDialog, ::FBCallConverter, ::Editor* QObject, ::FncDialog, ::FileViewDialog, ::PrefDialog, ::PrintPrepare, 20 (7) ::NVT FileViewDialog, ::QList, ::BatchDialog, ::QList, ::NVT FileViewDialog, ::Ui_FileViewDialog, ::NVT MainWindow, ::Highlighter_Dec, ::Prototype, ::POUInfoDialog, ::FBCallConverter, ::Editor* QObject, 6 (4) ::NVT PageCtrl, ::TabWidget, 1 (0) ::NVT PageCtrl, ::AboutDialog, ::ImportExport, ::Editor, ::Ui_FileViewDialog, ::Editor, ::PageCtrl, ::Highlighter_Ins, ::UpdateInfo, ::main, ::MainWindow, ::Highlighter_Dec, ::FBDialog, ::NVT PrefDialog, ::PageData ::Highlighter_Ins, ::QtLocalPeer, ::PageData ::QBtConfigDialog, ::NVT QBtConfigOthers, ::qCleanupResources_PLCEdit__dest_class__, ::QBiConfigWidget, 10 (2) 24 (7) 12 (2) 60 (34) ::NVT DiffDialog, ::QBtColorDemo, ::QMutexLocker, ::NVT QBtConfigTextViewer, ::NVT QBtConfigDiffProcess ::NVT Editor, ::QBtLineData, ::QBtConfigOthers, ::global constructors keyed to QBtOperator, ::global constructors keyed to qInitResources_PLCEdit(), ::QBtConfigDiffProcess, ::HelpWidget, ::AboutDialog, ::NVT QBtNumeration, ::_start, ::SessionManager, ::FileViewDialog, ::NVT QBtNumeration, ::QVector, ::NVT QBiConfigWidget, ::global constructors keyed to qInitResources_PLCEdit(), ::QBtRange, ::FBDialog, ::NVT PrefDialog, ::QVector, ::QVector, ::NVT MainWindow, ::QBtConfigTextViewer, ::NVT QBtSeparator, ::Ui_PrefDialog, ::main ::PathFileNameDialog, ::NVT QBtConfigDialog, ::QtLP_Private, ::QtSingleApplication, 8 (8) 4 (4) 2 (2) 326 (0) ::QBtRange, ::NVT QBtColorDemo, ::QBtSaveQuestion, ::__libc_csu_init, ::Ui_BatchDialog, ::NVT BatchDialog, ::qCleanupResources_PLCEdit(), ::NVT AboutDialog, ::QVector