=Paper=
{{Paper
|id=Vol-2361/short7
|storemode=property
|title=A Generic Framework for the Analysis of Heterogeneous Legacy Software Systems
|pdfUrl=https://ceur-ws.org/Vol-2361/short7.pdf
|volume=Vol-2361
|authors=Amir M. Saeidi,Jurriaan Hage,Ravi Khadka,Slinger Jansen
|dblpUrl=https://dblp.org/rec/conf/benevol/SaeidiHKJ18
}}
==A Generic Framework for the Analysis of Heterogeneous Legacy Software Systems==
A Generic Framework for the Analysis of Heterogeneous Legacy Software Systems
Amir M. Saeidi, Jurriaan Hage, Ravi Khadka, Slinger Jansen
Department of Information and Computing Sciences, Utrecht University, The Netherlands
{a.m.saeidi, j.hage, r.khadka, slinger.jansen}@uu.nl
Abstract—The reverse engineering of legacy systems is a process that involves analysis and understanding of the given systems. Some believe that in-depth knowledge of the system is a prerequisite for its analysis, whereas others, ourselves included, argue that only specific knowledge is required on a per-project basis. To support the latter approach, we propose a generic framework that employs the techniques of non-determinism and abstraction to enable us to build tooling for analyzing large systems. As part of the framework, we introduce an extensible imperative procedural language called KERNEL which can be used for constructing an abstract representation of the control flow and data flow of the system. To illustrate its use, we show how the framework can be instantiated to build a use-def graph for a large industrial legacy COBOL and JCL system. We have implemented our framework in a model-driven fashion to facilitate the development of relevant tools. The resulting GELATO tool set can be used within the Eclipse environment.

I. INTRODUCTION

Many companies operate systems that have been developed over a period of many decades. These legacy systems are subject to continuous adaptation and evolution to deal with changing internal and external factors. Many of these systems do not meet the requirements of a maintainable system, mainly due to a lack of documentation and programming structure. Reverse engineering can be employed to create a high-level abstraction of the system and to identify its logical components [1].

There are many challenges to deal with when reverse engineering a large legacy system. First of all, finding a program understanding tool that can deal with the system of interest is almost impossible. On the other hand, implementing a high-quality tool from scratch that can handle the system is a tedious and time-consuming task. Furthermore, the old programming languages used to develop legacy systems tend to suffer from a lack of "singularity" [2] and "elegance" [3], as viewed from the perspective of modern programming languages. We have investigated the use of automatic analysis techniques to provide tool support and help with understanding programs written in these languages.

Program analysis is an automatic analysis technique that can be used as part of reverse engineering [4]. Any deep program analysis starts with a syntactic analyzer parsing syntactic units into what is known as an abstract syntax tree. The tree produced must be annotated with the necessary semantic knowledge by means of a semantic analysis. Although syntactic analysis depends on the grammar of the language for which the analysis needs to be performed, we argue that semantic analysis should be performed independent of the language to be processed (see Section II). This raises two questions that need to be addressed: 1) Is it possible to capture the semantics upfront for all dialects and implementations of the same programming language? 2) How much semantic information is 'necessary' to establish a sound foundation for conducting a particular program analysis?

For a language like COBOL, which comes in various dialects, each of which may have different compiler products, establishing such semantic knowledge upfront is impractical. In short, no single semantics exists! On the other hand, the semantic knowledge required strongly depends on the analysis one wants to perform. For example, a type-based program analysis needs to decorate the data definitions with the appropriate types, whereas a control-based analysis needs to know about control dependencies. Moreover, when dealing with large systems, abstraction is not a choice but a necessity: the analysis techniques need to be precise and scale at the same time.

Lämmel and Verhoef [2] propose a technique in which syntactic tools are constructed and later augmented with semantic knowledge on a per-project basis (demand-driven semantics). We build on this approach by introducing a generic framework that employs 1) non-determinism to compute a sound abstraction of the control flow of the program, and 2) abstraction to compute a particular program analysis with respect to only as much semantic information as is required. To realize these features, the framework consists of an extensible intermediate language that helps achieve a separation between the abstraction of the problem and the data flow analysis. This separation provides the context for an incremental approach to analyzing large software systems.

The paper makes the following contributions:

1) It presents a generic framework for performing program analysis on legacy systems that can be instantiated in a system-specific fashion.
2) It employs techniques from MDE to facilitate the analysis of legacy systems and construct the required reverse engineering tools.

This paper is structured as follows. In Section II we outline the challenges we have faced in dealing with our industrial legacy system, and describe the generic framework to overcome the stated problems. We proceed by giving an empirical evaluation of our framework in Section III. Finally, in Section IV we conclude and outline future work.
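The interplay of the two techniques just mentioned can be sketched in a few lines of code. The toy reaching-definitions solver below is purely illustrative: the control-flow graph, the node names, and the encoding of facts are invented here, and the sketch is far simpler than the KERNEL-based pipeline described in Section II. Its only purpose is to show how joining facts over all non-deterministic successors keeps an analysis sound when the exact control flow cannot be derived.

```python
# Toy reaching-definitions analysis over a tiny control-flow graph.
# Illustrative sketch only: the graph and node names are invented
# here and are not part of the paper's KERNEL language.
#
# Each node may define a variable; 'succs' lists control-flow
# successors. Node "branch" has two successors because we model an
# operation whose semantics is unknown: non-determinism keeps the
# analysis sound by exploring both outcomes.
CFG = {
    "entry":  {"defs": {("x", "entry")}, "succs": ["branch"]},
    "branch": {"defs": set(),            "succs": ["thenN", "elseN"]},
    "thenN":  {"defs": {("x", "thenN")}, "succs": ["exit"]},
    "elseN":  {"defs": set(),            "succs": ["exit"]},
    "exit":   {"defs": set(),            "succs": []},
}

def reaching_definitions(cfg):
    """Iterative work-list solver for a monotone framework instance.

    A fact (var, node) reaches a point if some path from the entry
    passes through node's definition of var without a later
    redefinition of var."""
    IN = {n: set() for n in cfg}
    OUT = {n: set() for n in cfg}
    worklist = list(cfg)
    while worklist:
        n = worklist.pop()
        node = cfg[n]
        # Transfer function: gen = local defs,
        # kill = all other definitions of the same variables.
        killed = {v for (v, _) in node["defs"]}
        out = node["defs"] | {(v, s) for (v, s) in IN[n] if v not in killed}
        if out != OUT[n]:
            OUT[n] = out
            for s in node["succs"]:
                IN[s] = IN[s] | out   # join over all predecessors
                worklist.append(s)
    return IN, OUT

IN, OUT = reaching_definitions(CFG)
print(sorted(IN["exit"]))
# → [('x', 'entry'), ('x', 'thenN')]
```

Running the sketch reports that both the definition of x at the entry and the one in the then-branch reach the exit: because the analysis does not know which successor of the unknown branch is taken, it conservatively keeps both, at the cost of some precision.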
II. A GENERIC FRAMEWORK

We were involved in a legacy-to-SOA migration project at a large banking institution in the Netherlands, comprising five distinct legacy systems. Like many business-critical systems, their systems are implemented in COBOL and run on platforms such as IBM z/OS and HP Tandem NonStop. We have proposed a method [5] for migrating legacy systems to SOA which involves identifying candidate services followed by concept slicing to extract relevant pieces of code. To evaluate our methodology, we have been given access to one of their legacy systems, which from now on we will refer to as InterestCalculation. As is the case with most legacy systems, the documentation of the InterestCalculation system is outdated and many of the people who were involved in its development are not around anymore. We want to apply techniques from the field of program analysis to help with both the identification of services and slicing.

There are three important issues that need to be addressed when performing program analysis on legacy systems. First of all, many legacy systems are heterogeneous and constitute multi-language applications. For instance, systems implemented for IBM mainframes usually employ JCL job units to describe the different task routines that need to be performed within the legacy environment. Furthermore, COBOL has several extensions to provide support for embedded languages such as SQL and CICS, which are used to perform queries on tables and to process customer transactions, respectively. This also holds for our InterestCalculation system, which comprises COBOL sources and copybooks as well as JCL jobs, the former of which contain embedded SQL statements.

Second, programming languages used for legacy systems do not follow an explicitly defined language standard. In languages like COBOL and C, the semantics of many operations are left open and the implementation must choose how to implement these operations. Furthermore, instances of a given programming language may be home-brewed. It is estimated that there are about 300 COBOL dialects, each of which has its own compiler products with many patch levels [2]. Consequently, the only possible way to deal with inconsistencies is to rely on the compiler used to compile the system that is subject to analysis.

Finally, legacy systems contain code bases that run well into millions of lines of code, hence the scalability of any program analysis technique is essential. The InterestCalculation system consists of almost half a MLoC of COBOL source files and copybooks. Developing analysis techniques that are simultaneously precise and scalable is not a simple task.

To overcome the aforementioned problems, we introduce a generic framework for analyzing large legacy systems which has the following three features:

1) Language/Dialect Independence: We strongly believe that standardization through conversion to a well-understood syntactic structure with semantic variation points is the key to analyzing different dialects and versions of the COBOL language, and it naturally paves the way for heterogeneous systems comprising COBOL and JCL.
2) Abstraction and Non-determinism: Semantic analysis needs to be performed in a context-specific manner. We borrow concepts from programming language theory, including non-determinism and abstraction, to create an environment through which semantic knowledge can be added to the system of interest. Non-determinism guarantees the soundness of the analysis by exploring all possible variations at the cost of performance, whereas abstraction ensures that only a minimal amount of information is stored to perform a sound analysis.
3) Incrementality: Incrementality is key to building analysis tools that scale to large systems. The separation of problem specification (abstraction) and data flow analysis is the way forward for incremental analysis. In this approach, the framework can be re-instantiated with the new information obtained from the result of an analysis to perform more fine-grained analyses.

To realize the above properties, the framework consists of an extensible intermediate language called KERNEL. KERNEL employs non-determinism to capture semantic variation points at the control flow level. Furthermore, it provides extension points to extend the language to incorporate the abstractions required to compute a particular data flow analysis.

Figure 1 depicts the step-by-step approach to instantiating the framework for performing a particular data flow analysis.

Fig. 1. The Generic Framework for Analyzing Legacy Systems

The first step involves syntactic abstraction (parsing) of the source program into an AST. In the next step, an abstract (static) semantics is created based on the concrete or abstract programming languages that the program has to conform to, irrespective of whether those are different dialects/implementations of the same programming language or different languages that the program is written in.

Depending on the data flow problem we are interested in, the deployed abstraction techniques ensure that enough information is stored to perform a sound analysis with respect to that problem. For instance, the reaching definition analysis used to build the use-def graph is expressed as an abstract interpretation of the program which, for each expression in the program, infers whether a variable is definitely used or defined after the possible execution of the statement. To help with the formulation of this abstraction, extension points are provided in the KERNEL language to instantiate the abstract domain for a set of analysis problems.

Based on the extracted abstract semantics, a mapping is created from the syntactic structure of the source program into an instance of the KERNEL language. The use of non-determinism makes it possible to encode inconsistencies amongst different implementations, as well as points where the particular semantics cannot be derived, e.g. when no knowledge of the compiler is available.

In the next step, we specify the data flow analysis problem as a monotone framework instance [4] and solve the instance using an iterative work-list algorithm. The monotone framework consists of a set of monotonic transfer functions which
express the effect of statements on the desired properties of the program with respect to a flow analysis problem. Many flow analysis problems such as reaching definition analysis (RDA) meet the monotonicity requirement and can be expressed in terms of the monotone framework. Based on the data flow analysis problem, here RDA, we give a set of transfer functions and data flow equations to instantiate the monotone framework.

The results of data flow analysis can be reused to incrementally analyze a legacy system. The result of one analysis serves as a foundation to conduct more fine-grained analyses. To demonstrate this, consider the inter-procedural dependencies derived from the RDA analysis. Once we have constructed the information chain, we can interactively scope down our analysis to a smaller set of modules to perform much more detailed analyses that, because of their resource demands, cannot be applied to the system as a whole. For more information about the KERNEL language as well as details of the analysis, please refer to Saeidi et al. [6].

We have implemented the framework as part of the GELATO (Generic Language Tools) toolset [7]. GELATO is an integrated set of language-independent (generic) tools for legacy software system modernization, including parsers, analyzers, transformers, visualizers and pretty printers for different programming languages, including COBOL and JCL.

III. EVALUATION

A. Example Case Study

Here we give an example COBOL program that is representative of our InterestCalculation system, depicted in Listing 1. The mapping from the set of referenceable elements in COBOL, including datanames, recordnames and filenames, to their corresponding set of variables in KERNEL is stored and can be used to interpret the KERNEL program. Furthermore, a mapping exists from the set of elements, including procedures and statements, to their corresponding labels in KERNEL, which can be used to trace back their origin. Listing 2 depicts an example JCL batch job which is used to submit the INTERESTCALCULATION program to the operating system. The programs called through the EXEC statements are used as the entry point to the COBOL program. Listing 3 depicts the KERNEL program resulting from the translation of the COBOL program and the JCL unit. Data flow analysis is then performed to build the use-def graph for a particular job unit.

    IDENTIFICATION DIVISION.
    PROGRAM-ID. INTERESTCALCULATION.
    ...
    DATA DIVISION.
    FILE SECTION.
    FD IN-FILE.
    01 IN-REC.
       02 IN-NAME     PIC A(20).
       02 IN-ACCOUNT  PIC 9(6)V99.
       02 IN-INTEREST PIC 99V99.
    FD OUT-FILE.
    01 OUT-REC PIC X(80).
    ...
    WORKING-STORAGE SECTION.
    77 EOF PIC X VALUE "N".
    01 INTEREST1 PIC 99V99.
    01 INTEREST2 PIC 99V99.
    ...
    PROCEDURE DIVISION.
    MAIN.
        OPEN INPUT IN-FILE OUTPUT OUT-FILE.
        READ IN-FILE END MOVE "Y" TO EOF.
        PERFORM INTEREST-CALC THRU PRINT.
        PERFORM END-PROGRAM.

    INTEREST-CALC.
        IF IN-ACCOUNT IS NOT < 150000
            SUBTRACT 50000 FROM IN-ACCOUNT
            MULTIPLY IN-INTEREST BY IN-ACCOUNT
                GIVING INTEREST1, INTEREST2.
        ...
    PRINT.
        MOVE "INCOME INTEREST SLIP " TO OUT-REC.
        WRITE OUT-REC.
        ...
    END-PROGRAM.
        STOP RUN.

Listing 1. A representative COBOL program

B. Empirical Findings of the InterestCalculation System

In this section, we give our findings and observations with respect to the InterestCalculation system.
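To make the use-def construction of Section III-A concrete, the fragment below sketches how use-def edges can be read off from uses/defines annotations like those appearing in the KERNEL program of Listing 3. It is a deliberately simplified stand-in, not the GELATO implementation: it handles only straight-line code, and the labels and variable names are invented for illustration.

```python
# Simplified sketch of building use-def edges from uses/defines
# facts, in the spirit of the labelled KERNEL blocks of Listing 3.
# The labels and variables are hypothetical examples.

# (label, uses, defines) triples for a straight-line block, loosely
# mirroring fragments such as "29:[defines(var2)]" and
# "31:[uses(var2)]".
BLOCK = [
    (29, [],               ["var2"]),
    (30, ["var2"],         ["var4"]),
    (31, ["var2", "var4"], []),
]

def use_def_edges(block):
    """Link each use to the label of the definition reaching it.

    For straight-line code a single forward pass suffices: we track,
    per variable, the label of its most recent definition."""
    last_def = {}   # variable -> label of the reaching definition
    edges = []      # (use_label, def_label, variable)
    for label, uses, defines in block:
        for v in uses:
            # A use with no reaching definition produces no edge:
            # its value comes from the environment.
            if v in last_def:
                edges.append((label, last_def[v], v))
        for v in defines:
            last_def[v] = label
    return edges

print(use_def_edges(BLOCK))
# → [(30, 29, 'var2'), (31, 29, 'var2'), (31, 30, 'var4')]
```

Note that a use without any reaching definition simply yields no edge; in the full analysis such uses signal values that flow in from the environment, which is exactly the situation observed for the dynamic call statements discussed below.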
    //EXJCL JOB 'CALC',CLASS=6,MSGCLASS=X,NOTIFY=&SYSUID
    //*
    //STEP001 EXEC PGM=INTERESTCALCULATION

Listing 2. An example JCL batch job for submitting the COBOL program to the OS

    0:Procedure main(){
      35: call INTERESTCALCULATION();
    }
    1:Procedure INTERESTCALCULATION(){
      2:Procedure PROC1(){
        6:try {7:[uses(var1);uses(var2)];}
        with 8: exception {9: abort;}
        10:try {11:[uses(var1)];}
        with 12: exception {13: [defines(var3)];}
        14: { 15: call PROC2(); 16: call PROC3();}
        17: { 18: call PROC4();}
      }
      3:Procedure PROC2(){
        19:if(20:[uses(var1)])then{
          21:try {22:[uses(var1);defines(var1)];}
          with 23: exception {24: abort;}
          25:try {26:[uses(var1);uses(var1);defines(var4);defines(var5)];}
          with 27: exception {28: abort;}
        };
      }
      4:Procedure PROC3(){
        29:[defines(var2)];
        30:try {31:[uses(var2)];}
        with 32: exception {33: abort;}
      }
      5:Procedure PROC4(){
        34:{abort;}
      }
    }

Listing 3. The representation of the COBOL program in KERNEL

As is the case in many legacy systems, during our copybook inlining operation we found that 45 copybooks were missing. Consequently, some of the identifiers used in the programs could not be resolved to any data item. To overcome this hurdle, we create a set of proxy referenceable elements to resolve the unresolved identifiers. Moreover, to our surprise, we found that just over a quarter of the 21085 copybooks handed to us were actually used. The entire set of copybooks comprised almost 600 KLoC. Table I gives some metrics for the InterestCalculation system.

TABLE I
METRICS FOR THE INTERESTCALCULATION SYSTEM

                        #Files   KLoC     Highest (In/Out)degree   Lowest (In/Out)degree
    COBOL Source Files  321      413.17   151                      2
    Used Copybooks      599      88.04    1055                     1

We use the classification of dependencies for COBOL as defined in [8], which distinguishes functional dependencies and data dependencies. A functional dependency is created from a calling program to the callee through a CALL statement, whereas a data dependency is created from a program to a copybook through a COPY statement. We extract the structural dependencies during the inlining operation.

In order to extract the functional dependencies, we needed to build the use-def graph for the InterestCalculation system. We have followed the approach given in the previous section to instantiate the framework and construct the use-def graph. We opted for an exhaustive analysis by including a call to all the modules in the initial program entry. All the program calls in the system are dynamic; however, upon completion of this analysis, we observe that all the values reaching call statements are uninitialized. That is, the value reached upon calling a program comes from the environment. A practice used in this system, as is common in many well-engineered systems, is to use a set of immutable variables and initialize them with the names of the programs to be called in the declaration section of the COBOL program. Following up on this finding, we extract the string literals of the initial values of the variables from the declaration section to create a functional dependency graph.

The experiment was performed on a 2.80 GHz Intel Core i7 quad-core machine with 16 GB RAM. The parsing of the COBOL and JCL code takes about 20 minutes. The transformation and writing to a new file takes the least amount of time, namely 812 seconds. Loading the generated KERNEL code followed by instantiating the monotone framework and performing the data flow analysis takes the longest, with 1676 seconds. Thus, the whole process from parsing the COBOL and JCL units to constructing the use-def graph takes just over an hour. Further improvement of the performance is required to ensure the completion of the analysis in a reasonable amount of time.

IV. CONCLUSION AND FUTURE WORK

We have proposed a generic framework for analyzing legacy software systems. Based on our observations regarding the problems one may encounter when dealing with large legacy systems, our framework employs non-determinism and abstraction to achieve language independence and incrementality. Language independence is achieved through the specification of the source program in terms of an intermediate language which uses non-determinism to capture semantic variation points at the control flow level. Moreover, the intermediate language provides extension points to support the abstraction of the data flow problem. This gives rise to incrementality, which can be used to compute more precise as well as more fine-grained analyses.

As part of future research, we want to go beyond COBOL, by both extending our tools to analyze programs in heterogeneous environments and handling embedded languages. We want to further mature the GELATO toolset by conducting more experiments on real case studies and by performing more testing to validate it. Furthermore, we want to perform more analyses to assist with service identification using the framework.
REFERENCES

[1] L. Moonen, "A generic architecture for data flow analysis to support reverse engineering," Theory and Practice of Algebraic Specifications; ASF+SDF, vol. 97, 1997.
[2] R. Lämmel and C. Verhoef, "Cracking the 500-language problem," IEEE Software, vol. 18, no. 6, pp. 78–88, 2001.
[3] P. Baumann, J. Faessler, M. Kiser, Z. Oeztuerk, and L. Richter, "Semantics-based reverse engineering," 1994.
[4] F. Nielson, H. R. Nielson, and C. Hankin, Principles of Program Analysis. Springer-Verlag New York, 1999.
[5] R. Khadka, G. Reijnders, A. Saeidi, S. Jansen, and J. Hage, "A method engineering based legacy to SOA migration method," in 27th ICSM'11. IEEE, 2011, pp. 163–172.
[6] A. Saeidi, J. Hage, R. Khadka, and S. Jansen, "A generic framework for model-driven analysis of heterogeneous legacy software systems," 2017. [Online]. Available: https://dspace.library.uu.nl/handle/1874/359542
[7] A. Saeidi, J. Hage, R. Khadka, and S. Jansen, "Gelato: GEneric LAnguage TOols for model-driven analysis of legacy software systems," in Reverse Engineering (WCRE), 2013 20th Working Conference on, Oct. 2013, pp. 481–482.
[8] J. Van Geet and S. Demeyer, "Lightweight visualisations of Cobol code for supporting migration to SOA," Electronic Communications of the EASST, vol. 8, 2008.