             A Generic Framework for the Analysis of
             Heterogeneous Legacy Software Systems
                               Amir M. Saeidi, Jurriaan Hage, Ravi Khadka, Slinger Jansen
                   Department of Information and Computing Sciences, Utrecht University, The Netherlands
                                     {a.m.saeidi, j.hage, r.khadka, slinger.jansen}@uu.nl


   Abstract—The reverse engineering of legacy systems is a process that involves analysis and understanding of the given systems. Some people believe in-depth knowledge of the system is a prerequisite for its analysis, whereas others, ourselves included, argue that only specific knowledge is required on a per-project basis. To support the latter approach, we propose a generic framework that employs the techniques of non-determinism and abstraction to enable us to build tooling for analyzing large systems. As part of the framework, we introduce an extensible imperative procedural language called KERNEL which can be used for constructing an abstract representation of the control flow and data flow of the system. To illustrate its use, we show how such a framework can be instantiated to build a use-def graph for a large industrial legacy COBOL and JCL system. We have implemented our framework in a model-driven fashion to facilitate the development of relevant tools. The resulting GELATO toolset can be used within the Eclipse environment.

                      I. INTRODUCTION

   Many companies operate systems that have been developed over a period of many decades. These legacy systems are subject to continuous adaptation and evolution to deal with changing internal and external factors. Many of these systems do not meet the requirements of a maintainable system, mainly due to a lack of documentation and programming structure. Reverse engineering can be employed to create a high-level abstraction of the system and to identify its logical components [1].

   There are many challenges that one needs to deal with when reverse engineering a large legacy system. First of all, finding a program understanding tool which can deal with the system of interest is almost impossible. On the other hand, implementing a high-quality tool from scratch that can handle the system is a tedious and time-consuming task. Furthermore, the old programming languages used to develop legacy systems tend to suffer from a lack of “singularity” [2] and “elegance” [3], as viewed from the perspective of modern programming languages. We have investigated the use of automatic analysis techniques to provide tool support and help with understanding programs written in these languages.

   Program analysis is an automatic analysis technique that can be used as part of reverse engineering [4]. Any deep program analysis starts with a syntactic analyzer parsing syntactic units into what is known as an abstract syntax tree. The tree produced must be annotated with the necessary semantic knowledge by means of a semantic analysis. Although syntactic analysis depends on the grammar of the language for which the analysis needs to be performed, we argue that semantic analysis should be performed independently of the language to be processed (see Section II). This raises two questions that need to be addressed: 1) Is it possible to capture the semantics upfront for all dialects and implementations of the same programming language? 2) How much semantic information is ‘necessary’ to establish a sound foundation for conducting a particular program analysis?

   For a language like COBOL, which comes in various dialects, each of which may have different compiler products, establishing such semantic knowledge upfront is impractical. In short, no single semantics exists! On the other hand, the semantic knowledge required strongly depends on the analysis one wants to perform. For example, a type-based program analysis needs to decorate the data definitions with the appropriate types, whereas a control-based analysis needs to know about control dependencies. Moreover, when dealing with large systems, abstraction is not a choice but a necessity. The analysis techniques need to be precise and scale at the same time.

   Lämmel and Verhoef [2] propose a technique in which syntactic tools are constructed and later augmented with semantic knowledge on a per-project basis (demand-driven semantics). We build on this approach by introducing a generic framework that employs 1) non-determinism, to compute a sound abstraction of the control flow of the program, and 2) abstraction, to compute a particular program analysis with respect to only the semantic information it requires. To realize these features, the framework consists of an extensible intermediate language that helps achieve a separation between the abstraction of the problem and the data flow analysis. This separation provides the context for an incremental approach to analyzing large software systems.

   The paper makes the following contributions:

   1) It presents a generic framework for performing program analysis on legacy systems that can be instantiated in a system-specific fashion.
   2) It employs techniques from MDE to facilitate analysis of legacy systems and construct the required reverse engineering tools.

   This paper is structured as follows. In Section II we outline the challenges we have faced in dealing with our industrial legacy system, and describe the generic framework to overcome the stated problems. We proceed by giving an empirical evaluation of our framework in Section III. Finally, in Section IV we conclude and outline future work.
                II. A GENERIC FRAMEWORK

   We were involved in a legacy-to-SOA migration project at a large banking institution in the Netherlands, comprising five distinct legacy systems. Like many business-critical systems, their systems are implemented in COBOL and run on platforms such as IBM z/OS and HP Tandem NonStop. We have proposed a method [5] for migrating legacy systems to SOA which involves identifying candidate services followed by concept slicing to extract relevant pieces of code. To evaluate our methodology, we have been given access to one of their legacy systems, which from now on we will refer to as InterestCalculation. As is the case with most legacy systems, the documentation of the InterestCalculation system is outdated and many of the people who were involved in its development are not around anymore. We want to apply techniques from the field of program analysis to help with both the identification of services and slicing.

   There are three important issues that need to be addressed when performing program analysis on legacy systems. First of all, many legacy systems are heterogeneous and constitute multi-language applications. For instance, systems implemented for IBM mainframes usually employ JCL job units to describe the different task routines that need to be performed within the legacy environment. Furthermore, COBOL has several extensions that provide support for embedded languages such as SQL and CICS, which are used to perform queries on tables and to process customer transactions, respectively. This also holds for our InterestCalculation system, which comprises COBOL programs and copybooks as well as JCL jobs, the former of which contain embedded SQL statements.

   Second, programming languages used for legacy systems do not follow an explicitly defined language standard. In languages like COBOL and C, the semantics of many operations are left open and the implementation must choose how to implement these operations. Furthermore, instances of a given programming language may be home-brewed. It is estimated that there are about 300 COBOL dialects, each of which has its own compiler products with many patch levels [2]. Consequently, the only possible way to deal with inconsistencies is to rely on the compiler used to compile the system that is subject to analysis.

   Finally, legacy systems contain code bases that run well into millions of lines of code, hence the scalability of any program analysis technique is essential. The InterestCalculation system consists of almost half a million lines of COBOL source files and copybooks. Developing analysis techniques that are simultaneously precise and scalable is not a simple task.

   To overcome the aforementioned problems, we introduce a generic framework for analyzing large legacy systems which has the following three features:

   1) Language/Dialect-Independence: We strongly believe standardization through conversion to a well-understood syntactic structure with semantic variation points is the key to analyzing different dialects and versions of the COBOL language, and it naturally paves the way for heterogeneous systems comprising COBOL and JCL.
   2) Abstraction and Nondeterminism: Semantic analysis needs to be performed in a context-specific manner. We borrow concepts from programming language theory, including non-determinism and abstraction, to create an environment through which semantic knowledge can be added to the system of interest. Non-determinism guarantees the soundness of the analysis by exploring all possible variations at the cost of performance, whereas abstraction ensures that only a minimal amount of information is stored to perform a sound analysis.
   3) Incrementality: Incrementality is key in building analysis tools that scale to large systems. The separation of problem specification (abstraction) and data flow analysis is the way forward for incremental analysis. In this approach, the framework can be re-instantiated with the new information obtained from the result of an analysis to perform more fine-grained analyses.

   To realize the above properties, the framework consists of an extensible intermediate language called KERNEL. KERNEL employs non-determinism to capture semantic variation points at the control flow level. Furthermore, it provides extension points to extend the language with the abstractions required to compute a particular data flow analysis.

   Figure 1 depicts the step-by-step approach to instantiating the framework for performing a particular data flow analysis. The first step involves syntactic abstraction (parsing) of the source program into an AST. In the next step, an abstract (static) semantics is created based on the concrete or abstract programming languages that the program has to conform to, irrespective of whether those are different dialects/implementations of the same programming language or different languages that the program is written in.

   Depending on the data flow problem we are interested in, the deployed abstraction techniques ensure that enough information is stored to perform a sound analysis with respect to that problem. For instance, the reaching definitions analysis used to build the use-def graph is expressed as an abstract interpretation of the program which, for each expression in the program, infers whether a variable is definitely used or defined after the possible execution of the statement. To help with the formulation of this abstraction, extension points are provided in the KERNEL language to instantiate the abstract domain for a set of analysis problems.
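   To make this concrete, the sketch below shows one possible encoding of such an abstract domain for reaching definitions. It is a minimal illustration in Java; the class and method names are our own assumptions rather than part of KERNEL or GELATO. A fact records that the definition of a variable at a given KERNEL block label may reach a program point, and the transfer function of a block removes the facts it kills and adds the definitions it generates.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch (not actual GELATO code): an abstract domain for a
// reaching-definitions style analysis over KERNEL blocks. A fact (variable, label)
// means "the definition of `variable` in the block labelled `label` may reach here".
final class Definition {
    final String variable;  // a KERNEL variable, e.g. "var1"
    final int label;        // a KERNEL block label, e.g. 22
    Definition(String variable, int label) { this.variable = variable; this.label = label; }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Definition)) return false;
        Definition d = (Definition) o;
        return d.label == label && d.variable.equals(variable);
    }
    @Override public int hashCode() { return 31 * variable.hashCode() + label; }
    @Override public String toString() { return variable + "@" + label; }
}

final class ReachingDefinitionsTransfer {
    // Transfer function for a block such as 22:[uses(var1);defines(var1)]:
    // out = (in \ kill) ∪ gen, where kill removes older definitions of the variables
    // defined here and gen adds the definitions made by this block. The uses(...)
    // facts are consulted later, when the use-def graph is built.
    static Set<Definition> apply(Set<Definition> in, int label, Set<String> definedHere) {
        Set<Definition> out = new HashSet<>();
        for (Definition d : in) {
            if (!definedHere.contains(d.variable)) out.add(d);
        }
        for (String v : definedHere) out.add(new Definition(v, label));
        return out;
    }
}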
                                    Fig. 1. The Generic Framework for Analyzing Legacy Systems



   Based on the extracted abstract semantics, a mapping is created from the syntactic structure of the source program into an instance of the KERNEL language. The use of non-determinism makes it possible to encode inconsistencies amongst different implementations, as well as points where the particular semantics cannot be derived, e.g. when no knowledge of the compiler is available.

   In the next step, we specify the data flow analysis problem as a monotone framework instance [4] and solve the instance using an iterative work-list algorithm. The monotone framework consists of a set of monotone transfer functions which express the effect of statements on the desired properties of the program with respect to a flow analysis problem. Many flow analysis problems, such as reaching definitions analysis (RDA), meet the monotonicity requirement and can be expressed in terms of the monotone framework. Based on the data flow analysis problem, here RDA, we give a set of transfer functions and data flow equations to instantiate the monotone framework.
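   The following is a minimal sketch of such an iterative work-list solver, reusing the illustrative Definition facts from the previous sketch; the Instance interface and its methods are assumptions made for the example and are simplified with respect to the actual implementation.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative work-list solver for a monotone framework instance, specialised
// here to sets of Definition facts joined by set union (as in reaching definitions).
final class WorklistSolver {
    interface Instance {
        Set<Integer> labels();                                   // all block labels
        Set<Integer> successors(int label);                      // control-flow successors
        Set<Definition> transfer(int label, Set<Definition> in); // monotone transfer function
        Set<Definition> initial();                               // facts at the entry label
        int entry();
    }

    static Map<Integer, Set<Definition>> solve(Instance inst) {
        // Entry facts per label, initialised to the bottom element (the empty set).
        Map<Integer, Set<Definition>> in = new HashMap<>();
        for (int l : inst.labels()) in.put(l, new HashSet<>());
        in.get(inst.entry()).addAll(inst.initial());

        Deque<Integer> worklist = new ArrayDeque<>(inst.labels());
        while (!worklist.isEmpty()) {
            int label = worklist.poll();
            Set<Definition> out = inst.transfer(label, in.get(label));
            for (int succ : inst.successors(label)) {
                // Join by set union; if the successor's input grew, revisit it.
                if (in.get(succ).addAll(out)) worklist.add(succ);
            }
        }
        return in; // a fixed point of the data flow equations
    }
}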
   The results of data flow analysis can be reused to incrementally analyze a legacy system. The result of one analysis serves as a foundation to conduct more fine-grained analyses. To demonstrate this, consider the inter-procedural dependencies derived from the RDA analysis. Once we have constructed the information chain, we can interactively scope down our analysis to a smaller set of modules to perform much more detailed analyses that, because of their resource demands, cannot be applied to the system as a whole. For more information about the KERNEL language as well as details of the analysis, please refer to Saeidi et al. [6].
   We have implemented the framework as part of the GELATO (Generic Language Tools) toolset [7]. GELATO is an integrated set of language-independent (generic) tools for legacy software system modernization, including parsers, analyzers, transformers, visualizers and pretty printers for different programming languages, including COBOL and JCL.

                      III. EVALUATION

A. Example Case Study
   Here we give an example COBOL program that is representative of our InterestCalculation system, depicted in Listing 1. The mapping from the set of referenceable elements in COBOL, including data names, record names and file names, to their corresponding set of variables in KERNEL is stored and can be used to interpret the KERNEL program. Furthermore, a mapping exists from the set of elements, including procedures and statements, to their corresponding labels in KERNEL, which can be used to trace back their origin. Listing 2 depicts an example JCL batch job which is used to submit the INTERESTCALCULATION program to the operating system. The programs called through the EXEC statements are used as the entry point to the COBOL program. Listing 3 depicts the resulting KERNEL program from the translation of the COBOL program and the JCL unit. Data flow analysis is then performed to build the use-def graph for a particular job unit.
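   A minimal sketch of this bookkeeping, with hypothetical names and example entries chosen purely for illustration, is a pair of tables: one from COBOL referenceable elements to KERNEL variables, used to interpret the KERNEL program, and one from KERNEL labels back to the originating COBOL procedures and statements.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the bookkeeping around the COBOL-to-KERNEL translation;
// the field and method names are assumptions, not GELATO's actual data structures.
final class TranslationMaps {
    // Referenceable COBOL elements (data names, record names, file names) to
    // KERNEL variables, e.g. "IN-ACCOUNT" -> "var1".
    final Map<String, String> elementToVariable = new HashMap<>();

    // KERNEL labels back to their originating COBOL procedure or statement,
    // e.g. 22 -> "INTERESTCALCULATION / INTEREST-CALC".
    final Map<Integer, String> labelToOrigin = new HashMap<>();

    String variableFor(String cobolElement) { return elementToVariable.get(cobolElement); }

    String originOf(int kernelLabel) { return labelToOrigin.get(kernelLabel); }
}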
IDENTIFICATION DIVISION.
PROGRAM-ID. INTERESTCALCULATION.
...
DATA DIVISION.
FILE SECTION.
FD IN-FILE.
01 IN-REC.
   02 IN-NAME     PIC A(20).
   02 IN-ACCOUNT  PIC 9(6)V99.
   02 IN-INTEREST PIC 99V99.
FD OUT-FILE.
01 OUT-REC PIC X(80).
...
WORKING-STORAGE SECTION.
77 EOF       PIC X VALUE "N".
01 INTEREST1 PIC 99V99.
01 INTEREST2 PIC 99V99.
...
PROCEDURE DIVISION.
MAIN.
    OPEN INPUT IN-FILE OUTPUT OUT-FILE.
    READ IN-FILE END MOVE "Y" TO EOF.
    PERFORM INTEREST-CALC THRU PRINT.
    PERFORM END-PROGRAM.

INTEREST-CALC.
    IF IN-ACCOUNT IS NOT < 150000
      SUBTRACT 50000 FROM IN-ACCOUNT
      MULTIPLY IN-INTEREST BY IN-ACCOUNT
        GIVING INTEREST1, INTEREST2.
...
PRINT.
    MOVE "INCOME INTEREST SLIP " TO OUT-REC.
    WRITE OUT-REC.
...
END-PROGRAM.
    STOP RUN.

            Listing 1. A representative COBOL program

//EXJCL JOB 'CALC',CLASS=6,MSGCLASS=X,NOTIFY=&SYSUID
//*
//STEP001 EXEC PGM=INTERESTCALCULATION

            Listing 2. An example JCL batch job for submitting the COBOL program to the OS

0:Procedure main(){
   35: call INTERESTCALCULATION();
}
1:Procedure INTERESTCALCULATION(){
   2:Procedure PROC1(){
      6:try {7:[uses(var1);uses(var2)];}
          with 8: exception {9: abort;}
      10:try {11:[uses(var1)];}
          with 12: exception {13: [defines(var3)];}
      14: { 15: call PROC2(); 16: call PROC3();}
      17: { 18: call PROC4();}
   }
   3:Procedure PROC2(){
      19:if(20:[uses(var1)])then{
        21:try {22:[uses(var1);defines(var1)];}
            with 23: exception {24: abort;}
        25:try {26:[uses(var1);uses(var1);defines(var4);defines(var5)];}
            with 27: exception {28: abort;}
      };
   }
   4:Procedure PROC3(){
      29:[defines(var2)];
      30:try {31:[uses(var2)];}
          with 32: exception {33: abort;}
   }
   5:Procedure PROC4(){
      34:{abort;}
   }
}

            Listing 3. The representation of the COBOL program in KERNEL
B. Empirical Findings of the InterestCalculation System

   In this section, we give our findings and observations with respect to the InterestCalculation system.

   As is the case in many legacy systems, during our copybook inlining operation we found that 45 copybooks are missing. Consequently, some of the identifiers used in the program could not be resolved to any data item. To overcome this hurdle, we create a set of proxy referenceable elements to resolve the unresolved identifiers. Moreover, to our surprise, we found that just over a quarter of the 21085 copybooks handed to us were actually used. The entire set of copybooks comprised almost 600 KLoC. Table I gives some metrics for the InterestCalculation system.

                                   TABLE I
                 METRICS FOR THE INTERESTCALCULATION SYSTEM

                          #Files    KLoC    Highest (In/Out)degree    Lowest (In/Out)degree
   COBOL Source Files      321     413.17            151                        2
   Used Copybooks          599      88.04           1055                        1

   We use the classification of dependencies for COBOL as defined in [8], which distinguishes functional dependencies from data dependencies. A functional dependency is created from a calling program to the callee through a CALL statement, whereas a data dependency is created from a program to a copybook through a COPY statement. We extract the structural dependencies during the inlining operation.

   In order to extract the functional dependencies, we needed to build the use-def graph for the InterestCalculation system. We have followed the approach given in the previous section to instantiate the framework and construct the use-def graph. We have opted for an exhaustive analysis by including a call to all the modules in the initial program entry. All the program calls in the system are dynamic; however, upon completion of this analysis, we observe that all the values reaching call statements are uninitialized. That is, the value reached upon calling a program comes from the environment. A practice used in this system, as is common in many well-engineered systems, is to use a set of immutable variables and initialize them with the names of the programs to be called in the declaration section of the COBOL program. Following up on this finding, we extract the string literals of the initial values of the variables from the declaration section to create a functional dependency graph.
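   The sketch below illustrates this last step under the stated assumption, namely that call targets appear as string literals in VALUE clauses of the declaration section; the regular expression, names and input representation are illustrative only.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: derive functional dependencies from the initial values of
// variables in the declaration section, assuming call targets are stored as string
// literals in VALUE clauses (e.g. 01 WS-TARGET PIC X(8) VALUE "SOMEPROG".).
final class FunctionalDependencyExtractor {
    private static final Pattern VALUE_LITERAL =
            Pattern.compile("VALUE\\s+[\"']([A-Z0-9-]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns edges caller -> callees for every literal that names a known program.
    static Map<String, Set<String>> extract(Map<String, List<String>> declarationsByProgram,
                                            Set<String> knownPrograms) {
        Map<String, Set<String>> edges = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : declarationsByProgram.entrySet()) {
            String caller = entry.getKey();
            for (String line : entry.getValue()) {
                Matcher m = VALUE_LITERAL.matcher(line);
                while (m.find()) {
                    String literal = m.group(1);
                    if (knownPrograms.contains(literal)) { // the literal names a program
                        edges.computeIfAbsent(caller, k -> new HashSet<>()).add(literal);
                    }
                }
            }
        }
        return edges;
    }
}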
   The experiment is performed on a 2.80 GHz Intel Core i7 quad-core machine with 16 GB of RAM. The parsing of the COBOL and JCL code takes about 20 minutes. The transformation and writing to a new file takes the least amount of time, namely 812 seconds. Loading the generated KERNEL code followed by instantiating the monotone framework and performing the data flow analysis takes the longest, with 1676 seconds. Thus, the whole process from parsing the COBOL and JCL units to constructing the use-def graph takes just over an hour. Further improvement of the performance is required to ensure the completion of the analysis in a reasonable amount of time.

                  IV. CONCLUSION AND FUTURE WORK

   We have proposed a generic framework for analyzing legacy software systems. Based on our observations regarding the problems one may encounter when dealing with large legacy systems, our framework employs nondeterminism and abstraction to achieve language independence and incrementality. Language independence is achieved through the specification of the source program in terms of an intermediate language which uses nondeterminism to capture semantic variation points at the control flow level. Moreover, the intermediate language provides extension points to support the abstraction of the data flow problem. This gives rise to incrementality, which can be used to compute more precise as well as more fine-grained analyses.

   As part of future research, we want to go beyond COBOL, both by extending our tools to analyze programs in heterogeneous environments and by handling embedded languages. We want to further mature the GELATO toolset by conducting more experiments on real case studies and more testing to validate it. Furthermore, we want to perform more analyses to assist with service identification using the framework.
                             R EFERENCES                                        [5] R. Khadka, G. Reijnders, A. Saeidi, S. Jansen, and J. Hage, “A method
                                                                                    engineering based legacy to SOA migration method,” in 27th ICSM’11.
[1] L. Moonen, “A generic architecture for data flow analysis to support            IEEE, 2011, pp. 163–172.
    reverse engineering,” Theory and Practice of Algebraic Specifications;      [6] A. Saeidi, J. Hage, R. Khadka, and S. Jansen, “A generic framework for
    ASF+SDF, vol. 97, 1997.                                                  model-driven analysis of heterogeneous legacy software systems,” 2017.
[2] R. Lämmel and C. Verhoef, “Cracking the 500-language problem,”                  [Online]. Available: https://dspace.library.uu.nl/handle/1874/359542
    Software, IEEE, vol. 18, no. 6, pp. 78–88, 2001.                            [7] ——, “Gelato: GEneric LAnguage TOols for model-driven analysis of
[3] P. Baumann, J. Faessler, M. Kiser, Z. Oeztuerk, and L. Richter,                 legacy software systems,” in Reverse Engineering (WCRE), 2013 20th
    “Semantics-based reverse engineering,” 1994.                                    Working Conference on, Oct 2013, pp. 481–482.
[4] F. Nielson, H. R. Nielson, and C. Hankin, Principles of program analysis.   [8] J. Van Geet and S. Demeyer, “Lightweight visualisations of Cobol code
    Springer-Verlag New York Incorporated, 1999.                                    for supporting migration to SOA,” Electronic Communications of the
                                                                                    EASST, vol. 8, 2008.