A Generic Framework for the Analysis of Heterogeneous Legacy Software Systems

Amir M. Saeidi, Jurriaan Hage, Ravi Khadka, Slinger Jansen
Department of Information and Computing Sciences, Utrecht University, The Netherlands
{a.m.saeidi, j.hage, r.khadka, slinger.jansen}@uu.nl

Abstract—The reverse engineering of legacy systems is a process that involves analysis and understanding of the given systems. Some people believe in-depth knowledge of the system is a prerequisite for its analysis, whereas others, ourselves included, argue that only specific knowledge is required on a per-project basis. To support the latter approach, we propose a generic framework that employs the techniques of non-determinism and abstraction to enable us to build tooling for analyzing large systems. As part of the framework, we introduce an extensible imperative procedural language called KERNEL which can be used for constructing an abstract representation of the control flow and data flow of the system. To illustrate its use, we show how such a framework can be instantiated to build a use-def graph for a large industrial legacy COBOL and JCL system. We have implemented our framework in a model-driven fashion to facilitate the development of relevant tools. The resulting GELATO tool set can be used within the Eclipse environment.

I. INTRODUCTION

Many companies operate systems which have been developed over a period of many decades. These legacy systems are subject to continuous adaptation and evolution to deal with changing internal and external factors. Many of these systems do not meet the requirements of a maintainable system, mainly due to a lack of documentation and programming structure. Reverse engineering can be employed to create a high-level abstraction of the system and to identify its logical components [1].

There are many challenges that one needs to deal with when reverse engineering a large legacy system. First of all, finding a program understanding tool which can deal with the system of interest is almost impossible. On the other hand, implementing a high-quality tool from scratch that can handle the system is a tedious and time-consuming task. Furthermore, the old programming languages used to develop legacy systems tend to suffer from a lack of "singularity" [2] and "elegance" [3], as viewed from the perspective of modern programming languages. We have investigated the use of automatic analysis techniques to provide tool support and help with understanding programs written in these languages.

Program analysis is an automatic analysis technique that can be used as part of reverse engineering [4]. Any deep program analysis starts with a syntactic analyzer parsing syntactic units into what is known as an abstract syntax tree. The tree produced must be annotated with the necessary semantic knowledge by means of a semantic analysis. Although syntactic analysis depends on the grammar of the language for which the analysis needs to be performed, we argue that semantic analysis should be performed independently of the language to be processed (see Section II). This raises two questions that need to be addressed: 1) Is it possible to capture the semantics upfront for all dialects and implementations of the same programming language? 2) How much semantic information is 'necessary' to establish a sound foundation for conducting a particular program analysis?

For a language like COBOL, which comes in various dialects, each of which may have different compiler products, establishing such semantic knowledge is impractical. In short, no single semantics exists! On the other hand, the semantic knowledge required strongly depends on the analysis one wants to perform. For example, a type-based program analysis needs to decorate the data definitions with the appropriate types, whereas a control-based analysis needs to know about control dependencies. Moreover, when dealing with large systems, abstraction is not a choice but a necessity. The analysis techniques need to be precise and scale at the same time.

Lämmel and Verhoef [2] propose a technique in which syntactic tools are constructed and later augmented with semantic knowledge on a per-project basis (demand-driven semantics). We build on this approach by introducing a generic framework that employs 1) non-determinism to compute a sound abstraction of the control flow of the program, and 2) abstraction to compute a particular program analysis with respect to just as much semantic information as is required. To realize these features, the framework consists of an extensible intermediate language that helps achieve a separation between abstraction of the problem and data flow analysis. This separation provides the context for an incremental approach to analyzing large software systems.

The paper makes the following contributions:
1) It presents a generic framework for performing program analysis on legacy systems that can be instantiated in a system-specific fashion.
2) It employs techniques from MDE to facilitate the analysis of legacy systems and construct the required reverse engineering tools.

This paper is structured as follows. In Section II we outline the challenges we have faced in dealing with our industrial legacy system, and describe the generic framework to overcome the stated problems. We proceed by giving an empirical evaluation of our framework in Section III. Finally, in Section IV we conclude and outline future work.

II. A GENERIC FRAMEWORK

We were involved in a legacy-to-SOA migration project at a large banking institution in the Netherlands, comprising five distinct legacy systems. Like many business-critical systems, their systems are implemented in COBOL and run on platforms such as IBM z/OS and HP Tandem NonStop. We have proposed a method [5] for migrating legacy systems to SOA which involves identifying candidate services followed by concept slicing to extract the relevant pieces of code. To evaluate our methodology, we have been given access to one of their legacy systems, which from now on we will refer to as InterestCalculation. As is the case with most legacy systems, the documentation of the InterestCalculation system is outdated and many of the people who were involved in its development are not around anymore. We want to apply techniques from the field of program analysis to help with both the identification of services and slicing.

There are three important issues that need to be addressed when performing program analysis on legacy systems. First of all, many legacy systems are heterogeneous and constitute multi-language applications. For instance, systems implemented for IBM mainframes usually employ JCL job units to describe the different task routines that need to be performed within the legacy environment. Furthermore, COBOL has several extensions to provide support for embedded languages such as SQL and CICS. These are used to perform queries on tables and to process customer transactions, respectively. This also holds for our InterestCalculation system, which comprises COBOL programs and copybooks as well as JCL jobs, the former of which contain embedded SQL statements.

Second, programming languages used for legacy systems do not follow an explicitly defined language standard. In languages like COBOL and C, the semantics of many operations are left open and the implementation must choose how to implement these operations. Furthermore, instances of a given programming language may be home-brewed. It is estimated that there are about 300 COBOL dialects, each of which has its own compiler products with many patch levels [2]. Consequently, the only possible way to deal with the inconsistencies is to rely on the compiler used to compile the system that is subject to analysis.

Finally, legacy systems contain code bases that run well into millions of lines of code, hence the scalability of any program analysis technique is essential. The InterestCalculation system consists of almost half a million lines of COBOL source files and copybooks. Developing analysis techniques that are simultaneously precise and scalable is not a simple task.

To overcome the aforementioned problems, we introduce a generic framework for analyzing large legacy systems which has the following three features:
1) Language/Dialect Independence: We strongly believe that standardization through conversion to a well-understood syntactic structure with semantic variation points is the key to analyzing different dialects and versions of the COBOL language, and it naturally paves the way for heterogeneous systems comprising COBOL and JCL.
2) Abstraction and Non-determinism: Semantic analysis needs to be performed in a context-specific manner. We borrow concepts from programming language theory, including non-determinism and abstraction, to create an environment through which semantic knowledge can be added to the system of interest. Non-determinism guarantees the soundness of the analysis by exploring all the possible variations at the cost of performance, whereas abstraction ensures that only a minimal amount of information is stored to perform a sound analysis.
3) Incrementality: Incrementality is key in building analysis tools that scale to large systems. The separation of problem specification (abstraction) and data flow analysis is the way forward for incremental analysis. In this approach, the framework can be re-instantiated with the new information obtained from the result of an analysis to perform more fine-grained analyses.

To realize the above properties, the framework consists of an extensible intermediate language called KERNEL. KERNEL employs non-determinism to capture semantic variation points at the control flow level. Furthermore, it provides extension points to extend the language to incorporate the abstractions required to compute a particular data flow analysis.

Figure 1 depicts the step-by-step approach to instantiating the framework for performing a particular data flow analysis. The first step involves syntactic abstraction (parsing) of the source program into an AST. In the next step, an abstract (static) semantics is created based on the concrete or abstract programming languages that the program has to conform to, irrespective of whether those are different dialects/implementations of the same programming language or different languages that the program is written in.

Depending on the data flow problem we are interested in, the deployed abstraction techniques ensure that enough information is stored to perform a sound analysis with respect to that problem. For instance, the reaching definitions analysis used to build the use-def graph is expressed as an abstract interpretation of the program which, for each expression in the program, infers whether a variable is definitely used or defined after the possible execution of the statement. To help with the formulation of this abstraction, extension points are provided in the kernel language to instantiate the abstract domain for a set of analysis problems.

Based on the extracted abstract semantics, a mapping is created from the syntactic structure of the source program into an instance of the KERNEL language. The use of non-determinism makes it possible to encode inconsistencies amongst different implementations, as well as points where the particular semantics cannot be derived, e.g. when no knowledge of the compiler is available.

In the next step, we specify the data flow analysis problem as a monotone framework instance [4] and solve the instance using an iterative work-list algorithm.
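To make this step concrete, the work-list solution can be sketched as follows. This is our own minimal Python illustration, not GELELO/KERNEL code from the paper: it solves reaching definitions (the RDA instance used later) as a monotone framework instance, with hypothetical inputs `stmts` (which variable, if any, each labelled statement defines) and `flow` (the control-flow edges).

```python
# Illustrative sketch (not GELATO/KERNEL code): reaching definitions
# solved as a monotone framework instance with an iterative work-list
# algorithm. A "definition" is a pair (variable, label).

def reaching_definitions(stmts, flow):
    """stmts: {label: variable defined at that label, or None};
    flow: set of control-flow edges (label, label)."""
    all_defs = {(v, l) for l, v in stmts.items() if v is not None}
    # Transfer function: RD_exit(l) = gen(l) | (RD_entry(l) - kill(l))
    gen = {l: ({(v, l)} if v else set()) for l, v in stmts.items()}
    kill = {l: ({d for d in all_defs if d[0] == v} if v else set())
            for l, v in stmts.items()}
    rd_entry = {l: set() for l in stmts}
    worklist = list(flow)
    while worklist:
        a, b = worklist.pop()
        out_a = gen[a] | (rd_entry[a] - kill[a])
        if not out_a <= rd_entry[b]:      # value at b not yet stable
            rd_entry[b] |= out_a          # monotone update, so this terminates
            worklist += [e for e in flow if e[0] == b]  # revisit successors
    return {l: gen[l] | (rd_entry[l] - kill[l]) for l in stmts}

# A three-statement program: 1: x := ...; 2: y := ...; 3: x := ...
rd = reaching_definitions({1: "x", 2: "y", 3: "x"}, {(1, 2), (2, 3)})
# rd[2] == {("x", 1), ("y", 2)}; the definition of x at 1 is killed at 3.
```

Because the transfer functions are monotone over a finite lattice of definition sets, the iteration is guaranteed to reach a fixed point regardless of the order in which edges are taken from the work list.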
Fig. 1. The Generic Framework for Analyzing Legacy Systems

The monotone framework consists of a set of monotonic transfer functions which express the effect of statements on the desired properties of the program with respect to a flow analysis problem. Many flow analysis problems, such as reaching definitions analysis (RDA), meet the monotonicity requirement and can be expressed in terms of the monotone framework. Based on the data flow analysis problem, here RDA, we give a set of transfer functions and data flow equations to instantiate the monotone framework.

The results of data flow analysis can be reused to incrementally analyze a legacy system. The result of one analysis serves as a foundation to conduct more fine-grained analyses. To demonstrate this, consider the inter-procedural dependencies derived from RDA. Once we have constructed the information chain, we can interactively scope down our analysis to a smaller set of modules to perform much more detailed analyses that, because of their resource demands, cannot be applied to the system as a whole. For more information about the KERNEL language as well as details of the analysis, please refer to Saeidi et al. [6].

We have implemented the framework as part of the GELATO (Generic Language Tools) toolset [7]. GELATO is an integrated set of language-independent (generic) tools for legacy software system modernization, including parsers, analyzers, transformers, visualizers and pretty printers for different programming languages, including COBOL and JCL.

III. EVALUATION

A. Example Case Study

Here we give an example COBOL program that is representative of our InterestCalculation system, depicted in Listing 1. The mapping from the set of referenceable elements in COBOL, including datanames, recordnames and filenames, to their corresponding set of variables in KERNEL is stored and can be used to interpret the KERNEL program. Furthermore, a mapping exists from the set of elements including procedures and statements to their corresponding labels in KERNEL, which can be used to trace back their origin. Listing 2 depicts an example JCL batch job which is used to submit the INTERESTCALCULATION program to the operating system. The programs called through the EXEC statements are used as the entry point to the COBOL program. Listing 3 depicts the resulting KERNEL program from the translation of the COBOL program and the JCL unit. Data flow analysis is then performed to build the use-def graph for a particular job unit.

IDENTIFICATION DIVISION.
PROGRAM-ID. INTERESTCALCULATION.
...
DATA DIVISION.
FILE SECTION.
FD IN-FILE.
01 IN-REC.
   02 IN-NAME PIC A(20).
   02 IN-ACCOUNT PIC 9(6)V99.
   02 IN-INTEREST PIC 99V99.
FD OUT-FILE.
01 OUT-REC PIC X(80).
...
WORKING-STORAGE SECTION.
77 EOF PIC X VALUE "N".
01 INTEREST1 PIC 99V99.
01 INTEREST2 PIC 99V99.
...
PROCEDURE DIVISION.
MAIN.
    OPEN INPUT IN-FILE OUTPUT OUT-FILE.
    READ IN-FILE AT END MOVE "Y" TO EOF.
    PERFORM INTEREST-CALC THRU PRINT.
    PERFORM END-PROGRAM.

INTEREST-CALC.
    IF IN-ACCOUNT IS NOT < 150000
        SUBTRACT 50000 FROM IN-ACCOUNT
        MULTIPLY IN-INTEREST BY IN-ACCOUNT
            GIVING INTEREST1, INTEREST2.
...
PRINT.
    MOVE "INCOME INTEREST SLIP " TO OUT-REC.
    WRITE OUT-REC.
...
END-PROGRAM.
    STOP RUN.

Listing 1. A representative COBOL program

//EXJCL JOB 'CALC',CLASS=6,MSGCLASS=X,NOTIFY=&SYSUID
//*
//STEP001 EXEC PGM=INTERESTCALCULATION

Listing 2. An example JCL batch job for submitting the COBOL program to the OS

0:Procedure main(){
    35: call INTERESTCALCULATION();
}
1:Procedure INTERESTCALCULATION(){
    2:Procedure PROC1(){
        6:try {7:[uses(var1);uses(var2)];}
        with 8: exception {9: abort;}
        10:try {11:[uses(var1)];}
        with 12: exception {13: [defines(var3)];}
        14: { 15: call PROC2(); 16: call PROC3();}
        17: { 18: call PROC4();}
    }
    3:Procedure PROC2(){
        19:if(20:[uses(var1)])then{
            21:try {22:[uses(var1);defines(var1)];}
            with 23: exception {24: abort;}
            25:try {26:[uses(var1);uses(var1);defines(var4);defines(var5)];}
            with 27: exception {28: abort;}
        };
    }
    4:Procedure PROC3(){
        29:[defines(var2)];
        30:try {31:[uses(var2)];}
        with 32: exception {33: abort;}
    }
    5:Procedure PROC4(){
        34:{abort;}
    }
}

Listing 3. The representation of the COBOL program in KERNEL

B. Empirical Findings of the InterestCalculation System

In this section, we give our findings and observations with respect to the InterestCalculation system.

We use the classification of dependencies for COBOL as defined in [8]. They classify the dependencies in terms of functional dependencies and data dependencies. A functional dependency is created from a calling program to the callee through a CALL statement, whereas a data dependency is created from a program to a copybook through a COPY statement. We extract the structural dependencies during the inlining operation.

As is the case in many legacy systems, during our copybook inlining operation we found out that there are 45 copybooks missing. Consequently, some of the identifiers used in the program could not be resolved to any data item. To overcome this hurdle, we create a set of proxy referenceable elements to resolve the unresolved identifiers. Moreover, to our surprise, we found out that just over a quarter of the 21085 copybooks handed to us were actually used. The entire set of copybooks comprises almost 600 KLoC. Table I gives some metrics for the InterestCalculation system.

In order to extract the functional dependencies, we needed to build the use-def graph for the InterestCalculation system. We have followed the approach given in the previous section to instantiate the framework and construct the use-def graph. We have opted to perform an exhaustive analysis by including a call to all the modules in the initial program entry. All the program calls in the system are dynamic; however, upon completion of this analysis, we observe that all the values reaching call statements are uninitialized. That is, the value reached upon calling a program comes from the environment. A practice used in this system, as is common in many well-engineered systems, is to use a set of immutable variables and initialize them with the names of the programs to be called in the declaration section of the COBOL program. Following up on this finding, we extract the string literals of the initial values of the variables from the declaration section to create a functional dependency graph.

The experiment is performed on a 2.80 GHz Intel Core i7 quad-core machine with 16 GB RAM. The parsing of the COBOL and JCL code takes about 20 minutes. The transformation and writing to a new file takes the least amount of time, namely 812 seconds. Loading the generated KERNEL code followed by instantiating the monotone framework and performing the data flow analysis takes the longest, with 1676 seconds. Thus, the whole process from parsing the COBOL and JCL units to constructing the use-def graph takes just over an hour. Further improvement of the performance is required to ensure the completion of the analysis in a reasonable amount of time.

IV. CONCLUSION AND FUTURE WORK

We have proposed a generic framework for analyzing legacy software systems. Based on our observations regarding the problems one may encounter when dealing with large legacy systems, our framework employs non-determinism and abstraction to achieve language independence and incrementality. Language independence is achieved through the specification of the source program in terms of an intermediate language which uses non-determinism to capture semantic variation points at the control flow level. Moreover, the intermediate language provides extension points to support the abstraction of the data flow problem. This gives rise to incrementality, which can be used to compute more precise as well as more fine-grained analyses.
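The incremental use of analysis results described above, in which the dependencies recovered by one analysis delimit the scope of the next, more expensive one, can be illustrated with a small sketch. This is our own hypothetical Python example, not part of GELATO: `deps` stands in for a recovered functional dependency graph such as the one derived from the use-def analysis, and the module names are borrowed from Listing 3 purely for illustration.

```python
# Hypothetical sketch (not GELATO code): reuse a recovered functional
# dependency graph to scope a follow-up analysis down to the modules
# reachable from a single entry point.

def reachable_modules(deps, entry):
    """deps: {module: set of called modules}; returns all modules
    transitively reachable from entry (including entry itself)."""
    seen, stack = set(), [entry]
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(deps.get(m, ()))
    return seen

# Only the scoped-down subsystem is fed to the more detailed analysis.
deps = {"EXJCL": {"INTERESTCALCULATION"},
        "INTERESTCALCULATION": {"PROC2", "PROC3", "PROC4"}}
scope = reachable_modules(deps, "INTERESTCALCULATION")
# scope == {"INTERESTCALCULATION", "PROC2", "PROC3", "PROC4"}
```

Restricting a resource-hungry analysis to such a reachable subset is what makes it feasible to run analyses that could not be applied to the system as a whole.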
TABLE I
METRICS FOR THE INTERESTCALCULATION SYSTEM

                     #Files   KLoC     Highest (In/Out)degree   Lowest (In/Out)degree
COBOL Source Files   321      413.17   151                      2
Used Copybooks       599      88.04    1055                     1

As part of future research, we want to go beyond COBOL, by both extending our tools to analyze programs in heterogeneous environments and handling embedded languages. We want to further mature the GELATO toolset by conducting more experiments on real case studies and by performing more testing to validate it. Furthermore, we want to perform more analyses to assist with service identification using the framework.

REFERENCES

[1] L. Moonen, "A generic architecture for data flow analysis to support reverse engineering," Theory and Practice of Algebraic Specifications; ASF+SDF, vol. 97, 1997.
[2] R. Lämmel and C. Verhoef, "Cracking the 500-language problem," IEEE Software, vol. 18, no. 6, pp. 78–88, 2001.
[3] P. Baumann, J. Faessler, M. Kiser, Z. Oeztuerk, and L. Richter, "Semantics-based reverse engineering," 1994.
[4] F. Nielson, H. R. Nielson, and C. Hankin, Principles of Program Analysis. Springer-Verlag New York, 1999.
[5] R. Khadka, G. Reijnders, A. Saeidi, S. Jansen, and J. Hage, "A method engineering based legacy to SOA migration method," in 27th IEEE International Conference on Software Maintenance (ICSM'11). IEEE, 2011, pp. 163–172.
[6] A. Saeidi, J. Hage, R. Khadka, and S. Jansen, "A generic framework for model-driven analysis of heterogeneous legacy software systems," 2017. [Online]. Available: https://dspace.library.uu.nl/handle/1874/359542
[7] A. Saeidi, J. Hage, R. Khadka, and S. Jansen, "GELATO: GEneric LAnguage TOols for model-driven analysis of legacy software systems," in 20th Working Conference on Reverse Engineering (WCRE), Oct 2013, pp. 481–482.
[8] J. Van Geet and S. Demeyer, "Lightweight visualisations of COBOL code for supporting migration to SOA," Electronic Communications of the EASST, vol. 8, 2008.