SEEKing Knowledge in Legacy Information Systems to Support Interoperability

Joachim Hammer, Mark Schmalz, William O'Brien¥, Sangeetha Shekar and Nikhil Haldevnekar
Dept. of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32605, U.S.A.

Abstract. The SEEK project (Scalable Extraction of Enterprise Knowledge) is developing methodologies to overcome the problems of assembling knowledge resident in numerous legacy information systems by enabling rapid connection to, and privacy-constrained filtering of, legacy data and applications with little programmatic setup. In this report we outline our use of data reverse engineering and code analysis techniques to automatically infer as much as possible of the schema and semantics of a legacy information system. We illustrate the approach using an example from our construction supply chain testbed.

1 MOTIVATION

We are developing methodologies and algorithms to facilitate discovery and extraction of enterprise knowledge from legacy sources. These capabilities are being implemented in a toolkit called SEEK (Scalable Extraction of Enterprise Knowledge). SEEK is being developed as part of a larger, multi-disciplinary research project to develop theory and methodologies in support of computerized decision and negotiation support across a network of firms (general overview in [6]). SEEK is not meant as a replacement for wrapper or mediator development toolkits.

[Figure 1: Using the SEEK toolkit to improve coordination in extended enterprises. The figure shows an extended enterprise of a coordinator, subcontractors, and suppliers; each firm's legacy source is accessed through a SEEK wrapper, and the wrappers connect to an analysis component deployed in a secure hosting infrastructure (e.g., E-ERP).]
Rather, it complements existing tools by providing input about the contents and structure of the legacy source that has so far been supplied manually by domain experts. This streamlines the process and makes wrapper development scalable.

Figure 1 illustrates the need for knowledge extraction tools in support of wrapper development in the context of a supply chain. There are many firms (principally, subcontractors and suppliers), and each firm contains legacy data used to manage internal processes. This data is also useful as input to a project-level decision support tool. However, the large number of firms working on a project makes it likely that there will be a high degree of physical and semantic heterogeneity in their legacy systems. This implies practical difficulties in connecting firms' data and systems with enterprise-level decision support tools. It is the role of the SEEK toolkit to help establish the necessary connections with minimal burden on the underlying firms, which often have limited technical expertise. The SEEK wrappers shown in Fig. 1 are wholly owned by the firm they are accessing and hence provide a safety layer between the source and end user. Security can be further enhanced by deploying the wrappers in a secure hosting infrastructure, at an ISP for example, as shown in the figure.

We note that SEEK is not intended to be a general-purpose data extraction tool: SEEK extracts a narrow range of data and knowledge from heterogeneous sources. Current instantiations of SEEK are designed to extract the limited range of information needed by these process models to support project optimization.

¥ Rinker School of Building Construction, University of Florida, Gainesville, FL 32634-6134

2 SEEK APPROACH TO KNOWLEDGE EXTRACTION

SEEK applies Data Reverse Engineering (DRE) and Schema Matching (SM) processes to legacy database(s) to produce a source wrapper for a legacy source. The source wrapper will be used by another component (e.g., the analysis component in Figure 1) wishing to communicate and exchange information with the legacy system.

First, SEEK generates a detailed description of the legacy source, including entities, relationships, application-specific meanings of the entities and relationships, business rules, data formatting and reporting constraints, etc. We collectively refer to this information as enterprise knowledge. The extracted enterprise knowledge forms a knowledge base that serves as input for subsequent steps. In particular, DRE connects to the underlying DBMS to extract schema information (most data sources support some form of Call-Level Interface such as JDBC). The schema information from the database is semantically enhanced using clues extracted by the semantic analyzer from available application code, business reports, and, in the future, perhaps other electronically available information that may encode business data, such as e-mail correspondence, corporate memos, etc. It has been our experience (through visits with representatives from the construction and manufacturing domains) that such application code exists and can be made available electronically. Second, the semantically enhanced legacy source schema must be mapped into the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model. In addition to the domain model, the schema mapper also needs access to the domain ontology (DO) describing the model.
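The first activity of this pipeline — connecting to the source through a call-level interface and pulling schema information out of the data dictionary — can be sketched in a few lines. The sketch below is our illustration, not SEEK code: Python's built-in sqlite3 module stands in for a JDBC-style interface, and the MSP_TASKS table is a made-up stand-in for the MS Project schema.

```python
import sqlite3

def extract_schema(conn):
    """Query the source's data dictionary and return {table: [(column, type), ...]},
    the raw structural information that semantic analysis would then enhance."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute("PRAGMA table_info(%s)" % table).fetchall()
        schema[table] = [(c[1], c[2]) for c in cols]
    return schema

# Usage: a throwaway in-memory database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MSP_TASKS (PROJ_ID INTEGER, TASK_UID INTEGER)")
print(extract_schema(conn))
# → {'MSP_TASKS': [('PROJ_ID', 'INTEGER'), ('TASK_UID', 'INTEGER')]}
```

A production version would differ only in the connectivity layer and the dictionary queries, which is exactly why SEEK isolates them in a configurable interface module.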
Finally, the extracted legacy schema and the mapping rules provide the input to the wrapper generator (not shown), which produces the source wrapper. In this paper, we focus on our implementation of the DRE algorithm.

3 DATA REVERSE ENGINEERING

Data reverse engineering (DRE) is defined as the application of analytical techniques to one or more legacy data sources to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s), in order to improve the database design or produce missing schema documentation. So far in SEEK, we are applying DRE to relational databases only. However, since the relational model has only limited semantic expressiveness, our DRE algorithm generates, in addition to the schema, an E/R-like representation of the entities and relationships that are not explicitly defined in the legacy schema (but which exist implicitly). Our approach to data reverse engineering for relational sources is based on existing algorithms by Chiang [1, 2] and Petit [8]. However, we have improved their methodologies in several ways, most importantly to reduce the dependency on human input and to eliminate some of the limitations of their algorithms (e.g., consistent naming of key attributes, legacy schema in 3NF).

Our DRE algorithm is divided into schema extraction and semantic analysis, which operate in interleaved fashion. An overview of the two algorithms, which together comprise eight steps (1: AST generation, 2: dictionary extraction, 3: code analysis, 4: inclusion dependency mining, 5: relation classification, 6: attribute classification, 7: entity identification, 8: relationship classification), is shown in Figure 2. In addition to the modules that execute each of the eight steps, the architecture in Figure 2 includes three support components. The configurable Database Interface Module (upper right-hand corner) provides connectivity to the underlying legacy source; note that this component is the ONLY source-specific component in the architecture: in order to perform knowledge extraction from different sources, only the interface module needs to be changed. The Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the form of an XML document so that it can be shared with other components in the SEEK architecture (e.g., the semantic matcher). The Metadata Repository is internal to DRE and is used to store intermediate run-time information needed by the algorithms, including user input parameters, the abstract syntax tree for the code (e.g., from a previous invocation), etc.

[Figure 2: Conceptual overview of the DRE algorithm: the eight step modules, the Database Interface Module connecting to the legacy DB, the Metadata Repository, and the Knowledge Encoder, which emits an XML document (with accompanying DTD) to the schema matcher.]

We now highlight each of the eight steps and related activities using an example from our construction supply chain testbed. For a detailed description of our algorithm, refer to [3]. For simplicity, we assume without loss of generality that only the following relations exist in the MS Project application; they will be discovered using DRE (for a description of the entire schema, refer to [5]):

    MSP-Project      [PROJ_ID, ...]
    MSP-Availability [PROJ_ID, AVAIL_UID, ...]
    MSP-Resources    [PROJ_ID, RES_UID, ...]
    MSP-Tasks        [PROJ_ID, TASK_UID, ...]
    MSP-Assignment   [PROJ_ID, ASSN_UID, ...]

In order to illustrate the code analysis and how it enhances the schema extraction, we refer the reader to the following C code fragment, which represents a simple, hypothetical interaction with the MS Project database:

    char *aValue, *cValue;
    int flag = 0;
    int bValue = 0;
    EXEC SQL SELECT A, C INTO :aValue, :cValue
             FROM Z WHERE B = :bValue;
    if (cValue < aValue)
        { flag = 1; }
    printf("Task Start Date %s ", aValue);
    printf("Task Finish Date %s ", cValue);

Step 1: AST Generation

We start by creating an Abstract Syntax Tree (AST) of the application code, shown in Figure 3. The AST will be used by the semantic analyzer for code exploration during step 3. Our objective in AST generation is to be able to associate "meaning" with program variables. Format strings in input/output statements contain semantic information that can be associated with the variables in those statements. Each such program variable may in turn be associated with a column of a table in the underlying legacy database.

[Figure 3: Application-specific code analysis via AST decomposition and code slicing. The AST of the fragment above contains dclns, embSQL (expanding into beginSQL, SQLselectone, columnlist A B C, hostvariablelist aValue cValue bValue, and SQLAssignment), if, and print nodes. The direction of slicing is backwards (resp. forwards) if the variable in question is in an output (resp. input or declaration) statement.]

Step 2: Dictionary Extraction

The goal of step 2 is to obtain the relation and attribute names from the legacy source. This is done by querying the data dictionary, which is stored in the underlying database in the form of one or more system tables. If primary key information cannot be retrieved directly from the data dictionary, the algorithm passes the set of candidate keys, along with predefined "rule-out" patterns, to the code analyzer. The code analyzer searches for these patterns in the application code and eliminates from the candidate set those attributes that occur in a rule-out pattern. The rule-out patterns, which are expressed as SQL queries, occur in the application code whenever the programmer expects to select a SET of tuples. If, after the code analysis, not all primary keys can be identified, the reduced set of candidate keys is presented to the user for final primary key selection.

Result. In the example DRE application, the following relations and their attributes were obtained from the MS Project database:

    MSP-Project      [PROJ_ID, ...]
    MSP-Availability [PROJ_ID, AVAIL_UID, ...]
    MSP-Resources    [PROJ_ID, RES_UID, ...]
    MSP-Tasks        [PROJ_ID, TASK_UID, ...]
    MSP-Assignment   [PROJ_ID, ASSN_UID, ...]

Step 3: Code Analysis

The objective of step 3, code analysis, is twofold: (1) augment the entities extracted in step 2 with domain semantics, and (2) identify business rules and constraints that are not explicitly stored in the database, but which may be important to the wrapper developer or to an application program accessing the legacy source. Our approach is based on code slicing [4] and pattern matching [7].

The first sub-step is pre-slicing. From the AST of the application code, the pre-slicer identifies all nodes corresponding to input, output, and embedded SQL statements. It appends the statement node name and identifier list to an array as the AST is traversed in pre-order. For example, for the AST in Figure 3, the array contains the information depicted in Table 1. The identifiers that occur in this data structure maintained by the pre-slicer form the set of slicing variables.

Table 1: Information maintained by the pre-slicer.

    Node number | Statement                  | Text string (for print nodes) | Identifiers    | Direction of slicing
    2           | embSQL (embedded SQL node) | -----                         | aValue, cValue | Backwards

The code slicer and analyzer, which implement the slicing and analysis sub-steps respectively, are executed once for each slicing variable identified by the pre-slicer. In the above example, the slicing variables that occur in SQL and output statements are aValue and cValue. The direction of slicing is fixed as backwards or forwards depending on whether the variable in question is part of an output (backwards) or input (forwards) statement. The slicing criterion is the exact statement (SQL, input, or output) node that corresponds to the slicing variable.

During the code slicing sub-step, we traverse the AST of the source code and retain only those nodes that have an occurrence of the slicing variable in their sub-tree. This results in a reduced AST, shown in Figure 4.

[Figure 4: Reduced AST for the example, retaining the dclns, embSQL, if, and print nodes.]

During the analysis sub-step, our algorithm extracts the information shown in Table 2 while traversing the reduced AST in pre-order:

1. If a dcln node is encountered, the data type of the identifier can be learned.
2. embSQL nodes contain the mapping from an identifier name to the corresponding column name and table name in the database.
3. printf/scanf nodes contain the mapping from the text string to the identifier; in other words, we can extract the "meaning" of the identifier from the text string.

Table 2: Information inferred during the analysis sub-step.

    Identifier | Meaning          | Possible business rule
    aValue     | Task Start Date  | if (cValue < aValue) { ... }
    cValue     | Task Finish Date | if (cValue < aValue) { ... }

    Identifier | Data type in source | Column name in source | Table name in source
    aValue     | char * => string    | A                     | Z
    cValue     | char * => string    | C                     | Z

The results of the analysis sub-step are appended to a result report file. After the code slicer and analyzer have been invoked on every slicing variable identified by the pre-slicer, the results report file is presented to the user. The user can then decide, based on the information extracted so far, whether to perform further analysis. If the user decides not to perform further analysis, code analysis passes control to the inclusion dependency detection module.

It is important to note that we identify enterprise knowledge by matching templates against code fragments in the AST. So far, we have developed patterns for discovering business rules, which are encoded in loop structures and/or conditional statements, and mathematical formulae, which are encoded in loop structures and/or assignment statements. Note that the occurrence of an assignment statement by itself does not necessarily indicate the presence of a mathematical formula, but the likelihood increases significantly if the statement contains one of the slicing variables.
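The analysis sub-step can be sketched in a language-neutral way. The following sketch is ours, not SEEK's analyzer (which walks a real C AST); it scans a toy list of pre-sliced statement nodes as strings and recovers the mappings of Table 2: each host variable is tied to its source column via the embedded SQL, and to a "meaning" via the printf format string.

```python
import re

# Toy stand-in for the pre-sliced AST nodes of the C fragment from step 3;
# a real implementation would traverse the reduced AST, not match strings.
STATEMENTS = [
    ("embSQL", "SELECT A,C INTO :aValue,:cValue FROM Z WHERE B = :bValue"),
    ("print",  'printf("Task Start Date %s ", aValue);'),
    ("print",  'printf("Task Finish Date %s ", cValue);'),
]

def analyze(statements):
    """For each identifier, recover its source column (from embedded SQL)
    and its 'meaning' (from the format string of an output statement)."""
    meaning, column = {}, {}
    for kind, text in statements:
        if kind == "embSQL":
            m = re.search(r"SELECT\s+(.*?)\s+INTO\s+(.*?)\s+FROM", text)
            names = [c.strip() for c in m.group(1).split(",")]
            hosts = [h.strip().lstrip(":") for h in m.group(2).split(",")]
            column.update(zip(hosts, names))       # aValue -> A, cValue -> C
        elif kind == "print":
            m = re.search(r'printf\("([^"%]*)%s\s*",\s*(\w+)\)', text)
            if m:                                  # format text names the variable
                meaning[m.group(2)] = m.group(1).strip()
    return meaning, column

meaning, column = analyze(STATEMENTS)
print(meaning)  # {'aValue': 'Task Start Date', 'cValue': 'Task Finish Date'}
print(column)   # {'aValue': 'A', 'cValue': 'C'}
```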
Step 4: Discovering Inclusion Dependencies

After extraction of the relational schema in step 2, the goal of step 4 is to identify constraints that help classify the extracted relations, which represent both real-world entities and the relationships among them. This is done using inclusion dependencies (INDs), which indicate the existence of inter-relational constraints, including class/subclass relationships.

Let A and B be two relations, and let X and Y be attributes or sets of attributes of A and B respectively. An inclusion dependency A.X << B.Y denotes that the set of values appearing in A.X is a subset of the values appearing in B.Y. Inclusion dependencies are discovered by examining all possible subset relationships between any two relations A and B in the legacy source. Without additional input from the domain expert, inclusion dependencies can be identified in an exhaustive manner as follows: for each pair of relations A and B in the legacy source schema, compare the values for each non-key attribute combination X in B with the values of each candidate key attribute combination Y in A (note that X and Y may be single attributes). An inclusion dependency B.X << A.Y is recorded whenever the values of B.X form a subset of the values of A.Y. In our example, the last two inclusion dependencies so discovered are removed, since they are implicitly contained in the inclusion dependencies listed in lines 2, 3 and 4 by the transitivity relationship.

Step 5: Classification of the Relations

When reverse-engineering a relational schema, it is important to understand that, due to the limited expressiveness of the relational model, all real-world entities are represented as relations irrespective of their types and roles in the model. The goal of this step is to identify the different "types" of relations, some of which correspond to actual real-world entities while others represent relationships among them. In this step, all the relations in the database are classified into one of four types: strong, regular, weak, or specific. This is done using the primary key information obtained in step 2 and the inclusion dependencies from step 4. Intuitively, a strong entity-relation represents a real-world entity whose members can be identified exclusively through its own properties. A weak entity-relation represents an entity whose members can be identified only through their relationships with other (strong) entities.
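The exhaustive IND discovery of step 4 can be sketched as follows. This is our illustration; the relation instances and candidate keys are made up, and the subset test mirrors the comparison of non-key attributes of B against candidate keys of A described above.

```python
from itertools import product

# Illustrative instance data (not from the paper): relation -> {attribute: values}.
RELATIONS = {
    "MSP_Project": {"PROJ_ID": [1, 2, 3]},
    "MSP_Tasks":   {"PROJ_ID": [1, 2], "TASK_UID": [10, 11]},
}
# Candidate keys as they would come out of step 2 (also illustrative).
CANDIDATE_KEYS = {"MSP_Project": ["PROJ_ID"], "MSP_Tasks": ["TASK_UID"]}

def find_inds(relations, candidate_keys):
    """Exhaustive IND discovery: record B.X << A.Y whenever the values of a
    non-key attribute X of B form a subset of candidate key Y of A."""
    inds = []
    for a, b in product(relations, repeat=2):
        if a == b:
            continue
        for y in candidate_keys.get(a, []):
            for x in relations[b]:
                if x in candidate_keys.get(b, []):
                    continue  # only non-key attributes of B are compared
                if set(relations[b][x]) <= set(relations[a][y]):
                    inds.append((b, x, a, y))  # read as B.X << A.Y
    return inds

print(find_inds(RELATIONS, CANDIDATE_KEYS))
# [('MSP_Tasks', 'PROJ_ID', 'MSP_Project', 'PROJ_ID')]
```

The quadratic pairing over relations makes the cost of skipping domain-expert input explicit, which is why SEEK lets the user prune the candidate set beforehand.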