    SEEKing Knowledge in Legacy Information Systems to
                Support Interoperability
Joachim Hammer, Mark Schmalz, William O’Brien¥, Sangeetha Shekar and Nikhil Haldevnekar
Dept. of Computer & Information Science & Engineering
University of Florida
Gainesville, FL 32605, U.S.A.

¥ Rinker School of Building Construction, University of Florida, Gainesville, FL 32634-6134

Abstract. The SEEK project (Scalable Extraction of Enterprise Knowledge) is developing methodologies to overcome the problems of assembling knowledge resident in numerous legacy information systems by enabling rapid connection to, and privacy-constrained filtering of, legacy data and applications with little programmatic setup. In this report we outline our use of data reverse engineering and code analysis techniques to automatically infer, as much as possible, the schema and semantics of a legacy information system. We illustrate the approach using an example from our construction supply chain testbed.

1   MOTIVATION

We are developing methodologies and algorithms to facilitate discovery and extraction of enterprise knowledge from legacy sources. These capabilities are being implemented in a toolkit called SEEK (Scalable Extraction of Enterprise Knowledge). SEEK is being developed as part of a larger, multi-disciplinary research project to develop theory and methodologies in support of computerized decision and negotiation support across a network of firms (general overview in [6]). SEEK is not meant as a replacement for wrapper or mediator development toolkits. Rather, it complements existing tools by providing input about the contents and structure of the legacy source that has so far been supplied manually by domain experts. This streamlines the process and makes wrapper development scalable.

Figure 1 illustrates the need for knowledge extraction tools in support of wrapper development in the context of a supply chain. There are many firms (principally, subcontractors and suppliers), and each firm maintains legacy data used to manage internal processes. This data is also useful as input to a project-level decision support tool. However, the large number of firms working on a project makes it likely that there will be a high degree of physical and semantic heterogeneity in their legacy systems. This implies practical difficulties in connecting firms’ data and systems with enterprise-level decision support tools. It is the role of the SEEK toolkit to help establish the necessary connections with minimal burden on the underlying firms, which often have limited technical expertise. The SEEK wrappers shown in Figure 1 are wholly owned by the firm they are accessing and hence provide a safety layer between the source and end user. Security can be further enhanced by deploying the wrappers in a secure hosting infrastructure at an ISP, for example, as shown in the figure.

We note that SEEK is not intended to be a general-purpose data extraction tool: SEEK extracts a narrow range of data and knowledge from heterogeneous sources. Current instantiations of SEEK are designed to extract the limited range of information needed by these process models to support project optimization.

Figure 1: Using the SEEK toolkit to improve coordination in extended enterprises. [figure: suppliers and sub/suppliers in the extended enterprise connect through per-firm SEEK wrappers, deployed in a secure hosting infrastructure, to an analysis tool (e.g., E-ERP) used by the coordinator/lead firm]

2   SEEK APPROACH TO KNOWLEDGE EXTRACTION

SEEK applies Data Reverse Engineering (DRE) and Schema Matching (SM) processes to legacy database(s) to produce a source wrapper for a legacy source. The source wrapper will be used by another component (e.g., the analysis component in Figure 1) wishing to communicate and exchange information with the legacy system.

First, SEEK generates a detailed description of the legacy source, including entities, relationships, application-specific meanings of the entities and relationships, business rules, data formatting and reporting constraints, etc. We collectively refer to this information as enterprise knowledge. The extracted enterprise knowledge forms a knowledge base that serves as input for subsequent steps. In particular, DRE connects to the underlying DBMS to extract schema information (most data sources support some form of Call-Level Interface such as JDBC). The schema information from the database is semantically enhanced using clues extracted by the semantic analyzer from available application code, business reports and, in the future, perhaps other electronically available information that may encode business data, such as e-mail correspondence, corporate memos, etc. It has been our experience (through visits with representatives from the construction and
manufacturing domains) that such application code exists and can be made available electronically. Second, the semantically enhanced legacy source schema must be mapped into the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model. In addition to the domain model, the schema mapper also needs access to the domain ontology (DO) describing the model.

Finally, the extracted legacy schema and the mapping rules provide the input to the wrapper generator (not shown), which produces the source wrapper. In this paper, we focus on our implementation of the DRE algorithm.

3   Data Reverse Engineering

Data reverse engineering (DRE) is defined as the application of analytical techniques to one or more legacy data sources to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s) in order to improve the database design or produce missing schema documentation. So far in SEEK, we are applying DRE to relational databases only. However, since the relational model has only limited semantic expressiveness, our DRE algorithm generates, in addition to the schema, an E/R-like representation of the entities and relationships that are not explicitly defined in the legacy schema (but which exist implicitly). Our approach to data reverse engineering for relational sources is based on existing algorithms by Chiang [1, 2] and Petit [8]. However, we have improved their methodologies in several ways, most importantly to reduce the dependency on human input and to eliminate some of the limitations of their algorithms (e.g., consistent naming of key attributes, legacy schema in 3NF).

Figure 2: Conceptual overview of the DRE algorithm. [figure: an eight-step pipeline (1 AST Generation, 2 Dictionary Extraction, 3 Code Analysis, 4 Inclusion Dependency Mining, 5 Relation Classification, 6 Attribute Classification, 7 Entity Identification, 8 Relationship Classification) supported by the Database Interface Module, the Metadata Repository, and the Knowledge Encoder, which emits an XML document for the Schema Matcher]

Our DRE algorithm is divided into schema extraction and semantic analysis, which operate in interleaved fashion. An overview of the two algorithms, which comprise eight steps, is shown in Figure 2. In addition to the modules that execute each of the eight steps, the architecture in Figure 2 includes three support components: the configurable Database Interface Module (upper right-hand corner), which provides connectivity to the underlying legacy source. Note that this component is the ONLY source-specific component in the architecture: in order to perform knowledge extraction from different sources, only the interface module needs to be changed. The Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the form of an XML document so that it can be shared with other components in the SEEK architecture (e.g., the semantic matcher). The Metadata Repository is internal to DRE and is used to store intermediate run-time information needed by the algorithms, including user input parameters, the abstract syntax tree for the code (e.g., from a previous invocation), etc.

We now highlight each of the eight steps and related activities outlined in Figure 2 using an example from our construction supply chain testbed. For a detailed description of our algorithm, refer to [3]. For simplicity, we assume without loss of generality or specificity that only the following relations exist in the MS-Project application, which will be discovered using DRE (for a description of the entire schema refer to [5]):

MSP-Project [PROJ_ID, ...]
MSP-Availability [PROJ_ID, AVAIL_UID, ...]
MSP-Resources [PROJ_ID, RES_UID, ...]
MSP-Tasks [PROJ_ID, TASK_UID, ...]
MSP-Assignment [PROJ_ID, ASSN_UID, ...]

In order to illustrate the code analysis and how it enhances the schema extraction, we refer the reader to the following C code fragment representing a simple, hypothetical interaction with the MS Project database.

    char *aValue, *cValue;
    int flag = 0;
    int bValue = 0;
    EXEC SQL SELECT A, C INTO :aValue, :cValue
             FROM Z WHERE B = :bValue;
    if (cValue < aValue)
        { flag = 1; }
    printf("Task Start Date %s ", aValue);
    printf("Task Finish Date %s ", cValue);

Step 1: AST Generation

We start by creating an Abstract Syntax Tree (AST), shown in Figure 3. The AST will be used by the semantic analyzer for code exploration during step 3. Our objective in AST generation is to be able to associate "meaning" with program variables. Format strings in input/output statements contain semantic information that can be associated with the variables in the input/output statement. These program variables, in turn, may be associated with a column of a table in the underlying legacy database.

Figure 3: Application-specific code analysis via AST decomposition and code slicing. The direction of slicing is backwards (forward) if the variable in question is in an output (resp. input or declaration) statement. [figure: the program AST with dclns, embSQL, if, and print nodes; the embSQL subtree maps the host variables aValue, cValue, and bValue to columns A, C, and B of table Z]
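To make the association between format strings and program variables concrete, the following minimal C sketch shows one possible AST node layout for step 1. The type and field names (AstNode, kind, column, meaning) are our own illustration and not part of the SEEK implementation.

    /* Illustrative sketch only: a possible AST node layout for step 1.     */
    #include <string.h>

    typedef struct AstNode {
        const char      *kind;        /* e.g., "dcln", "embSQL", "if", "print" */
        const char      *identifier;  /* program variable, e.g., "aValue"      */
        const char      *column;      /* mapped column, e.g., "A" (from embSQL)*/
        const char      *meaning;     /* text gleaned from a format string     */
        struct AstNode **children;
        int              nchildren;
    } AstNode;

    /* When a print node such as printf("Task Start Date %s ", aValue) is
     * visited, the literal text preceding the conversion specifier is
     * attached to the node for variable aValue as its "meaning".            */
    static void attach_meaning(AstNode *var_node, const char *format_text)
    {
        char *text = strdup(format_text);      /* strdup: POSIX / C23 */
        char *spec = strstr(text, "%s");       /* keep only the descriptive prefix */
        if (spec != NULL)
            *spec = '\0';
        var_node->meaning = text;              /* e.g., "Task Start Date " */
    }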
Step 2. Dictionary Extraction.

The goal of step 2 is to obtain the relation and attribute names from the legacy source. This is done by querying the data dictionary, which is stored in the underlying database in the form of one or more system tables. Where available, primary key information is also obtained from the dictionary. Otherwise, if primary key information cannot be retrieved directly from the data dictionary, the algorithm passes the set of candidate keys along with predefined "rule-out" patterns to the code analyzer. The code analyzer searches for these patterns in the application code and eliminates from the candidate set those attributes that occur in a rule-out pattern. The rule-out patterns, which are expressed as SQL queries, occur in the application code whenever the programmer expects to select a SET of tuples. If, after the code analysis, not all primary keys can be identified, the reduced set of candidate keys is presented to the user for final primary key selection.

Result. In the example DRE application, the following relations and their attributes were obtained from the MS-Project database:

MSP-Project [PROJ_ID, ...]
MSP-Availability [PROJ_ID, AVAIL_UID, ...]
MSP-Resources [PROJ_ID, RES_UID, ...]
MSP-Tasks [PROJ_ID, TASK_UID, ...]
MSP-Assignment [PROJ_ID, ASSN_UID, ...]
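As a concrete illustration of the dictionary query described above, the fragment below sketches how relation and attribute names could be read through a Call-Level Interface from a source that exposes the ANSI INFORMATION_SCHEMA views. The view names and the embedded-SQL cursor are an assumption made for illustration only; the actual catalog tables differ per DBMS and, in SEEK, are accessed through the configurable Database Interface Module.

    /* Sketch only: assumes the source exposes the ANSI INFORMATION_SCHEMA;  */
    /* real catalogs (and SEEK's Database Interface Module) differ per DBMS. */
    EXEC SQL INCLUDE SQLCA;
    char tableName[129], columnName[129];

    EXEC SQL DECLARE dict CURSOR FOR
        SELECT TABLE_NAME, COLUMN_NAME
        FROM   INFORMATION_SCHEMA.COLUMNS
        ORDER  BY TABLE_NAME;

    EXEC SQL OPEN dict;
    for (;;) {
        EXEC SQL FETCH dict INTO :tableName, :columnName;
        if (sqlca.sqlcode != 0) break;            /* no more rows */
        printf("%s.%s\n", tableName, columnName); /* relation.attribute */
    }
    EXEC SQL CLOSE dict;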
Step 3: Code Analysis

The objective of step 3, code analysis, is twofold: (1) augment the entities extracted in step 2 with domain semantics, and (2) identify business rules and constraints that are not explicitly stored in the database, but which may be important to the wrapper developer or application program accessing the legacy source. Our approach to code analysis includes slicing [4] and pattern matching [7].

The first sub-step is pre-slicing. From the AST of the application code, the pre-slicer identifies all the nodes corresponding to input, output, and embedded SQL statements. It appends the statement node name and identifier list to an array as the AST is traversed in pre-order. For example, for the AST in Figure 3, the array contains the information depicted in Table 1. The identifiers that occur in this data structure maintained by the pre-slicer form the set of slicing variables. A minimal sketch of one such array entry is given after Table 1.

Table 1: Information maintained by the pre-slicer.

  Node     Statement                    Text String          Identifiers       Direction
  number                                (for print nodes)                      of slicing
  2        embSQL (embedded SQL node)   -----                aValue, cValue    Backwards
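The following minimal C sketch shows one way the pre-slicer's array entries of Table 1 could be represented. The type and field names (PreSliceEntry, node_number, direction, etc.) are our own illustration, not SEEK's actual data structures.

    /* Illustrative only: a possible record for one row of Table 1.        */
    enum SliceDirection { SLICE_BACKWARDS, SLICE_FORWARDS };

    typedef struct PreSliceEntry {
        int                  node_number;      /* e.g., 2                       */
        const char          *statement;        /* e.g., "embSQL"                */
        const char          *text_string;      /* format string, print nodes    */
        const char          *identifiers[8];   /* e.g., "aValue", "cValue"      */
        int                  nidentifiers;
        enum SliceDirection  direction;        /* backwards for output/SQL vars */
    } PreSliceEntry;

    /* The entry for node 2 of Figure 3 would be populated as follows.     */
    static const PreSliceEntry example = {
        2, "embSQL", NULL, { "aValue", "cValue" }, 2, SLICE_BACKWARDS
    };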
The code slicer and analyzer, which represent the second and third sub-steps of code analysis, respectively, are executed once for each slicing variable identified by the pre-slicer. In the above example, the slicing variables that occur in SQL and output statements are aValue and cValue. The direction of slicing is fixed as backwards or forwards depending on whether the variable in question is part of an output (backwards) or input (forwards) statement. The slicing criterion is the exact statement (SQL or input or output) node that corresponds to the slicing variable.

During the code slicing sub-step we traverse the AST of the source code and retain only those nodes that have an occurrence of the slicing variable in their sub-tree. This results in a reduced AST, which is shown in Figure 4.

Figure 4: Reduced AST. [figure: the reduced AST retains only the dclns, embSQL, if, and print nodes]

During the analysis sub-step, our algorithm extracts the information shown in Table 2 while traversing the reduced AST in pre-order:
1. If a dcln node is encountered, the data type of the identifier can be learned.
2. embSQL nodes contain the mapping from the identifier name to the corresponding column name and table name in the database.
3. printf/scanf nodes contain the mapping from the text string to the identifier. In other words, we can extract the 'meaning' of the identifier from the text string.

Table 2: Information inferred during the analysis sub-step.

  Identifier name   Meaning            Possible business rule
  aValue            Task Start Date    if (cValue < aValue) { ... }
  cValue            Task Finish Date   if (cValue < aValue) { ... }

  Data type          Column name in source   Table name in source
  char * => string   A                       Z
  char * => string   C                       Z

The results of the analysis sub-step are appended to a results report file. After the code slicer and analyzer have been invoked on every slicing variable identified by the pre-slicer, the results report file is presented to the user. The user can then decide whether to perform further analysis based on the information extracted so far. If the user decides not to perform further analysis, code analysis passes control to the inclusion dependency detection module.

It is important to note that we identify enterprise knowledge by matching templates against code fragments in the AST. So far, we have developed patterns for discovering business rules, which are encoded in loop structures and/or conditional statements, and mathematical formulae, which are encoded in loop structures and/or assignment statements. Note that the occurrence of an assignment statement by itself does not necessarily indicate the presence of a mathematical formula, but the likelihood increases significantly if the statement contains one of the slicing variables.
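As an illustration of this template matching, the sketch below (reusing the hypothetical AstNode layout introduced after step 1) walks the AST and reports every if node whose condition mentions one of the slicing variables as a candidate business rule, much like the last column of Table 2. It is a simplification for exposition, not SEEK's pattern matcher, and assumes the first child of an if node is its condition expression.

    /* Illustrative only: report conditionals over slicing variables as     */
    /* candidate business rules (cf. the last column of Table 2).           */
    #include <stdio.h>
    #include <string.h>

    static int uses_slicing_var(const AstNode *n,
                                const char *const vars[], int nvars)
    {
        if (n == NULL)
            return 0;
        for (int i = 0; i < nvars; i++)
            if (n->identifier != NULL && strcmp(n->identifier, vars[i]) == 0)
                return 1;
        for (int i = 0; i < n->nchildren; i++)
            if (uses_slicing_var(n->children[i], vars, nvars))
                return 1;
        return 0;
    }

    static void find_business_rules(const AstNode *n,
                                    const char *const vars[], int nvars)
    {
        if (n == NULL)
            return;
        /* Template: an if node whose condition (assumed to be the first    */
        /* child) mentions a slicing variable, e.g., if (cValue < aValue).  */
        if (strcmp(n->kind, "if") == 0 && n->nchildren > 0 &&
            uses_slicing_var(n->children[0], vars, nvars))
            printf("candidate business rule found at if-node\n");
        for (int i = 0; i < n->nchildren; i++)
            find_business_rules(n->children[i], vars, nvars);
    }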
Step 4. Discovering Inclusion Dependencies.

After extraction of the relational schema in step 2, the goal of step 4 is to identify constraints that help classify the extracted relations, which represent both the real-world entities and the relationships among them. This is done using inclusion dependencies (INDs), which indicate the existence of inter-relational constraints, including class/subclass relationships.

Let A and B be two relations, and let X and Y be attributes or sets of attributes of A and B, respectively. An inclusion dependency A.X << B.Y denotes that the set of values appearing in A.X is a subset of the set of values appearing in B.Y. Inclusion dependencies are discovered by examining all possible subset relationships between any two relations A and B in the legacy source.
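The subset test for one candidate dependency can be expressed directly as a query against the legacy source. The fragment below, in the embedded-SQL style of the earlier code example, counts violating values for a hypothetical attribute pair B.X and A.Y and accepts the dependency only when none are found; it illustrates the exhaustive check described next, not SEEK's actual implementation.

    /* Illustrative only: does every value of B.X also appear in A.Y?      */
    /* Relation and attribute names are placeholders for one candidate     */
    /* pair produced by the exhaustive enumeration.                        */
    long violations = 0;

    EXEC SQL SELECT COUNT(*) INTO :violations
             FROM  B
             WHERE B.X IS NOT NULL
               AND B.X NOT IN (SELECT A.Y FROM A);

    if (violations == 0) {
        /* candidate inclusion dependency B.X << A.Y holds on this state */
        printf("IND accepted: B.X << A.Y\n");
    }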
Without additional input from the domain expert, inclusion dependencies can be identified in an exhaustive manner as follows: for each pair of relations A and B in the legacy source schema, compare the values for each non-key attribute combination X in B with the values of each candidate key attribute combination Y in A (note that X and Y may be single attributes). An inclusion dependency B.X<

The last two inclusion dependencies are removed since they are implicitly contained in the inclusion dependencies listed in lines 2, 3 and 4 using transitivity.

Step 5. Classification of the Relations.

When reverse-engineering a relational schema, it is important to understand that, due to the limited expressiveness of the relational model, all real-world entities are represented as relations irrespective of their types and roles in the model. The goal of this step is to identify the different "types" of relations, some of which correspond to actual real-world entities while others represent relationships among them.

In this step, all the relations in the database are classified into one of four types: strong, regular, weak, or specific. Identifying the different relation types is done using the primary key information obtained in step 2 and the inclusion dependencies from step 4. Intuitively, a strong entity-relation represents a real-world entity whose members can be identified exclusively through its own properties. A weak entity-relation represents an