SEEKing Knowledge in Legacy Information Systems to Support Interoperability

Joachim Hammer, Mark Schmalz, William O'Brien¥, Sangeetha Shekar and Nikhil Haldevnekar
Dept. of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32605, U.S.A.

Abstract. The SEEK project (Scalable Extraction of Enterprise Knowledge) is developing methodologies to overcome the problems of assembling knowledge resident in numerous legacy information systems by enabling rapid connection to, and privacy-constrained filtering of, legacy data and applications with little programmatic setup. In this report we outline our use of data reverse engineering and code analysis techniques to automatically infer as much as possible of the schema and semantics of a legacy information system. We illustrate the approach using an example from our construction supply chain testbed.

1 MOTIVATION

We are developing methodologies and algorithms to facilitate discovery and extraction of enterprise knowledge from legacy sources. These capabilities are being implemented in a toolkit called SEEK (Scalable Extraction of Enterprise Knowledge). SEEK is being developed as part of a larger, multi-disciplinary research project to develop theory and methodologies in support of computerized decision and negotiation support across a network of firms (general overview in [6]). SEEK is not meant as a replacement for wrapper or mediator development toolkits.

[Figure 1: Using the SEEK toolkit to improve coordination in extended enterprises. The figure shows an extended enterprise of a coordinator, subcontractors, and suppliers; each firm's legacy source is accessed through a SEEK wrapper, and the wrappers connect to an analysis component deployed in a secure hosting infrastructure (e.g., E-ERP).]
Rather, it complements existing tools by providing input about the contents and structure of the legacy source that has so far been supplied manually by domain experts. This streamlines the process and makes wrapper development scalable.

Figure 1 illustrates the need for knowledge extraction tools in support of wrapper development in the context of a supply chain. There are many firms (principally, subcontractors and suppliers), and each firm contains legacy data used to manage internal processes. This data is also useful as input to a project-level decision support tool. However, the large number of firms working on a project makes it likely that there will be a high degree of physical and semantic heterogeneity in their legacy systems. This implies practical difficulties in connecting firms' data and systems with enterprise-level decision support tools. It is the role of the SEEK toolkit to help establish the necessary connections with minimal burden on the underlying firms, which often have limited technical expertise. The SEEK wrappers shown in Fig. 1 are wholly owned by the firm they are accessing and hence provide a safety layer between the source and end user. Security can be further enhanced by deploying the wrappers in a secure hosting infrastructure, at an ISP for example, as shown in the figure.

We note that SEEK is not intended to be a general-purpose data extraction tool: SEEK extracts a narrow range of data and knowledge from heterogeneous sources. Current instantiations of SEEK are designed to extract the limited range of information needed by these process models to support project optimization.

¥ Rinker School of Building Construction, University of Florida, Gainesville, FL 32634-6134

2 SEEK APPROACH TO KNOWLEDGE EXTRACTION

SEEK applies Data Reverse Engineering (DRE) and Schema Matching (SM) processes to legacy database(s) to produce a source wrapper for a legacy source. The source wrapper will be used by another component (e.g., the analysis component in Figure 1) wishing to communicate and exchange information with the legacy system.

First, SEEK generates a detailed description of the legacy source, including entities, relationships, application-specific meanings of the entities and relationships, business rules, data formatting and reporting constraints, etc. We collectively refer to this information as enterprise knowledge. The extracted enterprise knowledge forms a knowledge base that serves as input for subsequent steps. In particular, DRE connects to the underlying DBMS to extract schema information (most data sources support some form of Call-Level Interface such as JDBC). The schema information from the database is semantically enhanced using clues extracted by the semantic analyzer from available application code, business reports, and, in the future, perhaps other electronically available information that may encode business data, such as e-mail correspondence, corporate memos, etc. It has been our experience (through visits with representatives from the construction and manufacturing domains) that such application code exists and can be made available electronically. Second, the semantically enhanced legacy source schema must be mapped into the domain model (DM) used by the application(s) that want(s) to access the legacy source. This is done using a schema mapping process that produces the mapping rules between the legacy source schema and the application domain model. In addition to the domain model, the schema mapper also needs access to the domain ontology (DO) describing the model.
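The first activity of this pipeline — connecting to the source through a call-level interface and pulling schema information out of the data dictionary — can be sketched in a few lines. The sketch below is our illustration, not SEEK code: Python's built-in sqlite3 module stands in for a JDBC-style interface, and the MSP_TASKS table is a made-up stand-in for the MS Project schema.

```python
import sqlite3

def extract_schema(conn):
    """Query the source's data dictionary and return {table: [(column, type), ...]},
    the raw structural information that semantic analysis would then enhance."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute("PRAGMA table_info(%s)" % table).fetchall()
        schema[table] = [(c[1], c[2]) for c in cols]
    return schema

# Usage: a throwaway in-memory database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MSP_TASKS (PROJ_ID INTEGER, TASK_UID INTEGER)")
print(extract_schema(conn))
# → {'MSP_TASKS': [('PROJ_ID', 'INTEGER'), ('TASK_UID', 'INTEGER')]}
```

A production version would differ only in the connectivity layer and the dictionary queries, which is exactly why SEEK isolates them in a configurable interface module.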
Finally, the extracted legacy schema and the mapping rules provide the input to the wrapper generator (not shown), which produces the source wrapper. In this paper, we focus on our implementation of the DRE algorithm.

3 DATA REVERSE ENGINEERING

Data reverse engineering (DRE) is defined as the application of analytical techniques to one or more legacy data sources to elicit structural information (e.g., term definitions, schema definitions) from the legacy source(s), in order to improve the database design or produce missing schema documentation. So far in SEEK, we are applying DRE to relational databases only. However, since the relational model has only limited semantic expressiveness, our DRE algorithm generates, in addition to the schema, an E/R-like representation of the entities and relationships that are not explicitly defined in the legacy schema (but which exist implicitly). Our approach to data reverse engineering for relational sources is based on existing algorithms by Chiang [1, 2] and Petit [8]. However, we have improved their methodologies in several ways, most importantly to reduce the dependency on human input and to eliminate some of the limitations of their algorithms (e.g., consistent naming of key attributes, legacy schema in 3NF).

Our DRE algorithm is divided into schema extraction and semantic analysis, which operate in interleaved fashion. An overview of the two algorithms, which together comprise eight steps (1: AST generation, 2: dictionary extraction, 3: code analysis, 4: inclusion dependency mining, 5: relation classification, 6: attribute classification, 7: entity identification, 8: relationship classification), is shown in Figure 2. In addition to the modules that execute each of the eight steps, the architecture in Figure 2 includes three support components. The configurable Database Interface Module (upper right-hand corner) provides connectivity to the underlying legacy source; note that this component is the ONLY source-specific component in the architecture: in order to perform knowledge extraction from different sources, only the interface module needs to be changed. The Knowledge Encoder (lower right-hand corner) represents the extracted knowledge in the form of an XML document so that it can be shared with other components in the SEEK architecture (e.g., the semantic matcher). The Metadata Repository is internal to DRE and is used to store intermediate run-time information needed by the algorithms, including user input parameters, the abstract syntax tree for the code (e.g., from a previous invocation), etc.

[Figure 2: Conceptual overview of the DRE algorithm: the eight step modules, the Database Interface Module connecting to the legacy DB, the Metadata Repository, and the Knowledge Encoder, which emits an XML document (with accompanying DTD) to the schema matcher.]

We now highlight each of the eight steps and related activities using an example from our construction supply chain testbed. For a detailed description of our algorithm, refer to [3]. For simplicity, we assume without loss of generality that only the following relations exist in the MS Project application; they will be discovered using DRE (for a description of the entire schema, refer to [5]):

    MSP-Project      [PROJ_ID, ...]
    MSP-Availability [PROJ_ID, AVAIL_UID, ...]
    MSP-Resources    [PROJ_ID, RES_UID, ...]
    MSP-Tasks        [PROJ_ID, TASK_UID, ...]
    MSP-Assignment   [PROJ_ID, ASSN_UID, ...]

In order to illustrate the code analysis and how it enhances the schema extraction, we refer the reader to the following C code fragment, which represents a simple, hypothetical interaction with the MS Project database:

    char *aValue, *cValue;
    int flag = 0;
    int bValue = 0;
    EXEC SQL SELECT A, C INTO :aValue, :cValue
             FROM Z WHERE B = :bValue;
    if (cValue < aValue)
        { flag = 1; }
    printf("Task Start Date %s ", aValue);
    printf("Task Finish Date %s ", cValue);

Step 1: AST Generation

We start by creating an Abstract Syntax Tree (AST) of the application code, shown in Figure 3. The AST will be used by the semantic analyzer for code exploration during step 3. Our objective in AST generation is to be able to associate "meaning" with program variables. Format strings in input/output statements contain semantic information that can be associated with the variables in those statements. Each such program variable may in turn be associated with a column of a table in the underlying legacy database.

[Figure 3: Application-specific code analysis via AST decomposition and code slicing. The AST of the fragment above contains dclns, embSQL (expanding into beginSQL, SQLselectone, columnlist A B C, hostvariablelist aValue cValue bValue, and SQLAssignment), if, and print nodes. The direction of slicing is backwards (resp. forwards) if the variable in question is in an output (resp. input or declaration) statement.]

Step 2: Dictionary Extraction

The goal of step 2 is to obtain the relation and attribute names from the legacy source. This is done by querying the data dictionary, which is stored in the underlying database in the form of one or more system tables. If primary key information cannot be retrieved directly from the data dictionary, the algorithm passes the set of candidate keys, along with predefined "rule-out" patterns, to the code analyzer. The code analyzer searches for these patterns in the application code and eliminates from the candidate set those attributes that occur in a rule-out pattern. The rule-out patterns, which are expressed as SQL queries, occur in the application code whenever the programmer expects to select a SET of tuples. If, after the code analysis, not all primary keys can be identified, the reduced set of candidate keys is presented to the user for final primary key selection.

Result. In the example DRE application, the following relations and their attributes were obtained from the MS Project database:

    MSP-Project      [PROJ_ID, ...]
    MSP-Availability [PROJ_ID, AVAIL_UID, ...]
    MSP-Resources    [PROJ_ID, RES_UID, ...]
    MSP-Tasks        [PROJ_ID, TASK_UID, ...]
    MSP-Assignment   [PROJ_ID, ASSN_UID, ...]

Step 3: Code Analysis

The objective of step 3, code analysis, is twofold: (1) augment the entities extracted in step 2 with domain semantics, and (2) identify business rules and constraints that are not explicitly stored in the database, but which may be important to the wrapper developer or to an application program accessing the legacy source. Our approach is based on code slicing [4] and pattern matching [7].

The first sub-step is pre-slicing. From the AST of the application code, the pre-slicer identifies all nodes corresponding to input, output, and embedded SQL statements. It appends the statement node name and identifier list to an array as the AST is traversed in pre-order. For example, for the AST in Figure 3, the array contains the information depicted in Table 1. The identifiers that occur in this data structure maintained by the pre-slicer form the set of slicing variables.

Table 1: Information maintained by the pre-slicer.

    Node number | Statement                  | Text string (for print nodes) | Identifiers    | Direction of slicing
    2           | embSQL (embedded SQL node) | -----                         | aValue, cValue | Backwards

The code slicer and analyzer, which implement the slicing and analysis sub-steps respectively, are executed once for each slicing variable identified by the pre-slicer. In the above example, the slicing variables that occur in SQL and output statements are aValue and cValue. The direction of slicing is fixed as backwards or forwards depending on whether the variable in question is part of an output (backwards) or input (forwards) statement. The slicing criterion is the exact statement (SQL, input, or output) node that corresponds to the slicing variable.

During the code slicing sub-step, we traverse the AST of the source code and retain only those nodes that have an occurrence of the slicing variable in their sub-tree. This results in a reduced AST, shown in Figure 4.

[Figure 4: Reduced AST for the example, retaining the dclns, embSQL, if, and print nodes.]

During the analysis sub-step, our algorithm extracts the information shown in Table 2 while traversing the reduced AST in pre-order:

1. If a dcln node is encountered, the data type of the identifier can be learned.
2. embSQL nodes contain the mapping from an identifier name to the corresponding column name and table name in the database.
3. printf/scanf nodes contain the mapping from the text string to the identifier; in other words, we can extract the "meaning" of the identifier from the text string.

Table 2: Information inferred during the analysis sub-step.

    Identifier | Meaning          | Possible business rule
    aValue     | Task Start Date  | if (cValue < aValue) { ... }
    cValue     | Task Finish Date | if (cValue < aValue) { ... }

    Identifier | Data type in source | Column name in source | Table name in source
    aValue     | char * => string    | A                     | Z
    cValue     | char * => string    | C                     | Z

The results of the analysis sub-step are appended to a result report file. After the code slicer and analyzer have been invoked on every slicing variable identified by the pre-slicer, the results report file is presented to the user. The user can then decide, based on the information extracted so far, whether to perform further analysis. If the user decides not to perform further analysis, code analysis passes control to the inclusion dependency detection module.

It is important to note that we identify enterprise knowledge by matching templates against code fragments in the AST. So far, we have developed patterns for discovering business rules, which are encoded in loop structures and/or conditional statements, and mathematical formulae, which are encoded in loop structures and/or assignment statements. Note that the occurrence of an assignment statement by itself does not necessarily indicate the presence of a mathematical formula, but the likelihood increases significantly if the statement contains one of the slicing variables.
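The analysis sub-step can be sketched in a language-neutral way. The following sketch is ours, not SEEK's analyzer (which walks a real C AST); it scans a toy list of pre-sliced statement nodes as strings and recovers the mappings of Table 2: each host variable is tied to its source column via the embedded SQL, and to a "meaning" via the printf format string.

```python
import re

# Toy stand-in for the pre-sliced AST nodes of the C fragment from step 3;
# a real implementation would traverse the reduced AST, not match strings.
STATEMENTS = [
    ("embSQL", "SELECT A,C INTO :aValue,:cValue FROM Z WHERE B = :bValue"),
    ("print",  'printf("Task Start Date %s ", aValue);'),
    ("print",  'printf("Task Finish Date %s ", cValue);'),
]

def analyze(statements):
    """For each identifier, recover its source column (from embedded SQL)
    and its 'meaning' (from the format string of an output statement)."""
    meaning, column = {}, {}
    for kind, text in statements:
        if kind == "embSQL":
            m = re.search(r"SELECT\s+(.*?)\s+INTO\s+(.*?)\s+FROM", text)
            names = [c.strip() for c in m.group(1).split(",")]
            hosts = [h.strip().lstrip(":") for h in m.group(2).split(",")]
            column.update(zip(hosts, names))       # aValue -> A, cValue -> C
        elif kind == "print":
            m = re.search(r'printf\("([^"%]*)%s\s*",\s*(\w+)\)', text)
            if m:                                  # format text names the variable
                meaning[m.group(2)] = m.group(1).strip()
    return meaning, column

meaning, column = analyze(STATEMENTS)
print(meaning)  # {'aValue': 'Task Start Date', 'cValue': 'Task Finish Date'}
print(column)   # {'aValue': 'A', 'cValue': 'C'}
```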
Step 4: Discovering Inclusion Dependencies

After extraction of the relational schema in step 2, the goal of step 4 is to identify constraints that help classify the extracted relations, which represent both real-world entities and the relationships among them. This is done using inclusion dependencies (INDs), which indicate the existence of inter-relational constraints, including class/subclass relationships.

Let A and B be two relations, and let X and Y be attributes or sets of attributes of A and B respectively. An inclusion dependency A.X << B.Y denotes that the set of values appearing in A.X is a subset of the values appearing in B.Y. Inclusion dependencies are discovered by examining all possible subset relationships between any two relations A and B in the legacy source. Without additional input from the domain expert, inclusion dependencies can be identified in an exhaustive manner as follows: for each pair of relations A and B in the legacy source schema, compare the values for each non-key attribute combination X in B with the values of each candidate key attribute combination Y in A (note that X and Y may be single attributes). An inclusion dependency B.X << A.Y is recorded whenever the values of B.X form a subset of the values of A.Y. In our example, the last two inclusion dependencies so discovered are removed, since they are implicitly contained in the inclusion dependencies listed in lines 2, 3 and 4 by the transitivity relationship.

Step 5: Classification of the Relations

When reverse-engineering a relational schema, it is important to understand that, due to the limited expressiveness of the relational model, all real-world entities are represented as relations irrespective of their types and roles in the model. The goal of this step is to identify the different "types" of relations, some of which correspond to actual real-world entities while others represent relationships among them. In this step, all the relations in the database are classified into one of four types: strong, regular, weak, or specific. This is done using the primary key information obtained in step 2 and the inclusion dependencies from step 4. Intuitively, a strong entity-relation represents a real-world entity whose members can be identified exclusively through its own properties. A weak entity-relation represents an entity whose members can be identified only through their relationships with other (strong) entities.
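The exhaustive IND discovery of step 4 can be sketched as follows. This is our illustration; the relation instances and candidate keys are made up, and the subset test mirrors the comparison of non-key attributes of B against candidate keys of A described above.

```python
from itertools import product

# Illustrative instance data (not from the paper): relation -> {attribute: values}.
RELATIONS = {
    "MSP_Project": {"PROJ_ID": [1, 2, 3]},
    "MSP_Tasks":   {"PROJ_ID": [1, 2], "TASK_UID": [10, 11]},
}
# Candidate keys as they would come out of step 2 (also illustrative).
CANDIDATE_KEYS = {"MSP_Project": ["PROJ_ID"], "MSP_Tasks": ["TASK_UID"]}

def find_inds(relations, candidate_keys):
    """Exhaustive IND discovery: record B.X << A.Y whenever the values of a
    non-key attribute X of B form a subset of candidate key Y of A."""
    inds = []
    for a, b in product(relations, repeat=2):
        if a == b:
            continue
        for y in candidate_keys.get(a, []):
            for x in relations[b]:
                if x in candidate_keys.get(b, []):
                    continue  # only non-key attributes of B are compared
                if set(relations[b][x]) <= set(relations[a][y]):
                    inds.append((b, x, a, y))  # read as B.X << A.Y
    return inds

print(find_inds(RELATIONS, CANDIDATE_KEYS))
# [('MSP_Tasks', 'PROJ_ID', 'MSP_Project', 'PROJ_ID')]
```

The quadratic pairing over relations makes the cost of skipping domain-expert input explicit, which is why SEEK lets the user prune the candidate set beforehand.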