A Principled Approach to Data Integration and Reconciliation in
                      Data Warehousing

                               Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini,
                                             Daniele Nardi, Riccardo Rosati
                                       Dipartimento di Informatica e Sistemistica,
                                           Università di Roma “La Sapienza”
                                           Via Salaria 113, 00198 Roma, Italy
                                           lastname@dis.uniroma1.it
                                  http://www.dis.uniroma1.it/˜lastname


                                                                             interest. The typical architecture of an integration systems
                                                                             is described in terms of two types of modules: wrappers
                                Abstract                                     and mediators [Wie92, Ull97]. The goal of a wrapper is
                                                                             to access a source, extract the relevant data, and present
      Integration is one of the most important aspects                       such data in a specified format. The role of a mediator is to
      of a Data Warehouse. When data passes from the                         merge data produced by different wrappers (or mediators),
      sources of the application-oriented operational en-                    so as to meet a specific information need of the integra-
      vironment to the Data Warehouse, possible incon-                       tion system. The specification and the realization of me-
      sistencies and redundancies should be resolved, so                     diators is the core problem in the design of an integration
      that the warehouse is able to provide an integrated                    system. This problem has recently become a central issue
      and reconciled view of data of the organization.                       in several contexts, including multi-database systems, Data
      We describe a novel approach to data integration                       Warehousing and information gathering from the Web.
      and reconciliation, based on a conceptual repre-                          The constraints that are typical of Data Warehouse ap-
      sentation of the Data Warehouse application do-                        plications restrict the large spectrum of approaches that are
      main. The main idea is to declaratively specify                        being proposed [Hul97, Inm96, JLVV99]. First, while the
      suitable matching, conversion, and reconciliation                      sources on the Web are often external, in a Data Warehouse
      operations to be used in order to solve possibile                      they are mostly internal to the organization. Second, a Data
      conflicts among data in different sources. Such a                      Warehouse should reflect the informational needs of an or-
      specification is provided in terms of the concep-                      ganization, and should therefore be defined in terms of a
      tual model of the application, and is effectively                      global, corporate view of data. Third, such a view should
      used during the design of the software modules                         be provided in terms of conceptual representation mecha-
      that load the data from the sources into the Data                      nism that is able to abstract from the physical and logical
      Warehouse.                                                             organization of data in the sources. It follows that the need
                                                                             and requirements for maintaining an integrated, conceptual
1 Introduction                                                               view of the corporate data in the organization are stronger
                                                                             with respect to other contexts. A direct consequence of this
Information Integration is the problem of acquiring data
                                                                             fact is that the data in the sources and in the Data Ware-
from a set of sources that are available for the application of
                                                                             house should be defined in terms of the conceptual model,
The copyright of this paper belongs to the paper’s authors. Permission to    and not the other way around. In other words, data in-
copy without fee all or part of this material is granted provided that the   tegration in Data Warehousing should follow the local as
copies are not made or distributed for direct commercial advantage.          view approach, where each table in a source and in the Data
Proceedings of the International Workshop on Design and                      Warehouse is defined as a view of a global model of the cor-
Management of Data Warehouses (DMDW’99)                                      porate data. On the contrary, the global as view approach
Heidelberg, Germany, 14. - 15.6. 1999                                        requires, for each information need, to specify the corre-
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.)                      sponding query in terms of the data at the sources, and is
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/        therefore suited when no global, integrated view of the data


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                                       16-1
of the organization is available.                                 browsing through a meta-level description of the relations
   The above considerations motivate the approach to in-          of the sources. In addition, it generally provides both for
formation integration proposed in [CDGL+ 98d], whose              automatic code generators and for the possibility of attach-
distinguishing feature is to exploit the possibility of rep-      ing procedures to accomplish ad hoc transformations and
resenting the conceptual level of a Data Warehouse in a           filtering of the data. Even though there are powerful and
very expressive language and use reasoning tools to support       effective environments with the above features, their nature
the Data Warehouse construction, maintenance and evolu-           is inherently procedural and close to the notion of global as
tion. In fact, the idea is to balance the effort of building a    view, where the task of relating the sources with the Data
conceptual model of the Data Warehouse by improving the           Warehouse is done on a query-by-query basis.
capabilities of the system in maintaining the Data Ware-              Several recent research contributions address the same
house and support the incremental addition of information         problem from a more formal perspective [HGMW+ 95,
sources. The proposed approach follows a local as view            Wid95, GM95, HZ96, ZHK96, ZHKF95, PGMW95,
paradigm, by explicitly requiring an enterprise conceptual        GMS94]. For example, a methodology for extracting, com-
model which is therefore regarded as a unified view of the        paring and matching data and objects located in different
data available within the organization.                           sources is described in [PGMW95]. The methodology is
   Most of the work on integration has been concerned             based on the Object Exchange Model, which requires the
with the intensional/schema level, while less attention has       explicit semantic labeling of the objects, to support object
been devoted to the problem of data integration at the exten-     exchange, and emphasizes the need for a tight interaction
sional level. Integration of data is, nonetheless, at the heart   between the system and the user. However, the method
of Data Warehousing [Inm96]. When data passes from the            remains of procedural nature, since it requires the user to
application-oriented operational environment to the Ware-         build and maintain the relationship between the sources and
house, possible inconsistencies and redundancies should be        the Data Warehouse on a query-by-query basis.
resolved, so that the Warehouse is able to provide an in-             The approach proposed in [GMS94] is more declarative
tegrated and reconciled view of data of the organization.         in nature. Suitable data structures for reconciling different
Thus, in the context of a Data Warehouse, data integration        representations of the same data are represented in a con-
and reconciliation is the process of acquiring data from the      text theory, which is used by the system to transform the
sources and making them available within the Warehouse.           queries as appropriate for gathering the data from the vari-
   Given a request for data (e.g., for materializing a new        ous sources. In such a declarative approach, the user is not
relation in the Data Warehouse), which is formulated in           directly concerned with the identification and resolution of
terms of the global view of the corporate data, (i.e., not        semantic conflicts when formulating the requests for data.
the language of the sources, but of the enterprise), there are    Rather, once the specification of the sources is available,
several steps that enable for the acquisition of data from the    conflicts are detected by the system, and conversion and
sources:                                                          filtering are automatically enforced. However, the method
                                                                  still follows the global-as-view approach, and the context
 1. Identification of the sources where the relevant infor-
                                                                  theory is used as a description of reconciled data structures,
    mation resides. Note that this task is typical of the
                                                                  rather than as the conceptual model of the corporate data.
    local-as-view approach, and requires algorithms that
                                                                      In this paper we present the approach to data integra-
    are generally both sophisticated and costly [AD98,
                                                                  tion and reconciliation proposed within the DWQ (Data
    LMSS95].
                                                                  Warehouse Quality) project [JJQV98, JLVV99]. In DWQ,
 2. Decomposition of the user request into queries to in-         the ultimate goal of source integration and data reconcil-
    dividual sources that would return the data of interest.      iation is to represent the migration of the data from the
                                                                  sources to the Data Warehouse, in order to support the de-
 3. Interpretation of the data provided by a source. In-          sign of materialized views that meet user requirements, and
    terpreting data can be regarded as the task of casting        have high quality with respect to correctness, interpretabil-
    them into a common representation, which can there-           ity, usefulness, believability. The method for data integra-
    after be used to manipulate the data.                         tion and reconciliation builds upon and extends the work
                                                                  in [CDGL+ 98d], therefore relying on the availability of a
 4. Merging of the data. The data returned by various
                                                                  Conceptual Model to declaratively represent the relation-
    sources need to be combined to provide the Data
                                                                  ship between the sources and the Data Warehouse. The
    Warehouse with the requested information.
                                                                  declarative approach is further pursued in the task of data
   In commercial environments for Data Warehouse design           integration and reconciliation, where the system is given a
and management the above tasks are taken care of through          declarative description of the data in the sources and pro-
ad-hoc components [JLVV99]. In general, such a compo-             vides automatic support in satisfying the data requests for
nent provides the user with the capability of specifying the      populating the Data Warehouse.
mapping between the sources and the Data Warehouse by                 Compared with the existing proposals mentioned above,


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                             16-2
                                                               Data Warehouse
                                                                   Schema                             Mediators            Data Warehouse
                                                                                                                                Store

                   Conceptual Model


                                                          Source                    Source
                                                         Schema1                Scheman             Wrappers


                     conceptual level                              logical level                                     physical level


                      conceptual/logical mapping
                                                                                                                     Sources
                      physical/logical mapping                                                           Source                    Source
                                                                                                                        
                      data flow                                                                        Data Store1               Data Storen


                                                   Figure 1: Architecture for Data Integration

the novelty of our approach stems from the following fea-                                   The conceptual level contains a conceptual represen-
tures:                                                                                       tation of the corporate data.
     It relies on the Conceptual Model of the corporate                                    The logical level contains a representation in terms of
      data, which is expressed in an Entity-Relationship for-                                a logical data model of the sources and of the data
      malism.                                                                                materialized in the Data Warehouse.
     It follows the local-as-view paradigm.                                                The physical level contains a store for the material-
     It allows the designer to declaratively specify sev-                                   ized data, wrappers for the sources and mediators for
      eral types of correspondences between data in differ-                                  loading the materialized data store.
      ent schemas (either source schemas or Data Ware-
                                                                                     The relationship between the conceptual and the logical,
      house schema). Three types of Interschema Corre-
                                                                                     and between the logical and the physical level is repre-
      spondences are taken into account, namely Conver-
                                                                                     sented explicitly by specifying mappings between corre-
      sion, Matching, and Reconciliation Correspondences.
                                                                                     sponding objects of the different levels.
     It uses such correspondences for supporting the task                               We briefly describe the conceptual and logical levels,
      of specifying the correct mediators for the loading of                         referring to the abstract architecture of DWQ as depicted
      the materialized views of the Data Warehouse.                                  in Figure 1.
                                                                                         The Conceptual Model is a conceptual representation
Our methodology relies on a novel query rewriting algo-
                                                                                     of the data managed by the enterprise, including a con-
rithm, whose role is to reformulate the query that defines
                                                                                     ceptual representation of the data residing in sources, and
the view to materialize in terms of both the source relations
                                                                                     of the global concepts and relationships that are of in-
and the interschema correspondences.
                                                                                     terest to the Data Warehouse application. The concep-
    The paper is organized as follows. In Section 2, we sum-
                                                                                     tual model is expressed in terms of an enriched Entity-
marize the relevant features of the proposed approach to
                                                                                     Relationship model in which complex entity and relation-
information integration. Section 3 illustrates the method
                                                                                     ship expressions can be constructed and used, and in which
we use to describe the content of the sources at the logical
                                                                                     interdependencies between elements of different sources
level. Section 4 is devoted to a discussion of the mean-
                                                                                     and of the enterprise are captured using intermodel asser-
                                                                                     tions [CDGL+ 98b, CL93]. Intermodel assertions provide a
ing and the role of interschema correspondences. Section 5
describes the query rewriting algorithm at the basis of our
                                                                                     simple and effective declarative mechanism to express the
approach to the design of mediators. Section 6 concludes
                                                                                     dependencies that hold between entities (i.e. classes and re-
the paper.
                                                                                     lationships) in different models [Hul97]. The use of inter-
                                                                                     model assertions allows for an incremental approach to the
2 The DWQ framework                                                                  integration of the conceptual models of the sources and of
In this section we briefly describe the general framework                            the enterprise. Due to space limitations, we cannot con-
adopted in the DWQ project [CDGL+ 98d]. The proposed                                 sider this aspect in further detail, and refer the interested
framework allows one to explicitly model data and infor-                             reader to [CDGL+ 98b].
mation needs – i.e., a specification of the data that the                                The conceptual representation contains, besides entities
Data Warehouse provides to the user – at various lev-                                and relationships, also a description of domains, which are
els [CDGL+ 98b, CDGL+ 98d, CDGL+ 98c]:                                               used to typify the attributes of entities and relationships.


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                                                 16-3
Rather than considering only concrete domains, such as                      DeptNameString v String
strings, integers, and reals, our approach is based on the                       SSNString v String
use of abstract domains. An abstract domain may have                       DeptCodeInteger v Integer
                                                                             ServNoInteger v Integer
an underlying concrete domain, but allows the designer to
distinguish between the different meanings that a value of
the concrete domain may have. Additionally, also Boolean
combinations of domains and the possibility to construct an        At the logical level, the logical content of each source,
ISA hierarchy between domains are supported.                   called the Source Schema (see Section 3), is provided in
                                                               terms of a set of relational tables using the relational model.
Example 1 Consider two attributes A1 in a source and
                                                               The link between the logical representation and the concep-
A2 in the Data Warehouse, both representing amounts of         tual representation of the source is formally defined by as-
money. Rather than specifying that both attributes have val-
                                                               sociating with each table a query that describes its content
ues of type real, the designer may specify that the domain
                                                               in terms of a query over the Conceptual Model. In other
of attribute A1 is MoneyInLire while the domain of at-
                                                               words, the logical content of a source table is described in
tribute A2 is MoneyInEuro, both of which have real
                                                               terms of a view over the Conceptual Model. To map phys-
as the underlying concrete domain. In this way, it be-
                                                               ical structures to logical structures we make use of suit-
comes possible to specify declaratively the difference be-
                                                               able wrappers, which encapsulate the sources. The wrapper
tween values of the two attributes, and take into account
                                                               hides how the source actually stores its data, the data model
such knowledge for loading data from the source to the
                                                               it adopts, etc., and presents the source as a set of relational
Data Warehouse.
                                                               tables. In particular, we assume that all attributes in the
   We provide an example of the form of the Conceptual         tables are of interest to the Data Warehouse application (at-
Model, and refer to [CDGL+ 98b] for a more detailed de-        tributes that are not of interest are hidden by the wrapper).
scription of the adopted formalism.                            The logical content of the materialized views constituting
                                                               the Data Warehouse, called the Data Warehouse Schema
Example 2 As our running example we consider an en-            (see Section 4), is provided in terms of a set of relational
terprise and two sources containing information about con-     tables. Similarly to the case of the sources, each table of
tracts between customers and departments for services, and     the Data Warehouse Schema is described in terms of a view
about registration of customers at departments. Source 1       over the Conceptual Model. As we said before, the way in
contains information about customers registered at public-     which a view is actually materialized, starting from the data
relations departments. Source 2 contains information about     in the sources, is specified by means of mediators.
contracts and complete information about services. Such            In such a framework, we have devised suitable inference
situation can be represented by means of the ER diagrams       techniques, which allow for carrying out several reasoning
shown in Figure 2, together with the following intermodel      services on both the conceptual representation, such as in-
assertions (v represents ISA while  represents equiva-        ferring inclusion between entities and relationships, satis-
lence):                                                        fiability of entities, etc. [CDGL+ 98d], and the logical rep-
                                                               resentation, such as query containment [CDGL98a], which
                                                               is at the basis of query rewriting. The possibilities offered
       Department1     PrDept0                                by such reasoning tools are used in the accomplishment of
            REG-AT1   v REG-AT0                                several activities concerning both the design and the oper-
          Customer1  Customer0 u                              ation of the Data Warehouse.
                    ( 1 [$1](REG-AT0 u ($2: PrDept0 )))
       Customer0 u                                             3 Source schema description
( 1 [$1]CONTRACT0 ) v ( 1 [$1]PROMOTION1 )                   In this section we focus on the specification of the logical
          Customer2 v Customer0 u                              schemas of the sources. Such schemas are intended to pro-
                           ( 1 [$1]CONTRACT0 )                vide a structural description of the content of the sources,
       Department2 v Department0                               which are encapsulated by suitable wrappers.
           Service2  Service0                                    We describe a source S by associating to each relational
          CONTRACT2 v CONTRACT0
                                                               table T of S an adorned query that is constituted by a head,
                                                               a body, and an adornment:
          Customer1  Customer0
       Department1  Department0                                    The head defines the relational schema of the table in
                                                                     terms of a name, and the number of columns.
and the following domain hierarchy:
                                                                    The body describes the content of the table in terms of
             PersNameString           v String                       a query over the Conceptual Model.


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                          16-4
                                                                                          SSN=SSNString                                  Name=DeptNameString

                                                                                                          1                      2
                               ServNo=ServNoInteger                                     Customer1               REG-AT1              Department1

    Name=PersNameString       Service0
          SSN=SSNString                                                                     1                         2
                                  2
                      1                           3                                                 PROMOTION1
  Customer0                   CONTRACT0                   Name=DeptNameString
                                                           Code=DeptCodeInteger
    DOB=Date
                      1                           2                                                              ServNo=ServNoInteger
                               REG-AT0                Department0
                                                                                                                Service2
                                                                                         Name=PersNameString                             Name=DeptNameString
                                                                                                                    2
                                                        PrDept0                                           1                      3
                                                                                        Customer2              CONTRACT2             Department2

                                                                                         DOB=Date


                                           Figure 2: Conceptual model of the application of Example 2

     The adornment declares the domains of the columns                           over the elements of the Conceptual Model. We need to
      of the table, and also which are the attributes of the ta-                  make it explicit how the objects of the conceptual repre-
      ble that are used to identify an entity of the Conceptual                   sentation are coded into values of the logical representa-
      Model.                                                                      tion. The notion of adorned query is introduced exactly for
                                                                                  this purpose.
   We now present in detail the notions introduced above.                            An adorned query is an expression of the form
3.1 Query over the Conceptual Model                                                             T (~x)         q(~x; ~y) j 1 ; : : : ;   n

Generally speaking, the connection to the Conceptual
Model is established by defining each table as a relational
                                                                                  where T is the name of the relational table, ~ are its at- x
                                                                                  tributes (observe that attributes denote values and not ob-
query over the elements of the Conceptual Model.
  A query q for a Conceptual Model M is a non-recursive
                                                                                          (x y)
                                                                                  jects), q ~ ; ~ is a query as defined above, and each i is
Datalog query, written in the form:
                                                                                                                                     x
                                                                                  an annotation on variables appearing in ~ . In particular:

      q(~x)        (x y ) OR OR conj (~x; ~y )
                    conj 1 ~ ; ~ 1                         m         m
                                                                                                          x
                                                                                   1. For each X 2 ~ , we have an annotation of the form

where each conj (~x; ~y ) is a conjunction of atoms or
                                                                                                                    X :: V
negated atoms, and ~x; ~y are all the variables appearing in
                          i       i

                                                                                       where V is a domain expression. Such an annotation
the conjunct. Each atom is either of the forms E (t) or of
                                  i

                                                                                       is used to specify how values bound to X are repre-
the form R(~t), where ~t, t, and t are variables in ~x; ~y or
                                              0


constants, and E and R, and entities and relationships of
                                                                            i          sented in the table at the logical level. For example,
                                                                                       which currency is used for a real value denoting an
M   respectively.                                                                      amount of money.
   The semantics of queries is as follows. Given an inter-
pretation I of a Conceptual Model M with interpretation                                                                   z x
                                                                                   2. For each tuple of variables ~  ~ that is used for iden-
             
domain I , a query q of arity n is interpreted as the set                                                                 y
                                                                                      tifying in T an object Y 2 ~ mentioned in q ~ ; ~ , we     (x y)
                     (                )
qI of n-tuples d1 ; : : : ; dn , with each di 2 I , such that,                       have an annotation of the form
when substituting each di for xi , the formula
                                                                                                                          ([z] )
                                                                                                               identify ~ ; Y
      9~y1 .conj 1 (~x; ~y1 ) OR    OR 9~y .conj (~x; ~y )
                                                      m        m        m
                                                                                       For example, the designer may assert that the
evaluates to true in I .                                                               attributes first name,    last name,      and
   The fact that a relation in a source is defined in terms                            date of birth in a table are used to identify
of a query over the Conceptual Model confirms that we are                              students.
following the local-as-view approach: each table is seen as
a view of the virtual database represented by the Concep-                            We point out that our method is able to cope with
tual Model.                                                                       several schematic differences that may be present in the
                                                                                  sources [SK92]. We illustrate this point with the help of
3.2 Adornment                                                                     an example.

To make the connection to the Conceptual Model precise,                           Example 3 Suppose that the Conceptual Model contains
it is not sufficient to define each table as a relational query                   a relationship Service with three attributes, Date,


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                                                        16-5
ServiceNo, and Price, where Service D; S; P            (        )   Example 2 (cont.) Assuming that in Source 1 a customer
means that at the date D the service S costs P Euro. Sup-           is actually identified by its social security number, and a
pose that Source S1 represents the same kind of informa-            department by its name, we can specify the relational table
tion only on Services v1 and v2 , by means of two tables:           TABLE1 by the following adorned query:
                      (        )
v1 and v2, where v1 D; P means that service v1 costs P
                                   (    )
Italian Lira at date D, and v2 D; P means that service v2
                                                                    TABLE1 (S; M; P )
                                                                                          :
                                                                      REG-AT1 (X; D); PROMOTION1 (X; D); P = false;
costs P Italian Lira at date D. Suppose that Source S2 rep-           SSN(X; S ); Name(D; M )
resents the same kind of information only on Services v3              OR
and v4 by means of a table Serv, where Serv X; Y; D    (        )     PROMOTION1 (X; D); P = true; SSN(X; S );
means that services v3 and v4 cost X and Y Euro respec-               Name1 (D; M )   j
tively at date D. Finally, suppose that Source S3 represents          identify ([S ]; X ); S :: SSNString;
the information only for a certain date d by means of an-             identify ([M ]; D); M :: DeptNameString;
other table Serv3 . The various tables in the three sources           P :: Boolean

can be specified by means of the following adorned queries:
                                                                    Additionally, we assume that in Source 2 the actual data
                                                                    can be described in terms of a relational table TABLE2 with
      v1(D; P )           Service(D;0 v10 ; P )   j                 four columns, two for the customer, one for the service the
                          P :: ItalianLira; D :: Date
                                                                    customer has registered, and one for the department. As
      v2(D; P )           Service(D;0 v20 ; P )   j                 in Source 1, in Source 2 departments are still identified
                          P :: ItalianLira; D :: Date
                                                                    by their name, but, differently from Source 1, customers
 Serv(X; Y; D)            Service(D;0 v30 ; X );                    are identified by their name and date of birth. Services are
                          Service(D;0 v40 ; Y )    j                identified by a unique service number. Hence the following
                          X :: Euro; Y :: Euro; D :: Date
                                                                    adorned query is used to specify TABLE2 :
  Serv3 (S 1; P )         Service(d; S 1; P );
                          Code(S; S 1)   j                           TABLE2 (N; B; I ; M )
                                                                       CONTRACT2 (X; S; D); Name(X; N ); ServNo(S; I );
                          P :: Euro;

                          identify ([S 1]; S ); S 1 :: String          Name(D; M )    j
                                                                       identify ([N; B ]; X ); N :: PersNameString;
                                                                                                 B :: Date

                                                                       identify ([I ]; S ); I :: ServNoInteger;
                                                                       identify ([M ]; D); M :: DeptNameString
    The above example illustrates a case where there are
various schematic differences, both among the sources, and
between the sources and the Conceptual Model. The mech-
anisms used in our methodology for specifying adorned
                                                                    4 Interschema Correspondences
queries is able to cope with such differences.                      We now describe how to define Interschema Correspon-
    The adorned query associated to a table in a source con-        dences, which are used to declaratively specify the cor-
tains a lot of information that can be profitably used in an-       respondences between data in different schemas (either
alyzing the quality of the Data Warehouse design process.           source schemas or data warehouse schema).
Indeed, the adorned query precisely formalizes the content             In our approach, Interschema Correspondences are de-
of a source table in terms of a query over the Conceptual           fined in terms of relational tables, similarly to the case of
Model, the domains of each attribute of the table, and the          the relations describing the sources at the logical level. The
attributes used to identify entities at the conceptual level.       difference with source relations is that we conceive inter-
One important check that we can carry out over the logical          schema correspondences as non-materialized relational ta-
specification of a source is whether the adorned query asso-        bles, in the sense that their extension is computed by an
ciated with a table in a source is consistent or not. Let Q be      associated program whenever it is needed. It follows that,
an adorned query and let B be its body. The query B is said         to each interschema correspondence, we associate a head,
to be inconsistent with respect to the Conceptual Model M,          a body, and an adornment. Differently from the case of
if for every database DB coherent with M, the evaluation            a source relation, the adorment specifies which is the pro-
of B with respect to DB is empty. An adorned query Q is             gram that is able to compute the extension of the virtual
inconsistent with respect to the Conceptual Model M ei-             table.
ther because the body B of Q is inconsistent with respect              We distinguish among three types of correspondences,
to M, or because the annotations are incoherent with re-            namely Conversion, Matching, and Reconciliation Corre-
spect to what specified in M. The inference techniques de-          spondences.
scribed in [CDGL+ 98d] allow us to check the consistency               Conversion Correspondences are used to specify that
of the relational tables defined for describing a source.           data in one source can be converted into data of a different


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                               16-6
source or of the data warehouse, and how this conversion                   versions are performed by the programs associated to the
is performed. They are used to anticipate several types of                 Conversion Correspondences.
data conflicts that may occur in loading data.                                Reconciliation Correspondences are used to assert how
   As an example, suppose that in a table of a source costs                we can reconcile data in different sources into data of
are represented in Italian Lira, while in a table of the Data              the data warehouse. A Reconciliation Correspondence
Warehouse we want to express them in Euro. Then, in or-                    reconcile has the following form:
der to use the source table in the rewriting of a query that
defines the Data Warehouse table, it is necessary to know                  reconcile   ([~x1 ]; : : : ; [~x ]; [~z])          (x          x z w)
                                                                                                                    conj ~ ; : : : ; ~ k ;~ ; ~
                                                                                                                               (x          x z w)
                                                                                                          k

about the possibility of converting each amount in Italian                                               through program ~ 1 ; : : : ; ~ k ;~ ; ~
Lira into an amount in Euro.
   A Conversion Correspondence convert has the follow-                     where conj specifies the conditions under which the recon-
ing form:                                                                  ciliation is applicable, and program is a predicate that we
                                                                           assume associated to a program that performs the reconcil-
    convert     ([~x]; [~y])              (x y z)
                                   conj ~ ; ~ ;~                           iation. Such correspondence specifies that the k tuples of
                                                        (x y z)
                                   through program ~ ; ~ ;~                       x            x
                                                                           values ~ 1 ; : : : ; ~ k coming from the sources are reconciled
where conj is a conjunctive query, which specifies the
                                                                                           z
                                                                           to the tuple ~ in the Data Warehouse. Therefore, the as-
                                                                           sociated program receives as input k tuples of values (and
conditions under which the conversion is applicable, and
                                                                           possibly the additional parameters in the condition) and re-
program is a predicate that we assume associated to a pro-
                                                                           turns a reconciled tuple.
gram that performs the conversion. In general, the program
                                                                               Again, a Reconciliation Correspondence could simply
needs to take into account the additional parameters spec-
                                                                           be defined as a combination of appropriate Matching and
ified in the condition to actually perform the conversion.
                                                                           Conversion Correspondences, e.g.,
The conversion has a direction. In particular, it operates
from a tuple of values satisfying the conditions specified
   x
for ~ in conj to a tuple of values satisfying the conditions
                                                                                             ([x] [y] [z])
                                                                                reconcile ~ ; ~ ; ~
                 y
specified for ~ . This means that the conversion program
                                                                                                ([x] [w ])
                                                                                   convert 1 ~ ; ~ 1 ; convert 2 ~ ; ~ 2      ([y] [w ]);
                                  x
receives as input a tuple ~ , and returns the corresponding
                                                                                              ([w ] [w ])
                                                                                   match 1 ~ 1 ; ~ 2 ; convert 3 ~ 1 ; ~      ([w ] [z]);
       y
tuple ~ , possibly using the additional parameter ~ to per-   z                            (x y w w z)
                                                                                   conj ~ ; ~ ; ~ 1 ; ~ 2 ;~
                                                                                   through none
form the conversion.
    Matching Correspondences are used to specify how data                      In practice, several of the Interschema Correspondences
in different sources can match. A Matching Correspon-                      that must be specified will have a very simple form, since
dence match has the following form:                                        they will correspond simply to equality in the case of a
match   ([~x1 ]; : : : ; [~x ])            (x        x z)
                                      conj ~ 1 ; : : : ; ~ k ;~
                                                                           matching and to identity in the case of a conversion. There-
                                                       (x         x z)
                          k

                                      through program ~ ; : : : ; ~ k ;~
                                                                           fore, in order to simplify the task of the designer in speci-
                                                                           fying the various interschema correspondences, we assume
where conj specifies the conditions under which the match-                 that several correspondences are automatically asserted by
ing is applicable, and program is a predicate that we as-                  default by the system. In particular, for each domain D
sume associated to a program that performs the matching.                   in the conceptual model, the following Interschema Corre-
The program receives as input k tuples of values satisfying                spondences are specified by default:
the conditions (and possibly the additional parameters in
the condition) and returns whether they match or not.                            convert       ([X ]; [Y ])            D ( X ); D ( Y )
   Note that already specified Interschema Correspon-                                                                                 ( )
                                                                                                                       through identity X; Y
dences may be used to define new ones. As an example,                              match       ([X ]; [Y ])            D(X ); D(Y ); X = Y
the designer may want to define a Matching Correspon-                                                                  through none
dence between two tuples by using two already defined
                                                                           reconcile   ([X ]; [Y ]; [Z ])              D(X ); D(Y ); D(Z );
Conversion Correspondences, which convert to a common
                                                                                                                       X=Y
                                                                                                                       through identity (X; Z )
representation, and then by using equality. In this case,
he could provide the following definition of the Matching
Correspondence:                                                            where identity is the program that computes the identity
match   ([~x]; [~y])                  ([x] [z]); convert 2([~y]; [~z]);
                            convert 1 ~ ; ~                                function for values of domain D, and the matching corre-
                                  (x y z w)
                            conj ~ ; ~ ;~ ; ~                              spondence has no associated program.
                                                                              The system allows the designer to inhibit the default
                            through none
                                                                           correspondences for a certain domain, simply by provid-
Observe that, in this case, the program associated to the                  ing an alternative interschema correspondence referring to
Matching Correspondence is empty, since the actual con-                    that domain.


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                                                16-7
   Moreover, we assume that for each Conversion Cor-
respondence convert i asserted by the designer, the sys-                and the Reconciliation Correspondences
tem automatically asserts a new Matching Correspondence
match i as follows:                                                                      reconcile 1;1 ([N; D]; [S1 ]; [S2 ])
                                                                                         reconcile 3;2 ([M1 ]; [M2 ]; [C ])
                ([x] [y])
      match i ~ ; ~                         ([x] [z]); ~y = ~z
                                  convert i ~ ; ~
                                  through none
Moreover, for each Conversion Correspondence convert i                  5 Specification of mediators
asserted by the designer and for each Matching Correspon-               As we said in the introduction, the problem of data inte-
dence match j asserted by the designer or by default, the               gration and reconciliation is crucial for the task of design-
system automatically asserts a new Reconciliation Corre-                ing the mediators that load the data in the Data Warehouse.
spondence reconcile i;j as follows:                                     Such a task aims at specifying, for every relation in the Data
     reconcile i;j   ([~x]; [~y]; [~z])             ([x] [y])
                                           match i ~ ; ~ ;
                                                                        Warehouse Schema, how the tuples of the relation should
                                                      ([x] [z])
                                           convert j ~ ; ~
                                                                        be constructed from a suitable set of tuples extracted from
                                                                        the sources.
                                           through none
                                                                           Suppose we have decided to materialize a new relation
Example 2 (cont.) The following Conversion Correspon-                   T in the Data Warehouse.1 Our goal is to support the de-
dence specifies that the name and date of birth of a person             signer in providing a formal specification for the design
can be converted into a Social Security Number through the              of the mediator used to extract the correct data from the
program name to SSN:                                                    sources, and to load such data in T . The methodology we
                                                                        propose is based on the following steps.
   convert 1 ([N; B ]; [S ])
     PersNameString(N ); Date(B ); DOB(N; B );                            1. We apply the method described in Section 3 to provide
     SSNString(S )                                                           the specification of the relation T . In other words, we
     through name to SSN(N; B; S )                                           specify T in terms of an adorned query
Moreover, we add the following Conversion Correspon-                                             q       q j c1 ; : : : ; c :
                                                                                                          0
                                                                                                                          n
dence, which represents the fact that a department name
can be converted into a department code through the pro-                      Note that the adorned query associated to a table in a
gram dept name to code:                                                       source is the result of a reverse engineering analysis of
                                                                              the source, whereas in this case the adorned query is a
  convert 2 ([M ]; [C ])                                                      specification of what we want to materialize in the ta-
    DeptNameString(M ); DeptCodeInteger(C )                                   ble of the Data Warehouse. Note also that we express
    through dept name to code(M; C )
                                                                              the semantics of T again in terms of the conceptual
                                                                              model. Not only the sources, but also the relations in
According to the above rules, the system asserts automat-
                                                                              the Data Warehouse are seen as views of such a con-
ically (among others) the Matching Correspondence and
                                                                              ceptual model.
Conversion Correspondences
                                                                          2. We look for a rewriting of q in terms of the queries
match 1 ([N; B ]; [S ])            convert 1 ([N; B ]; [S1 ]); S = S1        q1 ; : : : ; qs that correspond to the materialized views in
                                   through none                              the Data Warehouse. If a complete, equivalent rewrit-
                                                                             ing exists, then the new table can be derived from the
  match 2 ([M ]; [C ])             convert 2 ([M ]; [C1 ]); C = C1
                                   through none                              existing tables in the Data Warehouse. Otherwise, the
                                                                             algorithm is able to single out the part that cannot be
match 3 ([M1 ]; [M2 ])             DeptNameString(M1 );
                                                                             derived from the Data Warehouse, and that must be
                                   DeptNameString(M2 );
                                                                             loaded from the sources. In the following, q denotes
                                   M1 = M2

                                   through none                              such part.
convert 4 ([S1 ]; [S2 ])           SSNString(S1 );                        3. We look for a rewriting of q in terms of the queries cor-
                                   SSNString(S2 )                            responding to the tables in the sources. The rewriting
                                   through identity (S1 ; S2 )               aims at expressing the data in T in terms of a disjunc-
convert 5 ([P1 ]; [P2 ])           Boolean(P1 ); Boolean(P2 )                tion of conjunctive queries where each atom refers to
                                   through identity (P1 ; P2 )
                                                                                    a table in a source, or
 convert 6 ([I1 ]; [I2 ])          ServNoInteger(I1 );
                                   ServNoInteger(I2 )                       1 To see how DWQ addresses the issue of deciding what to materialize

                                   through identity (I1 ; I2 )          in the Data Warehouse, we refer to [TLS99].


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                                            16-8
            a matching, conversion, or reconciliation pred-            how to merge the various tuples denoted by tuple-
             icate defined in the Interschema Correspon-                spec1 ,. . . ,tuple-specn, and tuple-spect1 ,. . . ,tuple-
             dences.                                                    spectm denote the tuples in T resulting from the
                                                                        merging.
      In other words, we are trying to reformulate q in terms
      of the relations in the sources, and possibly in terms            We observe that the rewriting algorithm is able to gen-
      of the matching, conversion, and reconciliation predi-            erate one merging clause template for each pair of dis-
      cates. If there are different rewritings, then we choose          juncts that are not disjoint. Starting from such tem-
      the best rewriting r with respect to suitable quality pa-         plates, the designer may either specify the such that
      rameters. There are several criteria to be taken into             and the into parts, depending on the intended seman-
      consideration when evaluating the quality of a rewrit-            tics, or change the templates in order to specify a dif-
      ing, such as:                                                     ferent merging plan (for example for merging three
                                                                        disjuncts, rather than three pairs of disjuncts).
            Completeness of the rewriting. Obviously, the
             best situation is the one where the rewriting is
                                                                       The algorithm computes the maximally contained
             complete, in the sense that the rewritten query is
                                                                        rewriting (i.e., every other rewriting is included in the
             equivalent to the original query. Such a check
                                                                        one computed by the query), but we also want to in-
             can be done by exploiting the algorithm for
                                                                        form the designer whether such a rewriting is equiv-
             query containment.
                                                                        alent or not to the original query. Indeed, we have
            Accuracy, confidence, freshness, and availability          devised an effective method for checking equivalence
             of data in the source relations that the rewriting         between the original query and the computed rewrit-
             requires to access.                                        ing [CDGL98a].

   The resulting query is the specification for the design
                                                                        Besides the relational tables in the sources, our rewrit-
of the mediator associated to T . The most critical step
                                                                    

                                                                        ing algorithm takes into account the matching, con-
of the above method is the computation of the rewriting.
                                                                        version, and reconciliation predicates defined in the
Our rewriting algorithm is based on the method presented
                                                                        interschema correspondences.
in [DL97], modified to take into account the following as-
pects:
                                                                       Even when no rewriting exists for the query (i.e.,
     We deal with queries whose atoms refer to a concep-               when the maximally contained rewriting is empty), we
      tual model that includes ISA assertions and a limited             want to provide the designer with useful indications
      form of functional dependencies. Such constraints                 on whether there is a method for enriching the Inter-
      have to be considered in the computation of the rewrit-           schema Correspondences to get a non-empty rewrit-
      ing.                                                              ing. Indeed, our rewriting algorithm adopts a form
                                                                        of abductive reasoning that enables to single out the
     We deal with queries that are disjunctions of conjunc-            specification of which matching, conversion and rec-
      tions. It follows that the rewriting itself is in general         onciliation operations would allow to get a non-empty
      a disjunction, and therefore, we need to deal with the            rewriting. This indication can be profitably used by
      problem of merging the results of several queries. This           the designer to check whether she/he can add new In-
      problem is addressed by the notion of merging clause.             terschema Correspondences in order to make the com-
      In particular, if the query r computed by the rewriting           puted rewriting complete.
      is an or-query (i.e., it is constituted by more than one
      disjunct), then the algorithm associates to r a suitable
      set of so-called merging clauses, taking into account       Example 2 (cont.) Suppose we want to store in the Data
      that the answers to the different or-parts of the query     Warehouse a relation containing the information about cus-
      may contain objects and values that represent the same      tomers that have a contract for a certain service with a de-
      real world entity or the same value. A merging clause       partment at which they are also registered, or that are eli-
      is an expression of the form                                gible for a promotion. Independently from the fact that the
                                                                  customer has a contract, we want to include the information
         merging tuple-spec1 and    and tuple-specn            on whether he is eligible for a promotion. We can make use
         such that matching-condition                             of a relational table TDW with four components, defined by
         into tuple-spect1 and    and tuple-spectm             the following adorned query, where we have assumed that
                                                                  in the Data Warehouse we want to identify customers by
      where tuple-speci denotes a tuple returned by the           their SSN, services by their service number, and depart-
      i-th disjunct of r, matching-condition specifies            ments by their code:


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                               16-9
                                                                             SIGACT SIGMOD SIGART Sym. on Prin-
                                                                             ciples of Database Systems (PODS’98),
 TDW(S; I ; C; P )
                                                                             pages 254–265, 1998.
   CONTRACT0 (X; R; D); PROMOTION1 (X; D);
   SSN(X; S ); ServNo(R; I ); Code(D; C ); F = true
   OR                                                          [CDGL98a]     Diego Calvanese, Giuseppe De Giacomo,
                                                                             and Maurizio Lenzerini. On the decidability
   CONTRACT0 (X; R; D); REG-AT1 (X; D);
   :PROMOTION1 (X; D); SSN(X; S ); ServNo(R; I );                            of query containment under constraints. In
   Code(D; C ); P = false                                                    Proc. of the 17th ACM SIGACT SIGMOD
   OR                                                                        SIGART Sym. on Principles of Database
   PROMOTION1 (X; D); SSN(X; S ); Code(D; C );                               Systems (PODS’98), pages 149–158, 1998.
     = true; I = NULL        j
                                                               [CDGL+ 98b] Diego Calvanese, Giuseppe De Giacomo,
   P

   identify ([S ]; X ); S :: SSNString;
   identify ([I ]; R); I :: ServNoInteger;                                 Maurizio Lenzerini, Daniele Nardi, and
   identify ([C ]; D); C :: DeptCodeInteger;                               Riccardo Rosati. Description logic frame-
   P :: Boolean                                                            work for information integration. In Proc.
                                                                           of the 6th Int. Conf. on the Principles of
   Using the asserted and automatically derived Inter-                     Knowledge Representation and Reasoning
schema Correspondences, the system is able to rewrite the                  (KR’98), pages 2–13, 1998.
above query in terms of TABLE1 in Source 1 and TABLE2
                                                               [CDGL+ 98c] Diego Calvanese, Giuseppe De Giacomo,
in Source 2 (see Section 3) as follows:
                                                                           Maurizio Lenzerini, Daniele Nardi, and
   TDW(S0 ; I0 ; C; P0 )                                                   Riccardo Rosati. Schema and data integra-
     TABLE1 (S1 ; M1 ; P1 ); TABLE2 (N2 ; B; I2 ; M2 );                    tion methodology for dwq. Technical Re-
     reconcile 1;1 ([N2 ; B ]; [S1 ]; [S0 ]);                              port DWQ-UNIROMA-004, DWQ Consor-
     reconcile 3;2 ([M1 ]; [M2 ]; [C ]);                                   tium, September 1998.
     convert 5 ([P1 ]; [P0 ]); convert 6 ([I2 ]; [I0 ])
       OR                                                      [CDGL+ 98d] Diego Calvanese, Giuseppe De Giacomo,
                                          ^
       TABLE1 (S1 ; M1 ; P1 ); S1 = NULL P1 = true;                        Maurizio Lenzerini, Daniele Nardi, and
       convert 2 ([M1 ]; [C ]); convert 4 ([S1 ]; [S0 ]);                  Riccardo Rosati. Source integration in
       convert 5 ([P1 ]; [P0 ])                                            data warehousing. In Proc. of the 9th Int.
                                                                           Workshop on Database and Expert Systems
In this case the merging clause simply reduces to a disjunc-               Applications (DEXA’98), pages 192–197.
tion.                                                                      IEEE Computer Society Press, 1998.

                                                               [CL93]        Tiziana Catarci and Maurizio Lenzerini.
6 Conclusions                                                                Representing and using interschema knowl-
                                                                             edge in cooperative information systems. J.
We have described a new approach to data integration and
                                                                             of Intelligent and Cooperative Information
reconciliation in Data Warehousing. The approach is based
                                                                             Systems, 2(4):375–398, 1993.
on the availability of a Conceptual Model of the corpo-
rate data, and allows the designer to declaratively specify    [DL97]        Oliver M. Duschka and Alon Y. Levy. Re-
several types of correspondences between data in different                   cursive plans for information gathering. In
sources. Such correspondences are used by a query rewrit-                    Proc. of the 15th Int. Joint Conf. on Arti-
ing algorithm that supports the task of specifying the cor-                  ficial Intelligence (IJCAI’97), pages 778–
rect mediators for the loading of the materialized views of                  784, 1997.
the Data Warehouse.
   Based on the described methodology, we are currently        [GM95]        A. Gupta and I. S. Mumick. Maintenance of
implementing a design tool within the DWQ project. The                       materialized views: Problems, techniques,
tool is based on the Concept Base System [Jar92], and pro-                   and applications. IEEE Bulletin of the
vides support for both schema and data integration in Data                   Technical Committee on Data Engineering,
Warehousing.                                                                 18(2):3–18, 1995.

                                                               [GMS94]       Cheng Hian Goh, Stuart E. Madnick, and
References
                                                                             Michael Siegel. Context Interchange: Over-
[AD98]           Serge Abiteboul and Oliver Duschka. Com-                    coming the challenges of large-scale inter-
                 plexity of answering queries using materi-                  operable database systems in a dynamic en-
                 alized views. In Proc. of the 17th ACM                      vironment. In Proc. of the 3rd Int. Conf. on


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                     16-10
              Information and Knowledge Management            [SK92]     Amit Sheth and Vipul Kashyap. So far
              (CIKM’94), pages 337–346, 1994.                            (schematically) yet so near (semantically).
[HGMW+ 95] Joachim Hammer, Hector Garcia-Molina,
                                                                         In Proc. of the IFIP DS-5 Conf. on Seman-
                                                                         tics of Interoperable Database Systems. El-
           Jennifer Widom, Wilburt Labio, and Yue
                                                                         sevier Science Publishers (North-Holland),
           Zhuge. The Stanford data warehousing
                                                                         Amsterdam, 1992.
           project. IEEE Bulletin of the Technical
           Committee on Data Engineering, 18(2):41–           [TLS99]    Dimitri Theodoratos, Spyros Ligoudis-
           48, 1995.                                                     tianos, and Timos Sellis. Designing the
                                                                         global Data Warehouse with SPJ views. In
[Hul97]       Richard Hull. Managing semantic hetero-
                                                                         Proc. of the 11th Conf. on Advanced Infor-
              geneity in databases: A theoretical perspec-
                                                                         mation Systems Engineering (CAiSE’99),
              tive. In Proc. of the 16th ACM SIGACT
                                                                         1999.
              SIGMOD SIGART Sym. on Principles of
              Database Systems (PODS’97), 1997.               [Ull97]    Jeffrey D. Ullman.     Information inte-
[HZ96]        Richard Hull and Gang Zhou. A frame-                       gration using logical views.    In Proc.
              work for supporting data integration us-                   of the 6th Int. Conf. on Database The-
              ing the materialized and virtual approaches.               ory (ICDT’97), volume 1186 of Lecture
              In Proc. of the ACM SIGMOD Int. Conf.                      Notes in Computer Science, pages 19–40.
              on Management of Data, pages 481–492,                      Springer-Verlag, 1997.
              1996.                                           [Wid95]    Jennifer Widom. Special issue on materi-
[Inm96]       W. H. Inmon. Building the Data Ware-                       alized views and data warehousing. IEEE
              house. John Wiley & Sons, second edition,                  Bulletin on Data Engineering, 18(2), 1995.
              1996.                                           [Wie92]    Gio Wiederhold. Mediators in the architec-
[Jar92]       M. Jarke. Conceptbase V3.1 user man-                       ture of future information systems. IEEE
              ual.   Technical Report 92–17, Aach-                       Computer, 25(3):38–49, 1992.
              ener Informatik-Berichte, Aachen, Ger-
                                                              [ZHK96]    Gang Zhou, Richard Hull, and Roger King.
              many, 1992.
                                                                         Generating data integration mediators that
[JJQV98]      Matthias Jarke, Manfred A. Jeusfeld,                       use materializations. J. of Intelligent Infor-
              Christoph Quix, and Panos Vassiliadis.                     mation Systems, 6:199–221, 1996.
              Architecture and quality in data ware-
              houses. In Proc. of the 10th Conf. on           [ZHKF95]   Gang Zhou, Richard Hull, Roger King,
                                                                         and Jean-Claude Franchitti. Using object
              Advanced Information Systems Engineer-
              ing (CAiSE’98), volume 1413 of Lecture                     matching and materialization to integrate
              Notes in Computer Science, pages 93–113.                   heterogeneous databases. In Proc. of the
              Springer-Verlag, 1998.                                     3rd Int. Conf. on Cooperative Information
                                                                         Systems (CoopIS’95), pages 4–18, 1995.
[JLVV99]      Matthias Jarke, Maurizio Lenzerini, Yan-
              nis Vassiliou, and Panos Vassiliadis. Fun-
              damentals of Data Warehouses. Springer-
              Verlag, 1999. In Press.
[LMSS95]      Alon Y. Levy, Alberto O. Mendelzon,
              Yehoshua Sagiv, and Divesh Srivastava.
              Answering queries using views. In Proc. of
              the 14th ACM SIGACT SIGMOD SIGART
              Sym. on Principles of Database Systems
              (PODS’95), pages 95–104, 1995.
[PGMW95]      Yannis Papakonstantinou, Hector Garcia-
              Molina, and Jennifer Widom. Object ex-
              change across heterogeneous information
              sources. In Proc. of the 11th IEEE Int. Conf.
              on Data Engineering (ICDE’95), pages
              251–260, 1995.


D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati                                                  16-11