<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Principled Approach to Data Integration and Reconciliation in Data Warehousing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author"><string-name>Diego Calvanese</string-name></contrib>
        <contrib contrib-type="author"><string-name>Giuseppe De Giacomo</string-name></contrib>
        <contrib contrib-type="author"><string-name>Maurizio Lenzerini</string-name></contrib>
        <contrib contrib-type="author"><string-name>Daniele Nardi</string-name></contrib>
        <contrib contrib-type="author"><string-name>Riccardo Rosati</string-name></contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”</institution>
          ,
          <addr-line>Via Salaria 113, 00198 Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Integration is one of the most important aspects of a Data Warehouse. When data passes from the sources of the application-oriented operational environment to the Data Warehouse, possible inconsistencies and redundancies should be resolved, so that the warehouse is able to provide an integrated and reconciled view of data of the organization. We describe a novel approach to data integration and reconciliation, based on a conceptual representation of the Data Warehouse application domain. The main idea is to declaratively specify suitable matching, conversion, and reconciliation operations to be used in order to solve possible conflicts among data in different sources. Such a specification is provided in terms of the conceptual model of the application, and is effectively used during the design of the software modules that load the data from the sources into the Data Warehouse.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Information integration is the problem of acquiring
data from a set of sources that are available for the application of
interest. The typical architecture of an integration system
is described in terms of two types of modules: wrappers
and mediators [Wie92, Ull97]. The goal of a wrapper is
to access a source, extract the relevant data, and present
such data in a specified format. The role of a mediator is to
merge data produced by different wrappers (or mediators),
so as to meet a specific information need of the
integration system. The specification and the realization of
mediators is the core problem in the design of an integration
system. This problem has recently become a central issue
in several contexts, including multi-database systems, Data
Warehousing and information gathering from the Web.</p>
      <p>The constraints that are typical of Data Warehouse
applications restrict the large spectrum of approaches that are
being proposed [Hul97, Inm96, JLVV99]. First, while the
sources on the Web are often external, in a Data Warehouse
they are mostly internal to the organization. Second, a Data
Warehouse should reflect the informational needs of an
organization, and should therefore be defined in terms of a
global, corporate view of data. Third, such a view should
be provided in terms of a conceptual representation
mechanism that is able to abstract from the physical and logical
organization of data in the sources. It follows that the need
and requirements for maintaining an integrated, conceptual
view of the corporate data in the organization are stronger
than in other contexts. A direct consequence of this
fact is that the data in the sources and in the Data
Warehouse should be defined in terms of the conceptual model,
and not the other way around. In other words, data
integration in Data Warehousing should follow the local-as-view
approach, where each table in a source and in the Data
Warehouse is defined as a view of a global model of the
corporate data. On the contrary, the global-as-view approach
requires, for each information need, to specify the
corresponding query in terms of the data at the sources, and is
therefore suited when no global, integrated view of the data
of the organization is available.</p>
      <p>The above considerations motivate the approach to
information integration proposed in [CDGL+98d], whose
distinguishing feature is to exploit the possibility of
representing the conceptual level of a Data Warehouse in a
very expressive language and use reasoning tools to support
the Data Warehouse construction, maintenance and
evolution. In fact, the idea is to balance the effort of building a
conceptual model of the Data Warehouse by improving the
capabilities of the system in maintaining the Data
Warehouse and in supporting the incremental addition of information
sources. The proposed approach follows a local-as-view
paradigm, by explicitly requiring an enterprise conceptual
model which is therefore regarded as a unified view of the
data available within the organization.</p>
      <p>Most of the work on integration has been concerned
with the intensional/schema level, while less attention has
been devoted to the problem of data integration at the
extensional level. Integration of data is, nonetheless, at the heart
of Data Warehousing [Inm96]. When data passes from the
application-oriented operational environment to the
Warehouse, possible inconsistencies and redundancies should be
resolved, so that the Warehouse is able to provide an
integrated and reconciled view of data of the organization.
Thus, in the context of a Data Warehouse, data integration
and reconciliation is the process of acquiring data from the
sources and making them available within the Warehouse.</p>
      <p>Given a request for data (e.g., for materializing a new
relation in the Data Warehouse), which is formulated in
terms of the global view of the corporate data (i.e., not in
the language of the sources, but of the enterprise), there are
several steps that enable the acquisition of data from the
sources:
1. Identification of the sources where the relevant
information resides. Note that this task is typical of the
local-as-view approach, and requires algorithms that
are generally both sophisticated and costly [AD98,
LMSS95].
2. Decomposition of the user request into queries to
individual sources that would return the data of interest.
3. Interpretation of the data provided by a source.
Interpreting data can be regarded as the task of casting
them into a common representation, which can
thereafter be used to manipulate the data.
4. Merging of the data. The data returned by various
sources need to be combined to provide the Data
Warehouse with the requested information.</p>
      <p>In commercial environments for Data Warehouse design
and management the above tasks are taken care of through
ad-hoc components [JLVV99]. In general, such a
component provides the user with the capability of specifying the
mapping between the sources and the Data Warehouse by
browsing through a meta-level description of the relations
of the sources. In addition, it generally provides both for
automatic code generators and for the possibility of
attaching procedures to accomplish ad hoc transformations and
filtering of the data. Even though there are powerful and
effective environments with the above features, their nature
is inherently procedural and close to the notion of global-as-view,
where the task of relating the sources with the Data
Warehouse is done on a query-by-query basis.</p>
      <p>Several recent research contributions address the same
problem from a more formal perspective [HGMW+95,
Wid95, GM95, HZ96, ZHK96, ZHKF95, PGMW95,
GMS94]. For example, a methodology for extracting,
comparing and matching data and objects located in different
sources is described in [PGMW95]. The methodology is
based on the Object Exchange Model, which requires the
explicit semantic labeling of the objects, to support object
exchange, and emphasizes the need for a tight interaction
between the system and the user. However, the method
remains of procedural nature, since it requires the user to
build and maintain the relationship between the sources and
the Data Warehouse on a query-by-query basis.</p>
      <p>The approach proposed in [GMS94] is more declarative
in nature. Suitable data structures for reconciling different
representations of the same data are represented in a
context theory, which is used by the system to transform the
queries as appropriate for gathering the data from the
various sources. In such a declarative approach, the user is not
directly concerned with the identification and resolution of
semantic conflicts when formulating the requests for data.
Rather, once the specification of the sources is available,
conflicts are detected by the system, and conversion and
filtering are automatically enforced. However, the method
still follows the global-as-view approach, and the context
theory is used as a description of reconciled data structures,
rather than as the conceptual model of the corporate data.</p>
      <p>In this paper we present the approach to data
integration and reconciliation proposed within the DWQ (Data
Warehouse Quality) project [JJQV98, JLVV99]. In DWQ,
the ultimate goal of source integration and data
reconciliation is to represent the migration of the data from the
sources to the Data Warehouse, in order to support the
design of materialized views that meet user requirements, and
have high quality with respect to correctness,
interpretability, usefulness, and believability. The method for data
integration and reconciliation builds upon and extends the work
in [CDGL+98d], therefore relying on the availability of a
Conceptual Model to declaratively represent the
relationship between the sources and the Data Warehouse. The
declarative approach is further pursued in the task of data
integration and reconciliation, where the system is given a
declarative description of the data in the sources and
provides automatic support in satisfying the data requests for
populating the Data Warehouse.</p>
      <p>[Figure 1: Architecture of the DWQ framework. The Conceptual Model constitutes the conceptual level; the Source Schemas and the Data Warehouse Schema constitute the logical level; the Source Data Stores, the wrappers, the mediators, and the Data Warehouse Store constitute the physical level. The levels are related by conceptual/logical mappings, physical/logical mappings, and the data flow.]</p>
      <p>Compared with the existing proposals mentioned above,
the novelty of our approach stems from the following
features:</p>
      <p>It relies on the Conceptual Model of the corporate
data, which is expressed in an Entity-Relationship
formalism.</p>
      <p>It follows the local-as-view paradigm.</p>
        <p>It allows the designer to declaratively specify
several types of correspondences between data in
different schemas (either source schemas or Data
Warehouse schema). Three types of Interschema
Correspondences are taken into account, namely
Conversion, Matching, and Reconciliation Correspondences.
It uses such correspondences for supporting the task
of specifying the correct mediators for the loading of
the materialized views of the Data Warehouse.</p>
        <p>Our methodology relies on a novel query rewriting
algorithm, whose role is to reformulate the query that defines
the view to materialize in terms of both the source relations
and the interschema correspondences.</p>
        <p>The paper is organized as follows. In Section 2, we
summarize the relevant features of the proposed approach to
information integration. Section 3 illustrates the method
we use to describe the content of the sources at the logical
level. Section 4 is devoted to a discussion of the
meaning and the role of interschema correspondences. Section 5
describes the query rewriting algorithm at the basis of our
approach to the design of mediators. Section 6 concludes
the paper.</p>
    </sec>
        <sec id="sec-8-1-1">
          <title>2 The DWQ framework</title>
          <p>In this section we briefly describe the general framework
adopted in the DWQ project [CDGL+98d]. The proposed
framework allows one to explicitly model data and
information needs – i.e., a specification of the data that the
Data Warehouse provides to the user – at various
levels [CDGL+98b, CDGL+98d, CDGL+98c]:
The conceptual level contains a conceptual
representation of the corporate data.</p>
          <p>The logical level contains a representation in terms of
a logical data model of the sources and of the data
materialized in the Data Warehouse.</p>
          <p>The physical level contains a store for the
materialized data, wrappers for the sources and mediators for
loading the materialized data store.</p>
          <p>The relationship between the conceptual and the logical,
and between the logical and the physical level is
represented explicitly by specifying mappings between
corresponding objects of the different levels.</p>
          <p>We briefly describe the conceptual and logical levels,
referring to the abstract architecture of DWQ as depicted
in Figure 1.</p>
          <p>The Conceptual Model is a conceptual representation
of the data managed by the enterprise, including a
conceptual representation of the data residing in sources, and
of the global concepts and relationships that are of
interest to the Data Warehouse application. The
conceptual model is expressed in terms of an enriched
Entity-Relationship model in which complex entity and
relationship expressions can be constructed and used, and in which
interdependencies between elements of different sources
and of the enterprise are captured using intermodel
assertions [CDGL+98b, CL93]. Intermodel assertions provide a
simple and effective declarative mechanism to express the
dependencies that hold between entities (i.e. classes and
relationships) in different models [Hul97]. The use of
intermodel assertions allows for an incremental approach to the
integration of the conceptual models of the sources and of
the enterprise. Due to space limitations, we cannot
consider this aspect in further detail, and refer the interested
reader to [CDGL+98b].</p>
          <p>The conceptual representation contains, besides entities
and relationships, also a description of domains, which are
used to typify the attributes of entities and relationships.</p>
          <p>We provide an example of the form of the Conceptual
Model, and refer to [CDGL+98b] for a more detailed
description of the adopted formalism.</p>
          <p>Rather than considering only concrete domains, such as
strings, integers, and reals, our approach is based on the
use of abstract domains. An abstract domain may have
an underlying concrete domain, but allows the designer to
distinguish between the different meanings that a value of
the concrete domain may have. In addition, Boolean
combinations of domains and the possibility to construct an
ISA hierarchy between domains are supported.</p>
          <p>Example 1 Consider two attributes A1 in a source and
A2 in the Data Warehouse, both representing amounts of
money. Rather than specifying that both attributes have
values of type real, the designer may specify that the domain
of attribute A1 is MoneyInLire while the domain of
attribute A2 is MoneyInEuro, both of which have real
as the underlying concrete domain. In this way, it
becomes possible to specify declaratively the difference
between values of the two attributes, and to take such
knowledge into account when loading data from the source into the
Data Warehouse.</p>
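          <p>As an illustration (the declaration syntax and the symbols below are merely indicative, not the notation prescribed by the formalism), such a pair of abstract domains and the typing of the two attributes could be pictured as follows:</p>
          <preformat>
MoneyInLire ⊑ Real      % abstract domain whose underlying concrete domain is real
MoneyInEuro ⊑ Real      % abstract domain whose underlying concrete domain is real
A1 :: MoneyInLire       % attribute A1 of the source table
A2 :: MoneyInEuro       % attribute A2 of the Data Warehouse table
          </preformat>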
          <p>Example 2 As our running example we consider an
enterprise and two sources containing information about
contracts between customers and departments for services, and
about registration of customers at departments. Source 1
contains information about customers registered at
public-relations departments. Source 2 contains information about
contracts and complete information about services. Such a
situation can be represented by means of the ER diagrams
shown in Figure 2, together with a set of intermodel
assertions relating the elements of the sources to those of the
enterprise (⊑ represents ISA, while ≡ represents equivalence).</p>
          <p>[Figure 2: ER diagrams of the enterprise and of the two sources for the running example. Attributes are typed by the abstract domains DeptNameString, SSNString, DeptCodeInteger, and ServNoInteger, whose underlying concrete domains are String, String, Integer, and Integer, respectively.]</p>
          <p>At the logical level, the logical content of each source,
called the Source Schema (see Section 3), is provided in
terms of a set of relational tables using the relational model.
The link between the logical representation and the
conceptual representation of the source is formally defined by
associating with each table a query that describes its content
in terms of a query over the Conceptual Model. In other
words, the logical content of a source table is described in
terms of a view over the Conceptual Model. To map
physical structures to logical structures we make use of
suitable wrappers, which encapsulate the sources. The wrapper
hides how the source actually stores its data, the data model
it adopts, etc., and presents the source as a set of relational
tables. In particular, we assume that all attributes in the
tables are of interest to the Data Warehouse application
(attributes that are not of interest are hidden by the wrapper).
The logical content of the materialized views constituting
the Data Warehouse, called the Data Warehouse Schema
(see Section 4), is provided in terms of a set of relational
tables. Similarly to the case of the sources, each table of
the Data Warehouse Schema is described in terms of a view
over the Conceptual Model. As we said before, the way in
which a view is actually materialized, starting from the data
in the sources, is specified by means of mediators.</p>
          <p>In such a framework, we have devised suitable inference
techniques, which allow for carrying out several reasoning
services on both the conceptual representation, such as
inferring inclusion between entities and relationships,
satisfiability of entities, etc. [CDGL+98d], and the logical
representation, such as query containment [CDGL98a], which
is at the basis of query rewriting. The possibilities offered
by such reasoning tools are used in the accomplishment of
several activities concerning both the design and the
operation of the Data Warehouse.</p>
        </sec>
        <sec id="sec-8-1-2">
          <title>3 Source schema description</title>
          <p>Generally speaking, the connection to the Conceptual
Model is established by defining each table as a relational
query over the elements of the Conceptual Model. A query q for
a Conceptual Model M is a non-recursive Datalog query whose
atoms refer to the elements of M; a tuple belongs to the answer
of q over an interpretation I of M if the body of q
evaluates to true in I.</p>
          <p>The fact that a relation in a source is defined in terms
of a query over the Conceptual Model confirms that we are
following the local-as-view approach: each table is seen as
a view of the virtual database represented by the
Conceptual Model.</p>
          <p>To make the connection to the Conceptual Model precise,
it is not sufficient to define each table as a relational query
over the elements of the Conceptual Model. We need to
make it explicit how the objects of the conceptual
representation are coded into values of the logical
representation. The notion of adorned query is introduced exactly for
this purpose. An adorned query consists of a head, a body,
and an adornment.</p>
          <p>The head defines the relational schema of the table in
terms of a name, and the number of columns.</p>
          <p>The body describes the content of the table in terms of
a query over the Conceptual Model.</p>
          <p>The adornment declares the domains of the columns
of the table, and also which are the attributes of the
table that are used to identify an entity of the Conceptual
Model. For example, the designer may assert that the
attributes first name, last name, and
date of birth in a table are used to identify
students.</p>
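          <p>Schematically, and under an assumed concrete syntax for the adornment (the symbols :: and identify below are indicative only), an adorned query for a table T with n columns can be pictured as follows:</p>
          <preformat>
T(x1, ..., xn) ← body(x1, ..., xn, y1, ..., ym)
    | x1 :: V1, ..., xn :: Vn,         % domains of the columns of T
      identify([xi1, ..., xih], X)     % the listed columns identify the conceptual object X
          </preformat>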
          <p>Example 2 (cont.) Assuming that in Source 1 a customer
is actually identified by its social security number, and a
department by its name, we can specify the relational table
TABLE1 by means of a suitable adorned query over the
Conceptual Model. Additionally, we assume that in Source 2 the
actual data can be described in terms of a relational table
TABLE2 with four columns, two for the customer, one for the
service the customer has registered, and one for the department.
As in Source 1, in Source 2 departments are still identified
by their name, but, differently from Source 1, customers
are identified by their name and date of birth. Services are
identified by a unique service number. Hence a corresponding
adorned query is used to specify TABLE2.</p>
          <p>We point out that our method is able to cope with
several schematic differences that may be present in the
sources [SK92]. We illustrate this point with the help of
an example.</p>
          <p>Example 3 Suppose that the Conceptual Model contains
a relationship Service with three attributes, Date,
ServiceNo, and Price, where Service(D, S, P) means that at the
date D the service S costs P Euro. Suppose that Source S1
represents the same kind of information only on Services v1
and v2, by means of two tables, where v1(D, P) means that
service v1 costs P Italian Lira at date D, and v2(D, P) means
that service v2 costs P Italian Lira at date D. Suppose that
Source S2 represents the same kind of information only on
Services v3 and v4 by means of a table Serv, where
Serv(X, Y, D) means that services v3 and v4 cost X and Y Euro
respectively at date D. Finally, suppose that Source S3
represents the information only for a certain date d by means of
another table Serv3. The various tables in the three sources
can be specified by means of corresponding adorned queries.</p>
          <p>The above example illustrates a case where there are
various schematic differences, both among the sources, and
between the sources and the Conceptual Model. The
mechanisms used in our methodology for specifying adorned
queries are able to cope with such differences.</p>
          <p>The adorned query associated to a table in a source
contains a lot of information that can be profitably used in
analyzing the quality of the Data Warehouse design process.
Indeed, the adorned query precisely formalizes the content
of a source table in terms of a query over the Conceptual
Model, the domains of each attribute of the table, and the
attributes used to identify entities at the conceptual level.</p>
          <p>One important check that we can carry out over the
logical specification of a source is whether the adorned query
associated with a table in a source is consistent or not. Let Q
be an adorned query and let B be its body. The query Q is
said to be inconsistent with respect to the Conceptual Model
M if for every database DB coherent with M, the evaluation
of B with respect to DB is empty. An adorned query can be
inconsistent with respect to M either because its body is
inconsistent with respect to M, or because the annotations are
incoherent with respect to what is specified in M. The
inference techniques described in [CDGL+98d] allow us to check
the consistency of the relational tables defined for describing
a source.</p>
        </sec>
        <sec id="sec-8-1-3">
          <title>4 Interschema Correspondences</title>
          <p>We now describe how to define Interschema
Correspondences, which are used to declaratively specify the
correspondences between data in different schemas (either
source schemas or data warehouse schema).</p>
          <p>In our approach, Interschema Correspondences are
defined in terms of relational tables, similarly to the case of
the relations describing the sources at the logical level. The
difference with source relations is that we conceive
interschema correspondences as non-materialized relational
tables, in the sense that their extension is computed by an
associated program whenever it is needed. It follows that,
to each interschema correspondence, we associate a head,
a body, and an adornment. Differently from the case of
a source relation, the adornment specifies the
program that is able to compute the extension of the virtual
table.</p>
          <p>We distinguish among three types of correspondences,
namely Conversion, Matching, and Reconciliation
Correspondences.</p>
          <p>Conversion Correspondences are used to specify that
data in one source can be converted into data of a different
source or of the Data Warehouse, and how this conversion
is performed. They are used to anticipate several types of
data conflicts that may occur in loading data.</p>
          <p>As an example, suppose that in a table of a source costs
are represented in Italian Lira, while in a table of the Data
Warehouse we want to express them in Euro. Then, in
order to use the source table in the rewriting of a query that
defines the Data Warehouse table, it is necessary to know
about the possibility of converting each amount in Italian
Lira into an amount in Euro.</p>
          <p>A Conversion Correspondence has the following form:</p>
          <p>convert([~x], [~y]) ← conj(~x, ~y, ~z) through program(~x, ~y, ~z)</p>
          <p>where conj is a conjunctive query, which specifies the
conditions under which the conversion is applicable, and
program is a predicate that we assume associated to a
program that performs the conversion. In general, the program
needs to take into account the additional parameters
specified in the condition to actually perform the conversion.</p>
          <p>The conversion has a direction. In particular, it operates
from a tuple of values satisfying the conditions specified
for ~x in conj to a tuple of values satisfying the conditions
specified for ~y. This means that the conversion program
receives as input a tuple ~x, and returns the corresponding
tuple ~y, possibly using the additional parameters ~z to
perform the conversion.</p>
          <p>Matching Correspondences are used to specify how data
in different sources can match. A Matching Correspondence
has the following form:</p>
          <p>match([~x1], ..., [~xk]) ← conj(~x1, ..., ~xk, ~z) through program(~x1, ..., ~xk, ~z)</p>
          <p>where conj specifies the conditions under which the
matching is applicable, and program is a predicate that we
assume associated to a program that performs the matching.
The program receives as input k tuples of values satisfying
the conditions (and possibly the additional parameters in
the condition) and returns whether they match or not.</p>
          <p>A Reconciliation Correspondence specifies that k tuples of
values ~x1, ..., ~xk coming from the sources are reconciled
to the tuple ~z in the Data Warehouse, and has the following
form:</p>
          <p>reconcile([~x1], ..., [~xk], [~z]) ← conj(~x1, ..., ~xk, ~z, ~w) through program(~x1, ..., ~xk, ~z, ~w)</p>
          <p>where conj specifies the conditions under which the
reconciliation is applicable, and program is a predicate that we
assume associated to a program that performs the
reconciliation. Therefore, the associated program receives as input
tuples of values (and possibly the additional parameters in
the condition) and returns a reconciled tuple.</p>
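          <p>For instance, the conversion of amounts in Italian Lira into amounts in Euro discussed above could be captured by a Conversion Correspondence of the following kind (the correspondence and program names are assumed for illustration):</p>
          <preformat>
convert_money([X], [Y]) ← X :: MoneyInLire, Y :: MoneyInEuro
                          through lire_to_euro(X, Y)
          </preformat>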
          <p>Note that already specified Interschema
Correspondences may be used to define new ones. As an example,
the designer may want to define a Matching
Correspondence between two tuples by using two already defined
Conversion Correspondences, which convert to a common
representation, and then by using equality. Observe that, in this
case, the program associated to the Matching Correspondence
is empty, since the actual conversions are performed by the
Conversion Correspondences. Again, a Reconciliation
Correspondence could simply be defined as a combination of
appropriate Matching and Conversion Correspondences.</p>
          <p>In practice, several of the Interschema Correspondences
that must be specified will have a very simple form, since
they will correspond simply to equality in the case of a
matching and to identity in the case of a conversion.
Therefore, in order to simplify the task of the designer in
specifying the various interschema correspondences, we assume
that several correspondences are automatically asserted by
default by the system. In particular, for each domain D
in the conceptual model, a Conversion Correspondence and a
Matching Correspondence are specified by default, where the
conversion is performed through the program identity_D, which
computes the identity function for values of domain D, and the
matching correspondence has no associated program.</p>
          <p>The system allows the designer to inhibit the default
correspondences for a certain domain, simply by
providing an alternative interschema correspondence referring to
that domain.</p>
          <p>Moreover, we assume that for each Conversion
Correspondence convert_i asserted by the designer, the
system automatically asserts a new Matching Correspondence
match_i of the form</p>
          <p>match_i([~x], [~y]) ← convert_i([~x], [~z]), ~y = ~z through none</p>
          <p>Similarly, for each Conversion Correspondence convert_i
asserted by the designer and for each Matching
Correspondence match_j asserted by the designer or by default, the
system automatically asserts a new Reconciliation
Correspondence reconcile_i,j, obtained by combining match_j and
convert_i, with no associated program.</p>
          <p>Example 2 (cont.) A Conversion Correspondence specifies
that the name and date of birth of a person can be converted
into a Social Security Number through the program
name_to_SSN. Moreover, we add a Conversion Correspondence
which represents the fact that a department name can be
converted into a department code through the program
dept_name_to_code. According to the above rules, the system
automatically asserts (among others) the corresponding
Matching Correspondences and Reconciliation Correspondences.</p>
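          <p>Under the same indicative notation, and with assumed domain names, the Conversion Correspondence based on the program name_to_SSN could be written roughly as follows:</p>
          <preformat>
convert_person([N, B], [S]) ← N :: NameString, B :: DateString, S :: SSNString
                              through name_to_SSN(N, B, S)
          </preformat>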
        </sec>
        <sec id="sec-8-1-4">
          <title>5 Specification of mediators</title>
          <p>As we said in the introduction, the problem of data
integration and reconciliation is crucial for the task of
designing the mediators that load the data in the Data Warehouse.</p>
          <p>Such a task aims at specifying, for every relation in the Data
Warehouse Schema, how the tuples of the relation should
be constructed from a suitable set of tuples extracted from
the sources.</p>
          <p>Suppose we have decided to materialize a new relation
T in the Data Warehouse (to see how DWQ addresses the issue
of deciding what to materialize in the Data Warehouse, we
refer to [TLS99]). Our goal is to support the
designer in providing a formal specification for the design
of the mediator used to extract the correct data from the
sources, and to load such data in T. The methodology we
propose is based on the following steps.</p>
          <p>First, we specify T by means of an adorned query over
the Conceptual Model. Note that the adorned query associated
to a table in a source is the result of a reverse engineering
analysis of the source, whereas in this case the adorned query
is a specification of what we want to materialize in the
table of the Data Warehouse. Note also that we express
the semantics of T again in terms of the conceptual
model. Not only the sources, but also the relations in
the Data Warehouse are seen as views of such a
conceptual model.</p>
          <p>Then, we try to reformulate the query that defines T in
terms of the relations in the sources, and possibly in terms
of the matching, conversion, and reconciliation
predicates. If there are different rewritings, then we choose
the best rewriting with respect to suitable quality
parameters. There are several criteria to be taken into
consideration when evaluating the quality of a
rewriting, such as:</p>
          <p>Completeness of the rewriting. Obviously, the
best situation is the one where the rewriting is
complete, in the sense that the rewritten query is
equivalent to the original query. Such a check
can be done by exploiting the algorithm for
query containment.</p>
        <p>Accuracy, confidence, freshness, and availability
of data in the source relations that the rewriting
requires to access.</p>
        <p>The resulting query is the specification for the design
of the mediator associated to T. The most critical step
of the above method is the computation of the rewriting.</p>
        <p>Our rewriting algorithm is based on the method presented
in [DL97], modified to take into account the following
aspects:</p>
        <p>We deal with queries whose atoms refer to a
conceptual model that includes ISA assertions and a limited
form of functional dependencies. Such constraints
have to be considered in the computation of the
rewriting.</p>
        <p>We deal with queries that are disjunctions of
conjunctions. It follows that the rewriting itself is in general
a disjunction, and therefore we need to deal with the
problem of merging the results of several queries. This
problem is addressed by the notion of merging clause.
In particular, if the query r computed by the rewriting
is an or-query (i.e., it is constituted by more than one
disjunct), then the algorithm associates to r a suitable
set of so-called merging clauses, taking into account
that the answers to the different or-parts of the query
may contain objects and values that represent the same
real world entity or the same value. A merging clause
is an expression of the form</p>
        <p>merging tuple-spec1 and ... and tuple-specn
such that matching-condition
into tuple-spect1 and ... and tuple-spectm</p>
        <p>where tuple-speci denotes a tuple returned by the
i-th disjunct of r, and matching-condition specifies the
condition under which such tuples are merged into the
tuples specified in the into part.
We observe that the rewriting algorithm is able to
generate one merging clause template for each pair of
disjuncts that are not disjoint. Starting from such
templates, the designer may either specify the such that
and the into parts, depending on the intended
semantics, or change the templates in order to specify a
different merging plan (for example for merging three
disjuncts, rather than three pairs of disjuncts).</p>
        <p>The algorithm computes the maximally contained
rewriting (i.e., every other rewriting is included in the
one computed by the algorithm), but we also want to
inform the designer whether such a rewriting is
equivalent or not to the original query. Indeed, we have
devised an effective method for checking equivalence
between the original query and the computed
rewriting [CDGL98a].</p>
        <p>Besides the relational tables in the sources, our
rewriting algorithm takes into account the matching,
conversion, and reconciliation predicates defined in the
interschema correspondences.</p>
        <p>Even when no rewriting exists for the query (i.e.,
when the maximally contained rewriting is empty), we
want to provide the designer with useful indications
on whether there is a method for enriching the
Interschema Correspondences to get a non-empty
rewriting. Indeed, our rewriting algorithm adopts a form
of abductive reasoning that makes it possible to single out
which matching, conversion, and
reconciliation operations would allow a non-empty
rewriting to be obtained. This indication can be profitably used by
the designer to check whether she/he can add new
Interschema Correspondences in order to make the
computed rewriting complete.</p>
        <p>Example 2 (cont.) Suppose we want to store in the Data
Warehouse a relation containing the information about
customers that have a contract for a certain service with a
department at which they are also registered, or that are
eligible for a promotion. Independently of whether the
customer has a contract, we want to include the information
on whether he is eligible for a promotion. We can make use
of a relational table TDW with four components, defined by
a suitable adorned query, in which we assume that
in the Data Warehouse we want to identify customers by
their SSN, services by their service number, and
departments by their code.</p>
        <p>Using the asserted and automatically derived
Interschema Correspondences, the system is able to rewrite the
above query in terms of TABLE1 in Source 1 and TABLE2
in Source 2 (see Section 3). In this case the merging clause
simply reduces to a disjunction.</p>
        </sec>
        <sec id="sec-8-2-1">
          <title>6 Conclusions</title>
          <p>We have described a new approach to data integration and
reconciliation in Data Warehousing. The approach is based
on the availability of a Conceptual Model of the
corporate data, and allows the designer to declaratively specify
several types of correspondences between data in different
sources. Such correspondences are used by a query
rewriting algorithm that supports the task of specifying the
correct mediators for the loading of the materialized views of
the Data Warehouse.</p>
          <p>Based on the described methodology, we are currently
implementing a design tool within the DWQ project. The
tool is based on the ConceptBase system [Jar92], and
provides support for both schema and data integration in Data
Warehousing.</p>
        </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>Information and Knowledge Management (CIKM'94)</source>
          , pages
          <fpage>337</fpage>
          -
          <lpage>346</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [HGMW+95]
          <string-name>
            <surname>Joachim</surname>
            <given-names>Hammer</given-names>
          </string-name>
          , Hector Garcia-Molina, Jennifer Widom, Wilburt Labio, and
          <string-name>
            <given-names>Yue</given-names>
            <surname>Zhuge</surname>
          </string-name>
          .
          <article-title>The Stanford data warehousing project</article-title>
          .
          <source>IEEE Bulletin of the Technical Committee on Data Engineering</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[Hul97] [HZ96] [Inm96] [Jar92] [JJQV98] [JLVV99]</source>
          [LMSS95]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Inmon</surname>
          </string-name>
          .
          <article-title>Building the Data Warehouse</article-title>
          . John Wiley &amp; Sons, second edition,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Jarke</surname>
          </string-name>
          .
          <source>Conceptbase V3</source>
          .
          <article-title>1 user manual</article-title>
          .
          <source>Technical Report 92-17</source>
          ,
          <string-name>
            <given-names>Aachener</given-names>
            <surname>Informatik-Berichte</surname>
          </string-name>
          , Aachen, Germany,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>Architecture and quality in data warehouses</article-title>
          .
          <source>In Proc. of the 10th Conf. on Advanced Information Systems Engineering (CAiSE'98)</source>
          , volume
          <volume>1413</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>93</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          Springer-Verlag,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Jarke</surname>
          </string-name>
          , Maurizio Lenzerini, Yannis Vassiliou, and
          <string-name>
            <given-names>Panos</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          .
          <source>Fundamentals of Data Warehouses. SpringerVerlag</source>
          ,
          <year>1999</year>
          . In Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>Answering queries using views</article-title>
          .
          <source>In Proc. of the 14th ACM SIGACT SIGMOD SIGART Sym. on Principles of Database Systems (PODS'95)</source>
          , pages
          <fpage>95</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          , Hector GarciaMolina, and Jennifer Widom.
          <article-title>Object exchange across heterogeneous information sources</article-title>
          .
          <source>In Proc. of the 11th IEEE Int. Conf.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>on Data Engineering (ICDE'95)</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [TLS99]
          <article-title>[Ull97] [Wid95] [Wie92] [ZHK96] [ZHKF95] Amit Sheth and Vipul Kashyap. So far (schematically) yet so near (semantically).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>In Proc. of the IFIP DS-5 Conf. on Semantics of Interoperable Database Systems. Elsevier Science Publishers (North-Holland)</source>
          , Amsterdam,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Dimitri</given-names>
            <surname>Theodoratos</surname>
          </string-name>
          , Spyros Ligoudistianos, and
          <string-name>
            <given-names>Timos</given-names>
            <surname>Sellis</surname>
          </string-name>
          .
          <article-title>Designing the global Data Warehouse with SPJ views</article-title>
          .
          <source>In Proc. of the 11th Conf. on Advanced Information Systems Engineering (CAiSE'99)</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>of the 6th Int. Conf. on Database Theory (ICDT'97)</source>
          , volume
          <volume>1186</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>19</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          Springer-Verlag,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Jennifer</given-names>
            <surname>Widom</surname>
          </string-name>
          .
          <article-title>Special issue on materialized views and data warehousing</article-title>
          .
          <source>IEEE Bulletin on Data Engineering</source>
          ,
          <volume>18</volume>
          (
          <issue>2</issue>
          ),
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Gio</given-names>
            <surname>Wiederhold</surname>
          </string-name>
          .
          <article-title>Mediators in the architecture of future information systems</article-title>
          .
          <source>IEEE Computer</source>
          ,
          <volume>25</volume>
          (
          <issue>3</issue>
          ):
          <fpage>38</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>Generating data integration mediators that use materializations</article-title>
          .
          <source>J. of Intelligent Information Systems</source>
          ,
          <volume>6</volume>
          :
          <fpage>199</fpage>
          -
          <lpage>221</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Gang</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Richard Hull,
          <string-name>
            <given-names>Roger</given-names>
            <surname>King</surname>
          </string-name>
          , and
          <string-name>
            <surname>Jean-Claude Franchitti</surname>
          </string-name>
          .
          <article-title>Using object matching and materialization to integrate heterogeneous databases</article-title>
          .
          <source>In Proc. of the 3rd Int. Conf. on Cooperative Information Systems (CoopIS'95)</source>
          , pages
          <fpage>4</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>