=Paper=
{{Paper
|id=Vol-19/paper-17
|storemode=property
|title=A Principled Approach to Data Integration and Reconciliation in Data Warehousing
|pdfUrl=https://ceur-ws.org/Vol-19/paper16.pdf
|volume=Vol-19
|dblpUrl=https://dblp.org/rec/conf/dmdw/CalvaneseGLNR99
}}
==A Principled Approach to Data Integration and Reconciliation in Data Warehousing==
A Principled Approach to Data Integration and Reconciliation in
Data Warehousing
Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini,
Daniele Nardi, Riccardo Rosati
Dipartimento di Informatica e Sistemistica,
Università di Roma “La Sapienza”
Via Salaria 113, 00198 Roma, Italy
lastname@dis.uniroma1.it
http://www.dis.uniroma1.it/˜lastname
interest. The typical architecture of an integration systems
is described in terms of two types of modules: wrappers
Abstract and mediators [Wie92, Ull97]. The goal of a wrapper is
to access a source, extract the relevant data, and present
Integration is one of the most important aspects such data in a specified format. The role of a mediator is to
of a Data Warehouse. When data passes from the merge data produced by different wrappers (or mediators),
sources of the application-oriented operational en- so as to meet a specific information need of the integra-
vironment to the Data Warehouse, possible incon- tion system. The specification and the realization of me-
sistencies and redundancies should be resolved, so diators is the core problem in the design of an integration
that the warehouse is able to provide an integrated system. This problem has recently become a central issue
and reconciled view of data of the organization. in several contexts, including multi-database systems, Data
We describe a novel approach to data integration Warehousing and information gathering from the Web.
and reconciliation, based on a conceptual repre- The constraints that are typical of Data Warehouse ap-
sentation of the Data Warehouse application do- plications restrict the large spectrum of approaches that are
main. The main idea is to declaratively specify being proposed [Hul97, Inm96, JLVV99]. First, while the
suitable matching, conversion, and reconciliation sources on the Web are often external, in a Data Warehouse
operations to be used in order to solve possibile they are mostly internal to the organization. Second, a Data
conflicts among data in different sources. Such a Warehouse should reflect the informational needs of an or-
specification is provided in terms of the concep- ganization, and should therefore be defined in terms of a
tual model of the application, and is effectively global, corporate view of data. Third, such a view should
used during the design of the software modules be provided in terms of conceptual representation mecha-
that load the data from the sources into the Data nism that is able to abstract from the physical and logical
Warehouse. organization of data in the sources. It follows that the need
and requirements for maintaining an integrated, conceptual
1 Introduction view of the corporate data in the organization are stronger
with respect to other contexts. A direct consequence of this
Information Integration is the problem of acquiring data
fact is that the data in the sources and in the Data Ware-
from a set of sources that are available for the application of
house should be defined in terms of the conceptual model,
The copyright of this paper belongs to the paper’s authors. Permission to and not the other way around. In other words, data in-
copy without fee all or part of this material is granted provided that the tegration in Data Warehousing should follow the local as
copies are not made or distributed for direct commercial advantage. view approach, where each table in a source and in the Data
Proceedings of the International Workshop on Design and Warehouse is defined as a view of a global model of the cor-
Management of Data Warehouses (DMDW’99) porate data. On the contrary, the global as view approach
Heidelberg, Germany, 14. - 15.6. 1999 requires, for each information need, to specify the corre-
(S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.) sponding query in terms of the data at the sources, and is
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/ therefore suited when no global, integrated view of the data
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-1
of the organization is available. browsing through a meta-level description of the relations
The above considerations motivate the approach to in- of the sources. In addition, it generally provides both for
formation integration proposed in [CDGL+ 98d], whose automatic code generators and for the possibility of attach-
distinguishing feature is to exploit the possibility of rep- ing procedures to accomplish ad hoc transformations and
resenting the conceptual level of a Data Warehouse in a filtering of the data. Even though there are powerful and
very expressive language and use reasoning tools to support effective environments with the above features, their nature
the Data Warehouse construction, maintenance and evolu- is inherently procedural and close to the notion of global as
tion. In fact, the idea is to balance the effort of building a view, where the task of relating the sources with the Data
conceptual model of the Data Warehouse by improving the Warehouse is done on a query-by-query basis.
capabilities of the system in maintaining the Data Ware- Several recent research contributions address the same
house and support the incremental addition of information problem from a more formal perspective [HGMW+ 95,
sources. The proposed approach follows a local as view Wid95, GM95, HZ96, ZHK96, ZHKF95, PGMW95,
paradigm, by explicitly requiring an enterprise conceptual GMS94]. For example, a methodology for extracting, com-
model which is therefore regarded as a unified view of the paring and matching data and objects located in different
data available within the organization. sources is described in [PGMW95]. The methodology is
Most of the work on integration has been concerned based on the Object Exchange Model, which requires the
with the intensional/schema level, while less attention has explicit semantic labeling of the objects, to support object
been devoted to the problem of data integration at the exten- exchange, and emphasizes the need for a tight interaction
sional level. Integration of data is, nonetheless, at the heart between the system and the user. However, the method
of Data Warehousing [Inm96]. When data passes from the remains of procedural nature, since it requires the user to
application-oriented operational environment to the Ware- build and maintain the relationship between the sources and
house, possible inconsistencies and redundancies should be the Data Warehouse on a query-by-query basis.
resolved, so that the Warehouse is able to provide an in- The approach proposed in [GMS94] is more declarative
tegrated and reconciled view of data of the organization. in nature. Suitable data structures for reconciling different
Thus, in the context of a Data Warehouse, data integration representations of the same data are represented in a con-
and reconciliation is the process of acquiring data from the text theory, which is used by the system to transform the
sources and making them available within the Warehouse. queries as appropriate for gathering the data from the vari-
Given a request for data (e.g., for materializing a new ous sources. In such a declarative approach, the user is not
relation in the Data Warehouse), which is formulated in directly concerned with the identification and resolution of
terms of the global view of the corporate data, (i.e., not semantic conflicts when formulating the requests for data.
the language of the sources, but of the enterprise), there are Rather, once the specification of the sources is available,
several steps that enable for the acquisition of data from the conflicts are detected by the system, and conversion and
sources: filtering are automatically enforced. However, the method
still follows the global-as-view approach, and the context
1. Identification of the sources where the relevant infor-
theory is used as a description of reconciled data structures,
mation resides. Note that this task is typical of the
rather than as the conceptual model of the corporate data.
local-as-view approach, and requires algorithms that
In this paper we present the approach to data integra-
are generally both sophisticated and costly [AD98,
tion and reconciliation proposed within the DWQ (Data
LMSS95].
Warehouse Quality) project [JJQV98, JLVV99]. In DWQ,
2. Decomposition of the user request into queries to in- the ultimate goal of source integration and data reconcil-
dividual sources that would return the data of interest. iation is to represent the migration of the data from the
sources to the Data Warehouse, in order to support the de-
3. Interpretation of the data provided by a source. In- sign of materialized views that meet user requirements, and
terpreting data can be regarded as the task of casting have high quality with respect to correctness, interpretabil-
them into a common representation, which can there- ity, usefulness, believability. The method for data integra-
after be used to manipulate the data. tion and reconciliation builds upon and extends the work
in [CDGL+ 98d], therefore relying on the availability of a
4. Merging of the data. The data returned by various
Conceptual Model to declaratively represent the relation-
sources need to be combined to provide the Data
ship between the sources and the Data Warehouse. The
Warehouse with the requested information.
declarative approach is further pursued in the task of data
In commercial environments for Data Warehouse design integration and reconciliation, where the system is given a
and management the above tasks are taken care of through declarative description of the data in the sources and pro-
ad-hoc components [JLVV99]. In general, such a compo- vides automatic support in satisfying the data requests for
nent provides the user with the capability of specifying the populating the Data Warehouse.
mapping between the sources and the Data Warehouse by Compared with the existing proposals mentioned above,
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-2
Data Warehouse
Schema Mediators Data Warehouse
Store
Conceptual Model
Source Source
Schema1 Scheman Wrappers
conceptual level logical level physical level
conceptual/logical mapping
Sources
physical/logical mapping Source Source
data flow Data Store1 Data Storen
Figure 1: Architecture for Data Integration
the novelty of our approach stems from the following fea- The conceptual level contains a conceptual represen-
tures: tation of the corporate data.
It relies on the Conceptual Model of the corporate The logical level contains a representation in terms of
data, which is expressed in an Entity-Relationship for- a logical data model of the sources and of the data
malism. materialized in the Data Warehouse.
It follows the local-as-view paradigm. The physical level contains a store for the material-
It allows the designer to declaratively specify sev- ized data, wrappers for the sources and mediators for
eral types of correspondences between data in differ- loading the materialized data store.
ent schemas (either source schemas or Data Ware-
The relationship between the conceptual and the logical,
house schema). Three types of Interschema Corre-
and between the logical and the physical level is repre-
spondences are taken into account, namely Conver-
sented explicitly by specifying mappings between corre-
sion, Matching, and Reconciliation Correspondences.
sponding objects of the different levels.
It uses such correspondences for supporting the task We briefly describe the conceptual and logical levels,
of specifying the correct mediators for the loading of referring to the abstract architecture of DWQ as depicted
the materialized views of the Data Warehouse. in Figure 1.
The Conceptual Model is a conceptual representation
Our methodology relies on a novel query rewriting algo-
of the data managed by the enterprise, including a con-
rithm, whose role is to reformulate the query that defines
ceptual representation of the data residing in sources, and
the view to materialize in terms of both the source relations
of the global concepts and relationships that are of in-
and the interschema correspondences.
terest to the Data Warehouse application. The concep-
The paper is organized as follows. In Section 2, we sum-
tual model is expressed in terms of an enriched Entity-
marize the relevant features of the proposed approach to
Relationship model in which complex entity and relation-
information integration. Section 3 illustrates the method
ship expressions can be constructed and used, and in which
we use to describe the content of the sources at the logical
interdependencies between elements of different sources
level. Section 4 is devoted to a discussion of the mean-
and of the enterprise are captured using intermodel asser-
tions [CDGL+ 98b, CL93]. Intermodel assertions provide a
ing and the role of interschema correspondences. Section 5
describes the query rewriting algorithm at the basis of our
simple and effective declarative mechanism to express the
approach to the design of mediators. Section 6 concludes
dependencies that hold between entities (i.e. classes and re-
the paper.
lationships) in different models [Hul97]. The use of inter-
model assertions allows for an incremental approach to the
2 The DWQ framework integration of the conceptual models of the sources and of
In this section we briefly describe the general framework the enterprise. Due to space limitations, we cannot con-
adopted in the DWQ project [CDGL+ 98d]. The proposed sider this aspect in further detail, and refer the interested
framework allows one to explicitly model data and infor- reader to [CDGL+ 98b].
mation needs – i.e., a specification of the data that the The conceptual representation contains, besides entities
Data Warehouse provides to the user – at various lev- and relationships, also a description of domains, which are
els [CDGL+ 98b, CDGL+ 98d, CDGL+ 98c]: used to typify the attributes of entities and relationships.
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-3
Rather than considering only concrete domains, such as DeptNameString v String
strings, integers, and reals, our approach is based on the SSNString v String
use of abstract domains. An abstract domain may have DeptCodeInteger v Integer
ServNoInteger v Integer
an underlying concrete domain, but allows the designer to
distinguish between the different meanings that a value of
the concrete domain may have. Additionally, also Boolean
combinations of domains and the possibility to construct an At the logical level, the logical content of each source,
ISA hierarchy between domains are supported. called the Source Schema (see Section 3), is provided in
terms of a set of relational tables using the relational model.
Example 1 Consider two attributes A1 in a source and
The link between the logical representation and the concep-
A2 in the Data Warehouse, both representing amounts of tual representation of the source is formally defined by as-
money. Rather than specifying that both attributes have val-
sociating with each table a query that describes its content
ues of type real, the designer may specify that the domain
in terms of a query over the Conceptual Model. In other
of attribute A1 is MoneyInLire while the domain of at-
words, the logical content of a source table is described in
tribute A2 is MoneyInEuro, both of which have real
terms of a view over the Conceptual Model. To map phys-
as the underlying concrete domain. In this way, it be-
ical structures to logical structures we make use of suit-
comes possible to specify declaratively the difference be-
able wrappers, which encapsulate the sources. The wrapper
tween values of the two attributes, and take into account
hides how the source actually stores its data, the data model
such knowledge for loading data from the source to the
it adopts, etc., and presents the source as a set of relational
Data Warehouse.
tables. In particular, we assume that all attributes in the
We provide an example of the form of the Conceptual tables are of interest to the Data Warehouse application (at-
Model, and refer to [CDGL+ 98b] for a more detailed de- tributes that are not of interest are hidden by the wrapper).
scription of the adopted formalism. The logical content of the materialized views constituting
the Data Warehouse, called the Data Warehouse Schema
Example 2 As our running example we consider an en- (see Section 4), is provided in terms of a set of relational
terprise and two sources containing information about con- tables. Similarly to the case of the sources, each table of
tracts between customers and departments for services, and the Data Warehouse Schema is described in terms of a view
about registration of customers at departments. Source 1 over the Conceptual Model. As we said before, the way in
contains information about customers registered at public- which a view is actually materialized, starting from the data
relations departments. Source 2 contains information about in the sources, is specified by means of mediators.
contracts and complete information about services. Such In such a framework, we have devised suitable inference
situation can be represented by means of the ER diagrams techniques, which allow for carrying out several reasoning
shown in Figure 2, together with the following intermodel services on both the conceptual representation, such as in-
assertions (v represents ISA while represents equiva- ferring inclusion between entities and relationships, satis-
lence): fiability of entities, etc. [CDGL+ 98d], and the logical rep-
resentation, such as query containment [CDGL98a], which
is at the basis of query rewriting. The possibilities offered
Department1 PrDept0 by such reasoning tools are used in the accomplishment of
REG-AT1 v REG-AT0 several activities concerning both the design and the oper-
Customer1 Customer0 u ation of the Data Warehouse.
( 1 [$1](REG-AT0 u ($2: PrDept0 )))
Customer0 u 3 Source schema description
( 1 [$1]CONTRACT0 ) v ( 1 [$1]PROMOTION1 ) In this section we focus on the specification of the logical
Customer2 v Customer0 u schemas of the sources. Such schemas are intended to pro-
( 1 [$1]CONTRACT0 ) vide a structural description of the content of the sources,
Department2 v Department0 which are encapsulated by suitable wrappers.
Service2 Service0 We describe a source S by associating to each relational
CONTRACT2 v CONTRACT0
table T of S an adorned query that is constituted by a head,
a body, and an adornment:
Customer1 Customer0
Department1 Department0 The head defines the relational schema of the table in
terms of a name, and the number of columns.
and the following domain hierarchy:
The body describes the content of the table in terms of
PersNameString v String a query over the Conceptual Model.
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-4
SSN=SSNString Name=DeptNameString
1 2
ServNo=ServNoInteger Customer1 REG-AT1 Department1
Name=PersNameString Service0
SSN=SSNString 1 2
2
1 3 PROMOTION1
Customer0 CONTRACT0 Name=DeptNameString
Code=DeptCodeInteger
DOB=Date
1 2 ServNo=ServNoInteger
REG-AT0 Department0
Service2
Name=PersNameString Name=DeptNameString
2
PrDept0 1 3
Customer2 CONTRACT2 Department2
DOB=Date
Figure 2: Conceptual model of the application of Example 2
The adornment declares the domains of the columns over the elements of the Conceptual Model. We need to
of the table, and also which are the attributes of the ta- make it explicit how the objects of the conceptual repre-
ble that are used to identify an entity of the Conceptual sentation are coded into values of the logical representa-
Model. tion. The notion of adorned query is introduced exactly for
this purpose.
We now present in detail the notions introduced above. An adorned query is an expression of the form
3.1 Query over the Conceptual Model T (~x) q(~x; ~y) j 1 ; : : : ; n
Generally speaking, the connection to the Conceptual
Model is established by defining each table as a relational
where T is the name of the relational table, ~ are its at- x
tributes (observe that attributes denote values and not ob-
query over the elements of the Conceptual Model.
A query q for a Conceptual Model M is a non-recursive
(x y)
jects), q ~ ; ~ is a query as defined above, and each i is
Datalog query, written in the form:
x
an annotation on variables appearing in ~ . In particular:
q(~x) (x y ) OR OR conj (~x; ~y )
conj 1 ~ ; ~ 1 m m
x
1. For each X 2 ~ , we have an annotation of the form
where each conj (~x; ~y ) is a conjunction of atoms or
X :: V
negated atoms, and ~x; ~y are all the variables appearing in
i i
where V is a domain expression. Such an annotation
the conjunct. Each atom is either of the forms E (t) or of
i
is used to specify how values bound to X are repre-
the form R(~t), where ~t, t, and t are variables in ~x; ~y or
0
constants, and E and R, and entities and relationships of
i sented in the table at the logical level. For example,
which currency is used for a real value denoting an
M respectively. amount of money.
The semantics of queries is as follows. Given an inter-
pretation I of a Conceptual Model M with interpretation z x
2. For each tuple of variables ~ ~ that is used for iden-
domain I , a query q of arity n is interpreted as the set y
tifying in T an object Y 2 ~ mentioned in q ~ ; ~ , we (x y)
( )
qI of n-tuples d1 ; : : : ; dn , with each di 2 I , such that, have an annotation of the form
when substituting each di for xi , the formula
([z] )
identify ~ ; Y
9~y1 .conj 1 (~x; ~y1 ) OR OR 9~y .conj (~x; ~y )
m m m
For example, the designer may assert that the
evaluates to true in I . attributes first name, last name, and
The fact that a relation in a source is defined in terms date of birth in a table are used to identify
of a query over the Conceptual Model confirms that we are students.
following the local-as-view approach: each table is seen as
a view of the virtual database represented by the Concep- We point out that our method is able to cope with
tual Model. several schematic differences that may be present in the
sources [SK92]. We illustrate this point with the help of
3.2 Adornment an example.
To make the connection to the Conceptual Model precise, Example 3 Suppose that the Conceptual Model contains
it is not sufficient to define each table as a relational query a relationship Service with three attributes, Date,
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-5
ServiceNo, and Price, where Service D; S; P ( ) Example 2 (cont.) Assuming that in Source 1 a customer
means that at the date D the service S costs P Euro. Sup- is actually identified by its social security number, and a
pose that Source S1 represents the same kind of informa- department by its name, we can specify the relational table
tion only on Services v1 and v2 , by means of two tables: TABLE1 by the following adorned query:
( )
v1 and v2, where v1 D; P means that service v1 costs P
( )
Italian Lira at date D, and v2 D; P means that service v2
TABLE1 (S; M; P )
:
REG-AT1 (X; D); PROMOTION1 (X; D); P = false;
costs P Italian Lira at date D. Suppose that Source S2 rep- SSN(X; S ); Name(D; M )
resents the same kind of information only on Services v3 OR
and v4 by means of a table Serv, where Serv X; Y; D ( ) PROMOTION1 (X; D); P = true; SSN(X; S );
means that services v3 and v4 cost X and Y Euro respec- Name1 (D; M ) j
tively at date D. Finally, suppose that Source S3 represents identify ([S ]; X ); S :: SSNString;
the information only for a certain date d by means of an- identify ([M ]; D); M :: DeptNameString;
other table Serv3 . The various tables in the three sources P :: Boolean
can be specified by means of the following adorned queries:
Additionally, we assume that in Source 2 the actual data
can be described in terms of a relational table TABLE2 with
v1(D; P ) Service(D;0 v10 ; P ) j four columns, two for the customer, one for the service the
P :: ItalianLira; D :: Date
customer has registered, and one for the department. As
v2(D; P ) Service(D;0 v20 ; P ) j in Source 1, in Source 2 departments are still identified
P :: ItalianLira; D :: Date
by their name, but, differently from Source 1, customers
Serv(X; Y; D) Service(D;0 v30 ; X ); are identified by their name and date of birth. Services are
Service(D;0 v40 ; Y ) j identified by a unique service number. Hence the following
X :: Euro; Y :: Euro; D :: Date
adorned query is used to specify TABLE2 :
Serv3 (S 1; P ) Service(d; S 1; P );
Code(S; S 1) j TABLE2 (N; B; I ; M )
CONTRACT2 (X; S; D); Name(X; N ); ServNo(S; I );
P :: Euro;
identify ([S 1]; S ); S 1 :: String Name(D; M ) j
identify ([N; B ]; X ); N :: PersNameString;
B :: Date
identify ([I ]; S ); I :: ServNoInteger;
identify ([M ]; D); M :: DeptNameString
The above example illustrates a case where there are
various schematic differences, both among the sources, and
between the sources and the Conceptual Model. The mech-
anisms used in our methodology for specifying adorned
4 Interschema Correspondences
queries is able to cope with such differences. We now describe how to define Interschema Correspon-
The adorned query associated to a table in a source con- dences, which are used to declaratively specify the cor-
tains a lot of information that can be profitably used in an- respondences between data in different schemas (either
alyzing the quality of the Data Warehouse design process. source schemas or data warehouse schema).
Indeed, the adorned query precisely formalizes the content In our approach, Interschema Correspondences are de-
of a source table in terms of a query over the Conceptual fined in terms of relational tables, similarly to the case of
Model, the domains of each attribute of the table, and the the relations describing the sources at the logical level. The
attributes used to identify entities at the conceptual level. difference with source relations is that we conceive inter-
One important check that we can carry out over the logical schema correspondences as non-materialized relational ta-
specification of a source is whether the adorned query asso- bles, in the sense that their extension is computed by an
ciated with a table in a source is consistent or not. Let Q be associated program whenever it is needed. It follows that,
an adorned query and let B be its body. The query B is said to each interschema correspondence, we associate a head,
to be inconsistent with respect to the Conceptual Model M, a body, and an adornment. Differently from the case of
if for every database DB coherent with M, the evaluation a source relation, the adorment specifies which is the pro-
of B with respect to DB is empty. An adorned query Q is gram that is able to compute the extension of the virtual
inconsistent with respect to the Conceptual Model M ei- table.
ther because the body B of Q is inconsistent with respect We distinguish among three types of correspondences,
to M, or because the annotations are incoherent with re- namely Conversion, Matching, and Reconciliation Corre-
spect to what specified in M. The inference techniques de- spondences.
scribed in [CDGL+ 98d] allow us to check the consistency Conversion Correspondences are used to specify that
of the relational tables defined for describing a source. data in one source can be converted into data of a different
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-6
source or of the data warehouse, and how this conversion versions are performed by the programs associated to the
is performed. They are used to anticipate several types of Conversion Correspondences.
data conflicts that may occur in loading data. Reconciliation Correspondences are used to assert how
As an example, suppose that in a table of a source costs we can reconcile data in different sources into data of
are represented in Italian Lira, while in a table of the Data the data warehouse. A Reconciliation Correspondence
Warehouse we want to express them in Euro. Then, in or- reconcile has the following form:
der to use the source table in the rewriting of a query that
defines the Data Warehouse table, it is necessary to know reconcile ([~x1 ]; : : : ; [~x ]; [~z]) (x x z w)
conj ~ ; : : : ; ~ k ;~ ; ~
(x x z w)
k
about the possibility of converting each amount in Italian through program ~ 1 ; : : : ; ~ k ;~ ; ~
Lira into an amount in Euro.
A Conversion Correspondence convert has the follow- where conj specifies the conditions under which the recon-
ing form: ciliation is applicable, and program is a predicate that we
assume associated to a program that performs the reconcil-
convert ([~x]; [~y]) (x y z)
conj ~ ; ~ ;~ iation. Such correspondence specifies that the k tuples of
(x y z)
through program ~ ; ~ ;~ x x
values ~ 1 ; : : : ; ~ k coming from the sources are reconciled
where conj is a conjunctive query, which specifies the
z
to the tuple ~ in the Data Warehouse. Therefore, the as-
sociated program receives as input k tuples of values (and
conditions under which the conversion is applicable, and
possibly the additional parameters in the condition) and re-
program is a predicate that we assume associated to a pro-
turns a reconciled tuple.
gram that performs the conversion. In general, the program
Again, a Reconciliation Correspondence could simply
needs to take into account the additional parameters spec-
be defined as a combination of appropriate Matching and
ified in the condition to actually perform the conversion.
Conversion Correspondences, e.g.,
The conversion has a direction. In particular, it operates
from a tuple of values satisfying the conditions specified
x
for ~ in conj to a tuple of values satisfying the conditions
([x] [y] [z])
reconcile ~ ; ~ ; ~
y
specified for ~ . This means that the conversion program
([x] [w ])
convert 1 ~ ; ~ 1 ; convert 2 ~ ; ~ 2 ([y] [w ]);
x
receives as input a tuple ~ , and returns the corresponding
([w ] [w ])
match 1 ~ 1 ; ~ 2 ; convert 3 ~ 1 ; ~ ([w ] [z]);
y
tuple ~ , possibly using the additional parameter ~ to per- z (x y w w z)
conj ~ ; ~ ; ~ 1 ; ~ 2 ;~
through none
form the conversion.
Matching Correspondences are used to specify how data In practice, several of the Interschema Correspondences
in different sources can match. A Matching Correspon- that must be specified will have a very simple form, since
dence match has the following form: they will correspond simply to equality in the case of a
match ([~x1 ]; : : : ; [~x ]) (x x z)
conj ~ 1 ; : : : ; ~ k ;~
matching and to identity in the case of a conversion. There-
(x x z)
k
through program ~ ; : : : ; ~ k ;~
fore, in order to simplify the task of the designer in speci-
fying the various interschema correspondences, we assume
where conj specifies the conditions under which the match- that several correspondences are automatically asserted by
ing is applicable, and program is a predicate that we as- default by the system. In particular, for each domain D
sume associated to a program that performs the matching. in the conceptual model, the following Interschema Corre-
The program receives as input k tuples of values satisfying spondences are specified by default:
the conditions (and possibly the additional parameters in
the condition) and returns whether they match or not. convert ([X ]; [Y ]) D ( X ); D ( Y )
Note that already specified Interschema Correspon- ( )
through identity X; Y
dences may be used to define new ones. As an example, match ([X ]; [Y ]) D(X ); D(Y ); X = Y
the designer may want to define a Matching Correspon- through none
dence between two tuples by using two already defined
reconcile ([X ]; [Y ]; [Z ]) D(X ); D(Y ); D(Z );
Conversion Correspondences, which convert to a common
X=Y
through identity (X; Z )
representation, and then by using equality. In this case,
he could provide the following definition of the Matching
Correspondence: where identity is the program that computes the identity
match ([~x]; [~y]) ([x] [z]); convert 2([~y]; [~z]);
convert 1 ~ ; ~ function for values of domain D, and the matching corre-
(x y z w)
conj ~ ; ~ ;~ ; ~ spondence has no associated program.
The system allows the designer to inhibit the default
through none
correspondences for a certain domain, simply by provid-
Observe that, in this case, the program associated to the ing an alternative interschema correspondence referring to
Matching Correspondence is empty, since the actual con- that domain.
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-7
Moreover, we assume that for each Conversion Cor-
respondence convert i asserted by the designer, the sys- and the Reconciliation Correspondences
tem automatically asserts a new Matching Correspondence
match i as follows: reconcile 1;1 ([N; D]; [S1 ]; [S2 ])
reconcile 3;2 ([M1 ]; [M2 ]; [C ])
([x] [y])
match i ~ ; ~ ([x] [z]); ~y = ~z
convert i ~ ; ~
through none
Moreover, for each Conversion Correspondence convert i 5 Specification of mediators
asserted by the designer and for each Matching Correspon- As we said in the introduction, the problem of data inte-
dence match j asserted by the designer or by default, the gration and reconciliation is crucial for the task of design-
system automatically asserts a new Reconciliation Corre- ing the mediators that load the data in the Data Warehouse.
spondence reconcile i;j as follows: Such a task aims at specifying, for every relation in the Data
reconcile i;j ([~x]; [~y]; [~z]) ([x] [y])
match i ~ ; ~ ;
Warehouse Schema, how the tuples of the relation should
([x] [z])
convert j ~ ; ~
be constructed from a suitable set of tuples extracted from
the sources.
through none
Suppose we have decided to materialize a new relation
Example 2 (cont.) The following Conversion Correspon- T in the Data Warehouse.1 Our goal is to support the de-
dence specifies that the name and date of birth of a person signer in providing a formal specification for the design
can be converted into a Social Security Number through the of the mediator used to extract the correct data from the
program name to SSN: sources, and to load such data in T . The methodology we
propose is based on the following steps.
convert 1 ([N; B ]; [S ])
PersNameString(N ); Date(B ); DOB(N; B ); 1. We apply the method described in Section 3 to provide
SSNString(S ) the specification of the relation T . In other words, we
through name to SSN(N; B; S ) specify T in terms of an adorned query
Moreover, we add the following Conversion Correspon- q q j c1 ; : : : ; c :
0
n
dence, which represents the fact that a department name
can be converted into a department code through the pro- Note that the adorned query associated to a table in a
gram dept name to code: source is the result of a reverse engineering analysis of
the source, whereas in this case the adorned query is a
convert 2 ([M ]; [C ]) specification of what we want to materialize in the ta-
DeptNameString(M ); DeptCodeInteger(C ) ble of the Data Warehouse. Note also that we express
through dept name to code(M; C )
the semantics of T again in terms of the conceptual
model. Not only the sources, but also the relations in
According to the above rules, the system asserts automat-
the Data Warehouse are seen as views of such a con-
ically (among others) the Matching Correspondence and
ceptual model.
Conversion Correspondences
2. We look for a rewriting of q in terms of the queries
match 1 ([N; B ]; [S ]) convert 1 ([N; B ]; [S1 ]); S = S1 q1 ; : : : ; qs that correspond to the materialized views in
through none the Data Warehouse. If a complete, equivalent rewrit-
ing exists, then the new table can be derived from the
match 2 ([M ]; [C ]) convert 2 ([M ]; [C1 ]); C = C1
through none existing tables in the Data Warehouse. Otherwise, the
algorithm is able to single out the part that cannot be
match 3 ([M1 ]; [M2 ]) DeptNameString(M1 );
derived from the Data Warehouse, and that must be
DeptNameString(M2 );
loaded from the sources. In the following, q denotes
M1 = M2
through none such part.
convert 4 ([S1 ]; [S2 ]) SSNString(S1 ); 3. We look for a rewriting of q in terms of the queries cor-
SSNString(S2 ) responding to the tables in the sources. The rewriting
through identity (S1 ; S2 ) aims at expressing the data in T in terms of a disjunc-
convert 5 ([P1 ]; [P2 ]) Boolean(P1 ); Boolean(P2 ) tion of conjunctive queries where each atom refers to
through identity (P1 ; P2 )
a table in a source, or
convert 6 ([I1 ]; [I2 ]) ServNoInteger(I1 );
ServNoInteger(I2 ) 1 To see how DWQ addresses the issue of deciding what to materialize
through identity (I1 ; I2 ) in the Data Warehouse, we refer to [TLS99].
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-8
a matching, conversion, or reconciliation pred- how to merge the various tuples denoted by tuple-
icate defined in the Interschema Correspon- spec1 ,. . . ,tuple-specn, and tuple-spect1 ,. . . ,tuple-
dences. spectm denote the tuples in T resulting from the
merging.
In other words, we are trying to reformulate q in terms
of the relations in the sources, and possibly in terms We observe that the rewriting algorithm is able to gen-
of the matching, conversion, and reconciliation predi- erate one merging clause template for each pair of dis-
cates. If there are different rewritings, then we choose juncts that are not disjoint. Starting from such tem-
the best rewriting r with respect to suitable quality pa- plates, the designer may either specify the such that
rameters. There are several criteria to be taken into and the into parts, depending on the intended seman-
consideration when evaluating the quality of a rewrit- tics, or change the templates in order to specify a dif-
ing, such as: ferent merging plan (for example for merging three
disjuncts, rather than three pairs of disjuncts).
Completeness of the rewriting. Obviously, the
best situation is the one where the rewriting is
The algorithm computes the maximally contained
complete, in the sense that the rewritten query is
rewriting (i.e., every other rewriting is included in the
equivalent to the original query. Such a check
one computed by the query), but we also want to in-
can be done by exploiting the algorithm for
form the designer whether such a rewriting is equiv-
query containment.
alent or not to the original query. Indeed, we have
Accuracy, confidence, freshness, and availability devised an effective method for checking equivalence
of data in the source relations that the rewriting between the original query and the computed rewrit-
requires to access. ing [CDGL98a].
The resulting query is the specification for the design
Besides the relational tables in the sources, our rewrit-
of the mediator associated to T . The most critical step
ing algorithm takes into account the matching, con-
of the above method is the computation of the rewriting.
version, and reconciliation predicates defined in the
Our rewriting algorithm is based on the method presented
interschema correspondences.
in [DL97], modified to take into account the following as-
pects:
Even when no rewriting exists for the query (i.e.,
We deal with queries whose atoms refer to a concep- when the maximally contained rewriting is empty), we
tual model that includes ISA assertions and a limited want to provide the designer with useful indications
form of functional dependencies. Such constraints on whether there is a method for enriching the Inter-
have to be considered in the computation of the rewrit- schema Correspondences to get a non-empty rewrit-
ing. ing. Indeed, our rewriting algorithm adopts a form
of abductive reasoning that enables to single out the
We deal with queries that are disjunctions of conjunc- specification of which matching, conversion and rec-
tions. It follows that the rewriting itself is in general onciliation operations would allow to get a non-empty
a disjunction, and therefore, we need to deal with the rewriting. This indication can be profitably used by
problem of merging the results of several queries. This the designer to check whether she/he can add new In-
problem is addressed by the notion of merging clause. terschema Correspondences in order to make the com-
In particular, if the query r computed by the rewriting puted rewriting complete.
is an or-query (i.e., it is constituted by more than one
disjunct), then the algorithm associates to r a suitable
set of so-called merging clauses, taking into account Example 2 (cont.) Suppose we want to store in the Data
that the answers to the different or-parts of the query Warehouse a relation containing the information about cus-
may contain objects and values that represent the same tomers that have a contract for a certain service with a de-
real world entity or the same value. A merging clause partment at which they are also registered, or that are eli-
is an expression of the form gible for a promotion. Independently from the fact that the
customer has a contract, we want to include the information
merging tuple-spec1 and and tuple-specn on whether he is eligible for a promotion. We can make use
such that matching-condition of a relational table TDW with four components, defined by
into tuple-spect1 and and tuple-spectm the following adorned query, where we have assumed that
in the Data Warehouse we want to identify customers by
where tuple-speci denotes a tuple returned by the their SSN, services by their service number, and depart-
i-th disjunct of r, matching-condition specifies ments by their code:
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-9
SIGACT SIGMOD SIGART Sym. on Prin-
ciples of Database Systems (PODS’98),
TDW(S; I ; C; P )
pages 254–265, 1998.
CONTRACT0 (X; R; D); PROMOTION1 (X; D);
SSN(X; S ); ServNo(R; I ); Code(D; C ); F = true
OR [CDGL98a] Diego Calvanese, Giuseppe De Giacomo,
and Maurizio Lenzerini. On the decidability
CONTRACT0 (X; R; D); REG-AT1 (X; D);
:PROMOTION1 (X; D); SSN(X; S ); ServNo(R; I ); of query containment under constraints. In
Code(D; C ); P = false Proc. of the 17th ACM SIGACT SIGMOD
OR SIGART Sym. on Principles of Database
PROMOTION1 (X; D); SSN(X; S ); Code(D; C ); Systems (PODS’98), pages 149–158, 1998.
= true; I = NULL j
[CDGL+ 98b] Diego Calvanese, Giuseppe De Giacomo,
P
identify ([S ]; X ); S :: SSNString;
identify ([I ]; R); I :: ServNoInteger; Maurizio Lenzerini, Daniele Nardi, and
identify ([C ]; D); C :: DeptCodeInteger; Riccardo Rosati. Description logic frame-
P :: Boolean work for information integration. In Proc.
of the 6th Int. Conf. on the Principles of
Using the asserted and automatically derived Inter- Knowledge Representation and Reasoning
schema Correspondences, the system is able to rewrite the (KR’98), pages 2–13, 1998.
above query in terms of TABLE1 in Source 1 and TABLE2
[CDGL+ 98c] Diego Calvanese, Giuseppe De Giacomo,
in Source 2 (see Section 3) as follows:
Maurizio Lenzerini, Daniele Nardi, and
TDW(S0 ; I0 ; C; P0 ) Riccardo Rosati. Schema and data integra-
TABLE1 (S1 ; M1 ; P1 ); TABLE2 (N2 ; B; I2 ; M2 ); tion methodology for dwq. Technical Re-
reconcile 1;1 ([N2 ; B ]; [S1 ]; [S0 ]); port DWQ-UNIROMA-004, DWQ Consor-
reconcile 3;2 ([M1 ]; [M2 ]; [C ]); tium, September 1998.
convert 5 ([P1 ]; [P0 ]); convert 6 ([I2 ]; [I0 ])
OR [CDGL+ 98d] Diego Calvanese, Giuseppe De Giacomo,
^
TABLE1 (S1 ; M1 ; P1 ); S1 = NULL P1 = true; Maurizio Lenzerini, Daniele Nardi, and
convert 2 ([M1 ]; [C ]); convert 4 ([S1 ]; [S0 ]); Riccardo Rosati. Source integration in
convert 5 ([P1 ]; [P0 ]) data warehousing. In Proc. of the 9th Int.
Workshop on Database and Expert Systems
In this case the merging clause simply reduces to a disjunc- Applications (DEXA’98), pages 192–197.
tion. IEEE Computer Society Press, 1998.
[CL93] Tiziana Catarci and Maurizio Lenzerini.
6 Conclusions Representing and using interschema knowl-
edge in cooperative information systems. J.
We have described a new approach to data integration and
of Intelligent and Cooperative Information
reconciliation in Data Warehousing. The approach is based
Systems, 2(4):375–398, 1993.
on the availability of a Conceptual Model of the corpo-
rate data, and allows the designer to declaratively specify [DL97] Oliver M. Duschka and Alon Y. Levy. Re-
several types of correspondences between data in different cursive plans for information gathering. In
sources. Such correspondences are used by a query rewrit- Proc. of the 15th Int. Joint Conf. on Arti-
ing algorithm that supports the task of specifying the cor- ficial Intelligence (IJCAI’97), pages 778–
rect mediators for the loading of the materialized views of 784, 1997.
the Data Warehouse.
Based on the described methodology, we are currently [GM95] A. Gupta and I. S. Mumick. Maintenance of
implementing a design tool within the DWQ project. The materialized views: Problems, techniques,
tool is based on the Concept Base System [Jar92], and pro- and applications. IEEE Bulletin of the
vides support for both schema and data integration in Data Technical Committee on Data Engineering,
Warehousing. 18(2):3–18, 1995.
[GMS94] Cheng Hian Goh, Stuart E. Madnick, and
References
Michael Siegel. Context Interchange: Over-
[AD98] Serge Abiteboul and Oliver Duschka. Com- coming the challenges of large-scale inter-
plexity of answering queries using materi- operable database systems in a dynamic en-
alized views. In Proc. of the 17th ACM vironment. In Proc. of the 3rd Int. Conf. on
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-10
Information and Knowledge Management [SK92] Amit Sheth and Vipul Kashyap. So far
(CIKM’94), pages 337–346, 1994. (schematically) yet so near (semantically).
[HGMW+ 95] Joachim Hammer, Hector Garcia-Molina,
In Proc. of the IFIP DS-5 Conf. on Seman-
tics of Interoperable Database Systems. El-
Jennifer Widom, Wilburt Labio, and Yue
sevier Science Publishers (North-Holland),
Zhuge. The Stanford data warehousing
Amsterdam, 1992.
project. IEEE Bulletin of the Technical
Committee on Data Engineering, 18(2):41– [TLS99] Dimitri Theodoratos, Spyros Ligoudis-
48, 1995. tianos, and Timos Sellis. Designing the
global Data Warehouse with SPJ views. In
[Hul97] Richard Hull. Managing semantic hetero-
Proc. of the 11th Conf. on Advanced Infor-
geneity in databases: A theoretical perspec-
mation Systems Engineering (CAiSE’99),
tive. In Proc. of the 16th ACM SIGACT
1999.
SIGMOD SIGART Sym. on Principles of
Database Systems (PODS’97), 1997. [Ull97] Jeffrey D. Ullman. Information inte-
[HZ96] Richard Hull and Gang Zhou. A frame- gration using logical views. In Proc.
work for supporting data integration us- of the 6th Int. Conf. on Database The-
ing the materialized and virtual approaches. ory (ICDT’97), volume 1186 of Lecture
In Proc. of the ACM SIGMOD Int. Conf. Notes in Computer Science, pages 19–40.
on Management of Data, pages 481–492, Springer-Verlag, 1997.
1996. [Wid95] Jennifer Widom. Special issue on materi-
[Inm96] W. H. Inmon. Building the Data Ware- alized views and data warehousing. IEEE
house. John Wiley & Sons, second edition, Bulletin on Data Engineering, 18(2), 1995.
1996. [Wie92] Gio Wiederhold. Mediators in the architec-
[Jar92] M. Jarke. Conceptbase V3.1 user man- ture of future information systems. IEEE
ual. Technical Report 92–17, Aach- Computer, 25(3):38–49, 1992.
ener Informatik-Berichte, Aachen, Ger-
[ZHK96] Gang Zhou, Richard Hull, and Roger King.
many, 1992.
Generating data integration mediators that
[JJQV98] Matthias Jarke, Manfred A. Jeusfeld, use materializations. J. of Intelligent Infor-
Christoph Quix, and Panos Vassiliadis. mation Systems, 6:199–221, 1996.
Architecture and quality in data ware-
houses. In Proc. of the 10th Conf. on [ZHKF95] Gang Zhou, Richard Hull, Roger King,
and Jean-Claude Franchitti. Using object
Advanced Information Systems Engineer-
ing (CAiSE’98), volume 1413 of Lecture matching and materialization to integrate
Notes in Computer Science, pages 93–113. heterogeneous databases. In Proc. of the
Springer-Verlag, 1998. 3rd Int. Conf. on Cooperative Information
Systems (CoopIS’95), pages 4–18, 1995.
[JLVV99] Matthias Jarke, Maurizio Lenzerini, Yan-
nis Vassiliou, and Panos Vassiliadis. Fun-
damentals of Data Warehouses. Springer-
Verlag, 1999. In Press.
[LMSS95] Alon Y. Levy, Alberto O. Mendelzon,
Yehoshua Sagiv, and Divesh Srivastava.
Answering queries using views. In Proc. of
the 14th ACM SIGACT SIGMOD SIGART
Sym. on Principles of Database Systems
(PODS’95), pages 95–104, 1995.
[PGMW95] Yannis Papakonstantinou, Hector Garcia-
Molina, and Jennifer Widom. Object ex-
change across heterogeneous information
sources. In Proc. of the 11th IEEE Int. Conf.
on Data Engineering (ICDE’95), pages
251–260, 1995.
D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-11