A Principled Approach to Data Integration and Reconciliation in Data Warehousing Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, Daniele Nardi, Riccardo Rosati Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza” Via Salaria 113, 00198 Roma, Italy lastname@dis.uniroma1.it http://www.dis.uniroma1.it/˜lastname interest. The typical architecture of an integration systems is described in terms of two types of modules: wrappers Abstract and mediators [Wie92, Ull97]. The goal of a wrapper is to access a source, extract the relevant data, and present Integration is one of the most important aspects such data in a specified format. The role of a mediator is to of a Data Warehouse. When data passes from the merge data produced by different wrappers (or mediators), sources of the application-oriented operational en- so as to meet a specific information need of the integra- vironment to the Data Warehouse, possible incon- tion system. The specification and the realization of me- sistencies and redundancies should be resolved, so diators is the core problem in the design of an integration that the warehouse is able to provide an integrated system. This problem has recently become a central issue and reconciled view of data of the organization. in several contexts, including multi-database systems, Data We describe a novel approach to data integration Warehousing and information gathering from the Web. and reconciliation, based on a conceptual repre- The constraints that are typical of Data Warehouse ap- sentation of the Data Warehouse application do- plications restrict the large spectrum of approaches that are main. The main idea is to declaratively specify being proposed [Hul97, Inm96, JLVV99]. First, while the suitable matching, conversion, and reconciliation sources on the Web are often external, in a Data Warehouse operations to be used in order to solve possibile they are mostly internal to the organization. Second, a Data conflicts among data in different sources. Such a Warehouse should reflect the informational needs of an or- specification is provided in terms of the concep- ganization, and should therefore be defined in terms of a tual model of the application, and is effectively global, corporate view of data. Third, such a view should used during the design of the software modules be provided in terms of conceptual representation mecha- that load the data from the sources into the Data nism that is able to abstract from the physical and logical Warehouse. organization of data in the sources. It follows that the need and requirements for maintaining an integrated, conceptual 1 Introduction view of the corporate data in the organization are stronger with respect to other contexts. A direct consequence of this Information Integration is the problem of acquiring data fact is that the data in the sources and in the Data Ware- from a set of sources that are available for the application of house should be defined in terms of the conceptual model, The copyright of this paper belongs to the paper’s authors. Permission to and not the other way around. In other words, data in- copy without fee all or part of this material is granted provided that the tegration in Data Warehousing should follow the local as copies are not made or distributed for direct commercial advantage. view approach, where each table in a source and in the Data Proceedings of the International Workshop on Design and Warehouse is defined as a view of a global model of the cor- Management of Data Warehouses (DMDW’99) porate data. On the contrary, the global as view approach Heidelberg, Germany, 14. - 15.6. 1999 requires, for each information need, to specify the corre- (S. Gatziu, M. Jeusfeld, M. Staudt, Y. Vassiliou, eds.) sponding query in terms of the data at the sources, and is http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-19/ therefore suited when no global, integrated view of the data D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-1 of the organization is available. browsing through a meta-level description of the relations The above considerations motivate the approach to in- of the sources. In addition, it generally provides both for formation integration proposed in [CDGL+ 98d], whose automatic code generators and for the possibility of attach- distinguishing feature is to exploit the possibility of rep- ing procedures to accomplish ad hoc transformations and resenting the conceptual level of a Data Warehouse in a filtering of the data. Even though there are powerful and very expressive language and use reasoning tools to support effective environments with the above features, their nature the Data Warehouse construction, maintenance and evolu- is inherently procedural and close to the notion of global as tion. In fact, the idea is to balance the effort of building a view, where the task of relating the sources with the Data conceptual model of the Data Warehouse by improving the Warehouse is done on a query-by-query basis. capabilities of the system in maintaining the Data Ware- Several recent research contributions address the same house and support the incremental addition of information problem from a more formal perspective [HGMW+ 95, sources. The proposed approach follows a local as view Wid95, GM95, HZ96, ZHK96, ZHKF95, PGMW95, paradigm, by explicitly requiring an enterprise conceptual GMS94]. For example, a methodology for extracting, com- model which is therefore regarded as a unified view of the paring and matching data and objects located in different data available within the organization. sources is described in [PGMW95]. The methodology is Most of the work on integration has been concerned based on the Object Exchange Model, which requires the with the intensional/schema level, while less attention has explicit semantic labeling of the objects, to support object been devoted to the problem of data integration at the exten- exchange, and emphasizes the need for a tight interaction sional level. Integration of data is, nonetheless, at the heart between the system and the user. However, the method of Data Warehousing [Inm96]. When data passes from the remains of procedural nature, since it requires the user to application-oriented operational environment to the Ware- build and maintain the relationship between the sources and house, possible inconsistencies and redundancies should be the Data Warehouse on a query-by-query basis. resolved, so that the Warehouse is able to provide an in- The approach proposed in [GMS94] is more declarative tegrated and reconciled view of data of the organization. in nature. Suitable data structures for reconciling different Thus, in the context of a Data Warehouse, data integration representations of the same data are represented in a con- and reconciliation is the process of acquiring data from the text theory, which is used by the system to transform the sources and making them available within the Warehouse. queries as appropriate for gathering the data from the vari- Given a request for data (e.g., for materializing a new ous sources. In such a declarative approach, the user is not relation in the Data Warehouse), which is formulated in directly concerned with the identification and resolution of terms of the global view of the corporate data, (i.e., not semantic conflicts when formulating the requests for data. the language of the sources, but of the enterprise), there are Rather, once the specification of the sources is available, several steps that enable for the acquisition of data from the conflicts are detected by the system, and conversion and sources: filtering are automatically enforced. However, the method still follows the global-as-view approach, and the context 1. Identification of the sources where the relevant infor- theory is used as a description of reconciled data structures, mation resides. Note that this task is typical of the rather than as the conceptual model of the corporate data. local-as-view approach, and requires algorithms that In this paper we present the approach to data integra- are generally both sophisticated and costly [AD98, tion and reconciliation proposed within the DWQ (Data LMSS95]. Warehouse Quality) project [JJQV98, JLVV99]. In DWQ, 2. Decomposition of the user request into queries to in- the ultimate goal of source integration and data reconcil- dividual sources that would return the data of interest. iation is to represent the migration of the data from the sources to the Data Warehouse, in order to support the de- 3. Interpretation of the data provided by a source. In- sign of materialized views that meet user requirements, and terpreting data can be regarded as the task of casting have high quality with respect to correctness, interpretabil- them into a common representation, which can there- ity, usefulness, believability. The method for data integra- after be used to manipulate the data. tion and reconciliation builds upon and extends the work in [CDGL+ 98d], therefore relying on the availability of a 4. Merging of the data. The data returned by various Conceptual Model to declaratively represent the relation- sources need to be combined to provide the Data ship between the sources and the Data Warehouse. The Warehouse with the requested information. declarative approach is further pursued in the task of data In commercial environments for Data Warehouse design integration and reconciliation, where the system is given a and management the above tasks are taken care of through declarative description of the data in the sources and pro- ad-hoc components [JLVV99]. In general, such a compo- vides automatic support in satisfying the data requests for nent provides the user with the capability of specifying the populating the Data Warehouse. mapping between the sources and the Data Warehouse by Compared with the existing proposals mentioned above, D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-2 Data Warehouse Schema Mediators Data Warehouse Store Conceptual Model Source Source Schema1  Scheman Wrappers conceptual level logical level physical level conceptual/logical mapping Sources physical/logical mapping Source Source  data flow Data Store1 Data Storen Figure 1: Architecture for Data Integration the novelty of our approach stems from the following fea-  The conceptual level contains a conceptual represen- tures: tation of the corporate data.  It relies on the Conceptual Model of the corporate  The logical level contains a representation in terms of data, which is expressed in an Entity-Relationship for- a logical data model of the sources and of the data malism. materialized in the Data Warehouse.  It follows the local-as-view paradigm.  The physical level contains a store for the material-  It allows the designer to declaratively specify sev- ized data, wrappers for the sources and mediators for eral types of correspondences between data in differ- loading the materialized data store. ent schemas (either source schemas or Data Ware- The relationship between the conceptual and the logical, house schema). Three types of Interschema Corre- and between the logical and the physical level is repre- spondences are taken into account, namely Conver- sented explicitly by specifying mappings between corre- sion, Matching, and Reconciliation Correspondences. sponding objects of the different levels.  It uses such correspondences for supporting the task We briefly describe the conceptual and logical levels, of specifying the correct mediators for the loading of referring to the abstract architecture of DWQ as depicted the materialized views of the Data Warehouse. in Figure 1. The Conceptual Model is a conceptual representation Our methodology relies on a novel query rewriting algo- of the data managed by the enterprise, including a con- rithm, whose role is to reformulate the query that defines ceptual representation of the data residing in sources, and the view to materialize in terms of both the source relations of the global concepts and relationships that are of in- and the interschema correspondences. terest to the Data Warehouse application. The concep- The paper is organized as follows. In Section 2, we sum- tual model is expressed in terms of an enriched Entity- marize the relevant features of the proposed approach to Relationship model in which complex entity and relation- information integration. Section 3 illustrates the method ship expressions can be constructed and used, and in which we use to describe the content of the sources at the logical interdependencies between elements of different sources level. Section 4 is devoted to a discussion of the mean- and of the enterprise are captured using intermodel asser- tions [CDGL+ 98b, CL93]. Intermodel assertions provide a ing and the role of interschema correspondences. Section 5 describes the query rewriting algorithm at the basis of our simple and effective declarative mechanism to express the approach to the design of mediators. Section 6 concludes dependencies that hold between entities (i.e. classes and re- the paper. lationships) in different models [Hul97]. The use of inter- model assertions allows for an incremental approach to the 2 The DWQ framework integration of the conceptual models of the sources and of In this section we briefly describe the general framework the enterprise. Due to space limitations, we cannot con- adopted in the DWQ project [CDGL+ 98d]. The proposed sider this aspect in further detail, and refer the interested framework allows one to explicitly model data and infor- reader to [CDGL+ 98b]. mation needs – i.e., a specification of the data that the The conceptual representation contains, besides entities Data Warehouse provides to the user – at various lev- and relationships, also a description of domains, which are els [CDGL+ 98b, CDGL+ 98d, CDGL+ 98c]: used to typify the attributes of entities and relationships. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-3 Rather than considering only concrete domains, such as DeptNameString v String strings, integers, and reals, our approach is based on the SSNString v String use of abstract domains. An abstract domain may have DeptCodeInteger v Integer ServNoInteger v Integer an underlying concrete domain, but allows the designer to distinguish between the different meanings that a value of the concrete domain may have. Additionally, also Boolean combinations of domains and the possibility to construct an At the logical level, the logical content of each source, ISA hierarchy between domains are supported. called the Source Schema (see Section 3), is provided in terms of a set of relational tables using the relational model. Example 1 Consider two attributes A1 in a source and The link between the logical representation and the concep- A2 in the Data Warehouse, both representing amounts of tual representation of the source is formally defined by as- money. Rather than specifying that both attributes have val- sociating with each table a query that describes its content ues of type real, the designer may specify that the domain in terms of a query over the Conceptual Model. In other of attribute A1 is MoneyInLire while the domain of at- words, the logical content of a source table is described in tribute A2 is MoneyInEuro, both of which have real terms of a view over the Conceptual Model. To map phys- as the underlying concrete domain. In this way, it be- ical structures to logical structures we make use of suit- comes possible to specify declaratively the difference be- able wrappers, which encapsulate the sources. The wrapper tween values of the two attributes, and take into account hides how the source actually stores its data, the data model such knowledge for loading data from the source to the it adopts, etc., and presents the source as a set of relational Data Warehouse. tables. In particular, we assume that all attributes in the We provide an example of the form of the Conceptual tables are of interest to the Data Warehouse application (at- Model, and refer to [CDGL+ 98b] for a more detailed de- tributes that are not of interest are hidden by the wrapper). scription of the adopted formalism. The logical content of the materialized views constituting the Data Warehouse, called the Data Warehouse Schema Example 2 As our running example we consider an en- (see Section 4), is provided in terms of a set of relational terprise and two sources containing information about con- tables. Similarly to the case of the sources, each table of tracts between customers and departments for services, and the Data Warehouse Schema is described in terms of a view about registration of customers at departments. Source 1 over the Conceptual Model. As we said before, the way in contains information about customers registered at public- which a view is actually materialized, starting from the data relations departments. Source 2 contains information about in the sources, is specified by means of mediators. contracts and complete information about services. Such In such a framework, we have devised suitable inference situation can be represented by means of the ER diagrams techniques, which allow for carrying out several reasoning shown in Figure 2, together with the following intermodel services on both the conceptual representation, such as in- assertions (v represents ISA while  represents equiva- ferring inclusion between entities and relationships, satis- lence): fiability of entities, etc. [CDGL+ 98d], and the logical rep- resentation, such as query containment [CDGL98a], which is at the basis of query rewriting. The possibilities offered Department1  PrDept0 by such reasoning tools are used in the accomplishment of REG-AT1 v REG-AT0 several activities concerning both the design and the oper- Customer1  Customer0 u ation of the Data Warehouse. ( 1 [$1](REG-AT0 u ($2: PrDept0 ))) Customer0 u 3 Source schema description ( 1 [$1]CONTRACT0 ) v ( 1 [$1]PROMOTION1 ) In this section we focus on the specification of the logical Customer2 v Customer0 u schemas of the sources. Such schemas are intended to pro- ( 1 [$1]CONTRACT0 ) vide a structural description of the content of the sources, Department2 v Department0 which are encapsulated by suitable wrappers. Service2  Service0 We describe a source S by associating to each relational CONTRACT2 v CONTRACT0 table T of S an adorned query that is constituted by a head, a body, and an adornment: Customer1  Customer0 Department1  Department0  The head defines the relational schema of the table in terms of a name, and the number of columns. and the following domain hierarchy:  The body describes the content of the table in terms of PersNameString v String a query over the Conceptual Model. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-4 SSN=SSNString Name=DeptNameString 1 2 ServNo=ServNoInteger Customer1 REG-AT1 Department1 Name=PersNameString Service0 SSN=SSNString 1 2 2 1 3 PROMOTION1 Customer0 CONTRACT0 Name=DeptNameString Code=DeptCodeInteger DOB=Date 1 2 ServNo=ServNoInteger REG-AT0 Department0 Service2 Name=PersNameString Name=DeptNameString 2 PrDept0 1 3 Customer2 CONTRACT2 Department2 DOB=Date Figure 2: Conceptual model of the application of Example 2  The adornment declares the domains of the columns over the elements of the Conceptual Model. We need to of the table, and also which are the attributes of the ta- make it explicit how the objects of the conceptual repre- ble that are used to identify an entity of the Conceptual sentation are coded into values of the logical representa- Model. tion. The notion of adorned query is introduced exactly for this purpose. We now present in detail the notions introduced above. An adorned query is an expression of the form 3.1 Query over the Conceptual Model T (~x) q(~x; ~y) j 1 ; : : : ; n Generally speaking, the connection to the Conceptual Model is established by defining each table as a relational where T is the name of the relational table, ~ are its at- x tributes (observe that attributes denote values and not ob- query over the elements of the Conceptual Model. A query q for a Conceptual Model M is a non-recursive (x y) jects), q ~ ; ~ is a query as defined above, and each i is Datalog query, written in the form: x an annotation on variables appearing in ~ . In particular: q(~x) (x y ) OR OR conj (~x; ~y ) conj 1 ~ ; ~ 1  m m x 1. For each X 2 ~ , we have an annotation of the form where each conj (~x; ~y ) is a conjunction of atoms or X :: V negated atoms, and ~x; ~y are all the variables appearing in i i where V is a domain expression. Such an annotation the conjunct. Each atom is either of the forms E (t) or of i is used to specify how values bound to X are repre- the form R(~t), where ~t, t, and t are variables in ~x; ~y or 0 constants, and E and R, and entities and relationships of i sented in the table at the logical level. For example, which currency is used for a real value denoting an M respectively. amount of money. The semantics of queries is as follows. Given an inter- pretation I of a Conceptual Model M with interpretation z x 2. For each tuple of variables ~  ~ that is used for iden-  domain I , a query q of arity n is interpreted as the set y tifying in T an object Y 2 ~ mentioned in q ~ ; ~ , we (x y) ( ) qI of n-tuples d1 ; : : : ; dn , with each di 2 I , such that, have an annotation of the form when substituting each di for xi , the formula ([z] ) identify ~ ; Y 9~y1 .conj 1 (~x; ~y1 ) OR    OR 9~y .conj (~x; ~y ) m m m For example, the designer may assert that the evaluates to true in I . attributes first name, last name, and The fact that a relation in a source is defined in terms date of birth in a table are used to identify of a query over the Conceptual Model confirms that we are students. following the local-as-view approach: each table is seen as a view of the virtual database represented by the Concep- We point out that our method is able to cope with tual Model. several schematic differences that may be present in the sources [SK92]. We illustrate this point with the help of 3.2 Adornment an example. To make the connection to the Conceptual Model precise, Example 3 Suppose that the Conceptual Model contains it is not sufficient to define each table as a relational query a relationship Service with three attributes, Date, D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-5 ServiceNo, and Price, where Service D; S; P ( ) Example 2 (cont.) Assuming that in Source 1 a customer means that at the date D the service S costs P Euro. Sup- is actually identified by its social security number, and a pose that Source S1 represents the same kind of informa- department by its name, we can specify the relational table tion only on Services v1 and v2 , by means of two tables: TABLE1 by the following adorned query: ( ) v1 and v2, where v1 D; P means that service v1 costs P ( ) Italian Lira at date D, and v2 D; P means that service v2 TABLE1 (S; M; P ) : REG-AT1 (X; D); PROMOTION1 (X; D); P = false; costs P Italian Lira at date D. Suppose that Source S2 rep- SSN(X; S ); Name(D; M ) resents the same kind of information only on Services v3 OR and v4 by means of a table Serv, where Serv X; Y; D ( ) PROMOTION1 (X; D); P = true; SSN(X; S ); means that services v3 and v4 cost X and Y Euro respec- Name1 (D; M ) j tively at date D. Finally, suppose that Source S3 represents identify ([S ]; X ); S :: SSNString; the information only for a certain date d by means of an- identify ([M ]; D); M :: DeptNameString; other table Serv3 . The various tables in the three sources P :: Boolean can be specified by means of the following adorned queries: Additionally, we assume that in Source 2 the actual data can be described in terms of a relational table TABLE2 with v1(D; P ) Service(D;0 v10 ; P ) j four columns, two for the customer, one for the service the P :: ItalianLira; D :: Date customer has registered, and one for the department. As v2(D; P ) Service(D;0 v20 ; P ) j in Source 1, in Source 2 departments are still identified P :: ItalianLira; D :: Date by their name, but, differently from Source 1, customers Serv(X; Y; D) Service(D;0 v30 ; X ); are identified by their name and date of birth. Services are Service(D;0 v40 ; Y ) j identified by a unique service number. Hence the following X :: Euro; Y :: Euro; D :: Date adorned query is used to specify TABLE2 : Serv3 (S 1; P ) Service(d; S 1; P ); Code(S; S 1) j TABLE2 (N; B; I ; M ) CONTRACT2 (X; S; D); Name(X; N ); ServNo(S; I ); P :: Euro; identify ([S 1]; S ); S 1 :: String Name(D; M ) j identify ([N; B ]; X ); N :: PersNameString; B :: Date identify ([I ]; S ); I :: ServNoInteger; identify ([M ]; D); M :: DeptNameString The above example illustrates a case where there are various schematic differences, both among the sources, and between the sources and the Conceptual Model. The mech- anisms used in our methodology for specifying adorned 4 Interschema Correspondences queries is able to cope with such differences. We now describe how to define Interschema Correspon- The adorned query associated to a table in a source con- dences, which are used to declaratively specify the cor- tains a lot of information that can be profitably used in an- respondences between data in different schemas (either alyzing the quality of the Data Warehouse design process. source schemas or data warehouse schema). Indeed, the adorned query precisely formalizes the content In our approach, Interschema Correspondences are de- of a source table in terms of a query over the Conceptual fined in terms of relational tables, similarly to the case of Model, the domains of each attribute of the table, and the the relations describing the sources at the logical level. The attributes used to identify entities at the conceptual level. difference with source relations is that we conceive inter- One important check that we can carry out over the logical schema correspondences as non-materialized relational ta- specification of a source is whether the adorned query asso- bles, in the sense that their extension is computed by an ciated with a table in a source is consistent or not. Let Q be associated program whenever it is needed. It follows that, an adorned query and let B be its body. The query B is said to each interschema correspondence, we associate a head, to be inconsistent with respect to the Conceptual Model M, a body, and an adornment. Differently from the case of if for every database DB coherent with M, the evaluation a source relation, the adorment specifies which is the pro- of B with respect to DB is empty. An adorned query Q is gram that is able to compute the extension of the virtual inconsistent with respect to the Conceptual Model M ei- table. ther because the body B of Q is inconsistent with respect We distinguish among three types of correspondences, to M, or because the annotations are incoherent with re- namely Conversion, Matching, and Reconciliation Corre- spect to what specified in M. The inference techniques de- spondences. scribed in [CDGL+ 98d] allow us to check the consistency Conversion Correspondences are used to specify that of the relational tables defined for describing a source. data in one source can be converted into data of a different D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-6 source or of the data warehouse, and how this conversion versions are performed by the programs associated to the is performed. They are used to anticipate several types of Conversion Correspondences. data conflicts that may occur in loading data. Reconciliation Correspondences are used to assert how As an example, suppose that in a table of a source costs we can reconcile data in different sources into data of are represented in Italian Lira, while in a table of the Data the data warehouse. A Reconciliation Correspondence Warehouse we want to express them in Euro. Then, in or- reconcile has the following form: der to use the source table in the rewriting of a query that defines the Data Warehouse table, it is necessary to know reconcile ([~x1 ]; : : : ; [~x ]; [~z]) (x x z w) conj ~ ; : : : ; ~ k ;~ ; ~ (x x z w) k about the possibility of converting each amount in Italian through program ~ 1 ; : : : ; ~ k ;~ ; ~ Lira into an amount in Euro. A Conversion Correspondence convert has the follow- where conj specifies the conditions under which the recon- ing form: ciliation is applicable, and program is a predicate that we assume associated to a program that performs the reconcil- convert ([~x]; [~y]) (x y z) conj ~ ; ~ ;~ iation. Such correspondence specifies that the k tuples of (x y z) through program ~ ; ~ ;~ x x values ~ 1 ; : : : ; ~ k coming from the sources are reconciled where conj is a conjunctive query, which specifies the z to the tuple ~ in the Data Warehouse. Therefore, the as- sociated program receives as input k tuples of values (and conditions under which the conversion is applicable, and possibly the additional parameters in the condition) and re- program is a predicate that we assume associated to a pro- turns a reconciled tuple. gram that performs the conversion. In general, the program Again, a Reconciliation Correspondence could simply needs to take into account the additional parameters spec- be defined as a combination of appropriate Matching and ified in the condition to actually perform the conversion. Conversion Correspondences, e.g., The conversion has a direction. In particular, it operates from a tuple of values satisfying the conditions specified x for ~ in conj to a tuple of values satisfying the conditions ([x] [y] [z]) reconcile ~ ; ~ ; ~ y specified for ~ . This means that the conversion program ([x] [w ]) convert 1 ~ ; ~ 1 ; convert 2 ~ ; ~ 2 ([y] [w ]); x receives as input a tuple ~ , and returns the corresponding ([w ] [w ]) match 1 ~ 1 ; ~ 2 ; convert 3 ~ 1 ; ~ ([w ] [z]); y tuple ~ , possibly using the additional parameter ~ to per- z (x y w w z) conj ~ ; ~ ; ~ 1 ; ~ 2 ;~ through none form the conversion. Matching Correspondences are used to specify how data In practice, several of the Interschema Correspondences in different sources can match. A Matching Correspon- that must be specified will have a very simple form, since dence match has the following form: they will correspond simply to equality in the case of a match ([~x1 ]; : : : ; [~x ]) (x x z) conj ~ 1 ; : : : ; ~ k ;~ matching and to identity in the case of a conversion. There- (x x z) k through program ~ ; : : : ; ~ k ;~ fore, in order to simplify the task of the designer in speci- fying the various interschema correspondences, we assume where conj specifies the conditions under which the match- that several correspondences are automatically asserted by ing is applicable, and program is a predicate that we as- default by the system. In particular, for each domain D sume associated to a program that performs the matching. in the conceptual model, the following Interschema Corre- The program receives as input k tuples of values satisfying spondences are specified by default: the conditions (and possibly the additional parameters in the condition) and returns whether they match or not. convert ([X ]; [Y ]) D ( X ); D ( Y ) Note that already specified Interschema Correspon- ( ) through identity X; Y dences may be used to define new ones. As an example, match ([X ]; [Y ]) D(X ); D(Y ); X = Y the designer may want to define a Matching Correspon- through none dence between two tuples by using two already defined reconcile ([X ]; [Y ]; [Z ]) D(X ); D(Y ); D(Z ); Conversion Correspondences, which convert to a common X=Y through identity (X; Z ) representation, and then by using equality. In this case, he could provide the following definition of the Matching Correspondence: where identity is the program that computes the identity match ([~x]; [~y]) ([x] [z]); convert 2([~y]; [~z]); convert 1 ~ ; ~ function for values of domain D, and the matching corre- (x y z w) conj ~ ; ~ ;~ ; ~ spondence has no associated program. The system allows the designer to inhibit the default through none correspondences for a certain domain, simply by provid- Observe that, in this case, the program associated to the ing an alternative interschema correspondence referring to Matching Correspondence is empty, since the actual con- that domain. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-7 Moreover, we assume that for each Conversion Cor- respondence convert i asserted by the designer, the sys- and the Reconciliation Correspondences tem automatically asserts a new Matching Correspondence match i as follows: reconcile 1;1 ([N; D]; [S1 ]; [S2 ]) reconcile 3;2 ([M1 ]; [M2 ]; [C ]) ([x] [y]) match i ~ ; ~ ([x] [z]); ~y = ~z convert i ~ ; ~ through none Moreover, for each Conversion Correspondence convert i 5 Specification of mediators asserted by the designer and for each Matching Correspon- As we said in the introduction, the problem of data inte- dence match j asserted by the designer or by default, the gration and reconciliation is crucial for the task of design- system automatically asserts a new Reconciliation Corre- ing the mediators that load the data in the Data Warehouse. spondence reconcile i;j as follows: Such a task aims at specifying, for every relation in the Data reconcile i;j ([~x]; [~y]; [~z]) ([x] [y]) match i ~ ; ~ ; Warehouse Schema, how the tuples of the relation should ([x] [z]) convert j ~ ; ~ be constructed from a suitable set of tuples extracted from the sources. through none Suppose we have decided to materialize a new relation Example 2 (cont.) The following Conversion Correspon- T in the Data Warehouse.1 Our goal is to support the de- dence specifies that the name and date of birth of a person signer in providing a formal specification for the design can be converted into a Social Security Number through the of the mediator used to extract the correct data from the program name to SSN: sources, and to load such data in T . The methodology we propose is based on the following steps. convert 1 ([N; B ]; [S ]) PersNameString(N ); Date(B ); DOB(N; B ); 1. We apply the method described in Section 3 to provide SSNString(S ) the specification of the relation T . In other words, we through name to SSN(N; B; S ) specify T in terms of an adorned query Moreover, we add the following Conversion Correspon- q q j c1 ; : : : ; c : 0 n dence, which represents the fact that a department name can be converted into a department code through the pro- Note that the adorned query associated to a table in a gram dept name to code: source is the result of a reverse engineering analysis of the source, whereas in this case the adorned query is a convert 2 ([M ]; [C ]) specification of what we want to materialize in the ta- DeptNameString(M ); DeptCodeInteger(C ) ble of the Data Warehouse. Note also that we express through dept name to code(M; C ) the semantics of T again in terms of the conceptual model. Not only the sources, but also the relations in According to the above rules, the system asserts automat- the Data Warehouse are seen as views of such a con- ically (among others) the Matching Correspondence and ceptual model. Conversion Correspondences 2. We look for a rewriting of q in terms of the queries match 1 ([N; B ]; [S ]) convert 1 ([N; B ]; [S1 ]); S = S1 q1 ; : : : ; qs that correspond to the materialized views in through none the Data Warehouse. If a complete, equivalent rewrit- ing exists, then the new table can be derived from the match 2 ([M ]; [C ]) convert 2 ([M ]; [C1 ]); C = C1 through none existing tables in the Data Warehouse. Otherwise, the algorithm is able to single out the part that cannot be match 3 ([M1 ]; [M2 ]) DeptNameString(M1 ); derived from the Data Warehouse, and that must be DeptNameString(M2 ); loaded from the sources. In the following, q denotes M1 = M2 through none such part. convert 4 ([S1 ]; [S2 ]) SSNString(S1 ); 3. We look for a rewriting of q in terms of the queries cor- SSNString(S2 ) responding to the tables in the sources. The rewriting through identity (S1 ; S2 ) aims at expressing the data in T in terms of a disjunc- convert 5 ([P1 ]; [P2 ]) Boolean(P1 ); Boolean(P2 ) tion of conjunctive queries where each atom refers to through identity (P1 ; P2 )  a table in a source, or convert 6 ([I1 ]; [I2 ]) ServNoInteger(I1 ); ServNoInteger(I2 ) 1 To see how DWQ addresses the issue of deciding what to materialize through identity (I1 ; I2 ) in the Data Warehouse, we refer to [TLS99]. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-8  a matching, conversion, or reconciliation pred- how to merge the various tuples denoted by tuple- icate defined in the Interschema Correspon- spec1 ,. . . ,tuple-specn, and tuple-spect1 ,. . . ,tuple- dences. spectm denote the tuples in T resulting from the merging. In other words, we are trying to reformulate q in terms of the relations in the sources, and possibly in terms We observe that the rewriting algorithm is able to gen- of the matching, conversion, and reconciliation predi- erate one merging clause template for each pair of dis- cates. If there are different rewritings, then we choose juncts that are not disjoint. Starting from such tem- the best rewriting r with respect to suitable quality pa- plates, the designer may either specify the such that rameters. There are several criteria to be taken into and the into parts, depending on the intended seman- consideration when evaluating the quality of a rewrit- tics, or change the templates in order to specify a dif- ing, such as: ferent merging plan (for example for merging three disjuncts, rather than three pairs of disjuncts).  Completeness of the rewriting. Obviously, the best situation is the one where the rewriting is  The algorithm computes the maximally contained complete, in the sense that the rewritten query is rewriting (i.e., every other rewriting is included in the equivalent to the original query. Such a check one computed by the query), but we also want to in- can be done by exploiting the algorithm for form the designer whether such a rewriting is equiv- query containment. alent or not to the original query. Indeed, we have  Accuracy, confidence, freshness, and availability devised an effective method for checking equivalence of data in the source relations that the rewriting between the original query and the computed rewrit- requires to access. ing [CDGL98a]. The resulting query is the specification for the design Besides the relational tables in the sources, our rewrit- of the mediator associated to T . The most critical step  ing algorithm takes into account the matching, con- of the above method is the computation of the rewriting. version, and reconciliation predicates defined in the Our rewriting algorithm is based on the method presented interschema correspondences. in [DL97], modified to take into account the following as- pects:  Even when no rewriting exists for the query (i.e.,  We deal with queries whose atoms refer to a concep- when the maximally contained rewriting is empty), we tual model that includes ISA assertions and a limited want to provide the designer with useful indications form of functional dependencies. Such constraints on whether there is a method for enriching the Inter- have to be considered in the computation of the rewrit- schema Correspondences to get a non-empty rewrit- ing. ing. Indeed, our rewriting algorithm adopts a form of abductive reasoning that enables to single out the  We deal with queries that are disjunctions of conjunc- specification of which matching, conversion and rec- tions. It follows that the rewriting itself is in general onciliation operations would allow to get a non-empty a disjunction, and therefore, we need to deal with the rewriting. This indication can be profitably used by problem of merging the results of several queries. This the designer to check whether she/he can add new In- problem is addressed by the notion of merging clause. terschema Correspondences in order to make the com- In particular, if the query r computed by the rewriting puted rewriting complete. is an or-query (i.e., it is constituted by more than one disjunct), then the algorithm associates to r a suitable set of so-called merging clauses, taking into account Example 2 (cont.) Suppose we want to store in the Data that the answers to the different or-parts of the query Warehouse a relation containing the information about cus- may contain objects and values that represent the same tomers that have a contract for a certain service with a de- real world entity or the same value. A merging clause partment at which they are also registered, or that are eli- is an expression of the form gible for a promotion. Independently from the fact that the customer has a contract, we want to include the information merging tuple-spec1 and    and tuple-specn on whether he is eligible for a promotion. We can make use such that matching-condition of a relational table TDW with four components, defined by into tuple-spect1 and    and tuple-spectm the following adorned query, where we have assumed that in the Data Warehouse we want to identify customers by where tuple-speci denotes a tuple returned by the their SSN, services by their service number, and depart- i-th disjunct of r, matching-condition specifies ments by their code: D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-9 SIGACT SIGMOD SIGART Sym. on Prin- ciples of Database Systems (PODS’98), TDW(S; I ; C; P ) pages 254–265, 1998. CONTRACT0 (X; R; D); PROMOTION1 (X; D); SSN(X; S ); ServNo(R; I ); Code(D; C ); F = true OR [CDGL98a] Diego Calvanese, Giuseppe De Giacomo, and Maurizio Lenzerini. On the decidability CONTRACT0 (X; R; D); REG-AT1 (X; D); :PROMOTION1 (X; D); SSN(X; S ); ServNo(R; I ); of query containment under constraints. In Code(D; C ); P = false Proc. of the 17th ACM SIGACT SIGMOD OR SIGART Sym. on Principles of Database PROMOTION1 (X; D); SSN(X; S ); Code(D; C ); Systems (PODS’98), pages 149–158, 1998. = true; I = NULL j [CDGL+ 98b] Diego Calvanese, Giuseppe De Giacomo, P identify ([S ]; X ); S :: SSNString; identify ([I ]; R); I :: ServNoInteger; Maurizio Lenzerini, Daniele Nardi, and identify ([C ]; D); C :: DeptCodeInteger; Riccardo Rosati. Description logic frame- P :: Boolean work for information integration. In Proc. of the 6th Int. Conf. on the Principles of Using the asserted and automatically derived Inter- Knowledge Representation and Reasoning schema Correspondences, the system is able to rewrite the (KR’98), pages 2–13, 1998. above query in terms of TABLE1 in Source 1 and TABLE2 [CDGL+ 98c] Diego Calvanese, Giuseppe De Giacomo, in Source 2 (see Section 3) as follows: Maurizio Lenzerini, Daniele Nardi, and TDW(S0 ; I0 ; C; P0 ) Riccardo Rosati. Schema and data integra- TABLE1 (S1 ; M1 ; P1 ); TABLE2 (N2 ; B; I2 ; M2 ); tion methodology for dwq. Technical Re- reconcile 1;1 ([N2 ; B ]; [S1 ]; [S0 ]); port DWQ-UNIROMA-004, DWQ Consor- reconcile 3;2 ([M1 ]; [M2 ]; [C ]); tium, September 1998. convert 5 ([P1 ]; [P0 ]); convert 6 ([I2 ]; [I0 ]) OR [CDGL+ 98d] Diego Calvanese, Giuseppe De Giacomo, ^ TABLE1 (S1 ; M1 ; P1 ); S1 = NULL P1 = true; Maurizio Lenzerini, Daniele Nardi, and convert 2 ([M1 ]; [C ]); convert 4 ([S1 ]; [S0 ]); Riccardo Rosati. Source integration in convert 5 ([P1 ]; [P0 ]) data warehousing. In Proc. of the 9th Int. Workshop on Database and Expert Systems In this case the merging clause simply reduces to a disjunc- Applications (DEXA’98), pages 192–197. tion. IEEE Computer Society Press, 1998. [CL93] Tiziana Catarci and Maurizio Lenzerini. 6 Conclusions Representing and using interschema knowl- edge in cooperative information systems. J. We have described a new approach to data integration and of Intelligent and Cooperative Information reconciliation in Data Warehousing. The approach is based Systems, 2(4):375–398, 1993. on the availability of a Conceptual Model of the corpo- rate data, and allows the designer to declaratively specify [DL97] Oliver M. Duschka and Alon Y. Levy. Re- several types of correspondences between data in different cursive plans for information gathering. In sources. Such correspondences are used by a query rewrit- Proc. of the 15th Int. Joint Conf. on Arti- ing algorithm that supports the task of specifying the cor- ficial Intelligence (IJCAI’97), pages 778– rect mediators for the loading of the materialized views of 784, 1997. the Data Warehouse. Based on the described methodology, we are currently [GM95] A. Gupta and I. S. Mumick. Maintenance of implementing a design tool within the DWQ project. The materialized views: Problems, techniques, tool is based on the Concept Base System [Jar92], and pro- and applications. IEEE Bulletin of the vides support for both schema and data integration in Data Technical Committee on Data Engineering, Warehousing. 18(2):3–18, 1995. [GMS94] Cheng Hian Goh, Stuart E. Madnick, and References Michael Siegel. Context Interchange: Over- [AD98] Serge Abiteboul and Oliver Duschka. Com- coming the challenges of large-scale inter- plexity of answering queries using materi- operable database systems in a dynamic en- alized views. In Proc. of the 17th ACM vironment. In Proc. of the 3rd Int. Conf. on D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-10 Information and Knowledge Management [SK92] Amit Sheth and Vipul Kashyap. So far (CIKM’94), pages 337–346, 1994. (schematically) yet so near (semantically). [HGMW+ 95] Joachim Hammer, Hector Garcia-Molina, In Proc. of the IFIP DS-5 Conf. on Seman- tics of Interoperable Database Systems. El- Jennifer Widom, Wilburt Labio, and Yue sevier Science Publishers (North-Holland), Zhuge. The Stanford data warehousing Amsterdam, 1992. project. IEEE Bulletin of the Technical Committee on Data Engineering, 18(2):41– [TLS99] Dimitri Theodoratos, Spyros Ligoudis- 48, 1995. tianos, and Timos Sellis. Designing the global Data Warehouse with SPJ views. In [Hul97] Richard Hull. Managing semantic hetero- Proc. of the 11th Conf. on Advanced Infor- geneity in databases: A theoretical perspec- mation Systems Engineering (CAiSE’99), tive. In Proc. of the 16th ACM SIGACT 1999. SIGMOD SIGART Sym. on Principles of Database Systems (PODS’97), 1997. [Ull97] Jeffrey D. Ullman. Information inte- [HZ96] Richard Hull and Gang Zhou. A frame- gration using logical views. In Proc. work for supporting data integration us- of the 6th Int. Conf. on Database The- ing the materialized and virtual approaches. ory (ICDT’97), volume 1186 of Lecture In Proc. of the ACM SIGMOD Int. Conf. Notes in Computer Science, pages 19–40. on Management of Data, pages 481–492, Springer-Verlag, 1997. 1996. [Wid95] Jennifer Widom. Special issue on materi- [Inm96] W. H. Inmon. Building the Data Ware- alized views and data warehousing. IEEE house. John Wiley & Sons, second edition, Bulletin on Data Engineering, 18(2), 1995. 1996. [Wie92] Gio Wiederhold. Mediators in the architec- [Jar92] M. Jarke. Conceptbase V3.1 user man- ture of future information systems. IEEE ual. Technical Report 92–17, Aach- Computer, 25(3):38–49, 1992. ener Informatik-Berichte, Aachen, Ger- [ZHK96] Gang Zhou, Richard Hull, and Roger King. many, 1992. Generating data integration mediators that [JJQV98] Matthias Jarke, Manfred A. Jeusfeld, use materializations. J. of Intelligent Infor- Christoph Quix, and Panos Vassiliadis. mation Systems, 6:199–221, 1996. Architecture and quality in data ware- houses. In Proc. of the 10th Conf. on [ZHKF95] Gang Zhou, Richard Hull, Roger King, and Jean-Claude Franchitti. Using object Advanced Information Systems Engineer- ing (CAiSE’98), volume 1413 of Lecture matching and materialization to integrate Notes in Computer Science, pages 93–113. heterogeneous databases. In Proc. of the Springer-Verlag, 1998. 3rd Int. Conf. on Cooperative Information Systems (CoopIS’95), pages 4–18, 1995. [JLVV99] Matthias Jarke, Maurizio Lenzerini, Yan- nis Vassiliou, and Panos Vassiliadis. Fun- damentals of Data Warehouses. Springer- Verlag, 1999. In Press. [LMSS95] Alon Y. Levy, Alberto O. Mendelzon, Yehoshua Sagiv, and Divesh Srivastava. Answering queries using views. In Proc. of the 14th ACM SIGACT SIGMOD SIGART Sym. on Principles of Database Systems (PODS’95), pages 95–104, 1995. [PGMW95] Yannis Papakonstantinou, Hector Garcia- Molina, and Jennifer Widom. Object ex- change across heterogeneous information sources. In Proc. of the 11th IEEE Int. Conf. on Data Engineering (ICDE’95), pages 251–260, 1995. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, R. Rosati 16-11