GDE: General Data Exchange with Schema and Data Level Mappings Rana Awada and Iluju Kiringa University of Ottawa, SITE rawad049@uottawa.ca kiringa@site.uottawa.ca 1 Introduction Data exchange (DE) [5, 3] and data coordination [1, 2, 6] are two important set- tings that were introduced previously in the literature to resolve the problem of integrating information that resides in different sources. A DE setting moves data residing in independent applications, which refer to the same object using the same name, and accesses it through a new target schema. However, a data coordination setting allows the access of data residing in independent sources and that possibly belong to different sets of vocabularies, without necessarily exchanging it and while maintaining autonomy. Although a data coordination setting provides users with an amalgamated view of related information, this solution is not enough for applications that require a view of related information using a unified set of vocabularies for periodic reporting and decision making. We introduce a general data exchange (GDE) setting that extends DE settings to allow collaboration at the instance level, using a mapping table M , that specifies for each constant value in the source, the set of related (or corresponding) constant values in the target.1 We show in this paper that a GDE setting can be formalized using the knowl- edge exchange framework introduced in [4]. It allows us to store a target knowl- edge base (KB) which consists of a subset of the explicit data exchanged that is necessary to infer the full set of exchanged information using a set Σt of FO sentences. We identify in our work the class of “best” KBs to materialize and we define the set of certain answers. 2 Preliminaries A (DE) setting [5, 3] is a tuple S = (S, T, Σst ), where S is a source schema, T is a target schema, S and T do not have predicate symbols in common, and Σst consists of a set of source-to-target tuple-generating dependencies (st-tgds) that establish the relationship between source and target schemas. A st-tgd is a FO- sentence of the form: ∀x̄∀ȳ (φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄)), where φ(x̄, ȳ) and ψ(x̄, z̄) are conjunctions of relational atoms over S and T respectively. Let Const and Var be infinite and disjoint sets of constants and nulls, respectively. We consider in our 1 We consider in this work a particular interpretation of related data in a mapping table; that is, a source element is always uniquely identified by at least one target element. work “complete” source instances I of S, where it holds that dom(I) ⊆ Const and do not contain missing data in the form of nulls. However, a target instance J of T, is allowed to contain null values, and it holds that dom(J) ⊆ Const ∪ Var; A knowledge base [4] over a schema R is a pair (K, Σ), where K is an instance of R (the explicit data) and Σ is a set of logical sentences over R (the implicit data). The set of models of (K, Σ), denoted by Mod(K, Σ), is defined as the set of instances of R that contain the explicit data in K and the implicit data in Σ; that is, Mod(K, Σ) corresponds to the set {K 0 | K 0 is an instance of R, K ⊆ K 0 and K 0 |= Σ }. From now on, KR0 denotes the restriction of instance K to a subset R0 of its schema R. Mapping tables [6] are mechanisms that establish how values from different domains correspond. In its simplest form, given two domains D1 and D2 , not necessarily disjoint, a mapping table over (D1 , D2 ) is a subset of D1 × D2 . Let ConstS and ConstT be the sets of source and target constants respectively. We consider in our work mapping tables with the following property: for each value a ∈ ConstS ∩ dom(M ), there exists at least a single target value a0 ∈ ConstT ∩ dom(M ) such that M (a, a0 ) holds, and there does not exist a source value b ∈ ConstS ∩ dom(M ), where b is different than a and M (b, a0 ) holds. We say a0 uniquely identifies a in M . We define C as the set of values in dom(M ) ∩ ConstT that uniquely identify source values mapped in M . 3 GDE a Knowledge Exchange System A GDE setting S = (S, T, M, Σst ) extends a DE setting with (1) a binary relation symbol M that appears neither in S nor in T, and that is called a source-to-target mapping; and (2) Σst that consists of a set of mapping st-tgds, which are FO sentences of the form: ∀x̄∀ȳ∀z̄ (φ(x̄, ȳ) ∧ µ(x̄, z̄) → ∃w̄ ψ(z̄, w̄)), where (a) φ(x̄, ȳ) and ψ(z̄, w̄) are conjunctions of relation symbols over S and T respectively, and (b) µ(x̄, z̄) is a conjunction of relation symbols that only use the st-mapping relation symbol M. We denote st-mapping tables by M . In a GDE setting, source KBs are of the form ((I ∪ {M }), Σs = ∅), which correspond to data in the source instance I and the st-mapping table M . On the other hand, the target KBs are of the form ((J ∪ {M }), Σt ) where Σt is a set of FO sentences, of type f ull tgds (which are tgds that do not use existen- tial quantication). We formalize the notion of a (universal) GDE KB-solution, extending the notion of knowledge exchange (universal) solution in [4] to allow coordinating the source and target information provided by M , as follows: 1. J is a GDE KB-solution for I and M under S, if for every K ∈ Mod((J ∪ {M }), Σt ) there is K 0 ∈ Mod((I ∪ {M }), Σs = ∅)) such that the following 0 hold: (a) KM ⊆ KM , and (b) ((KS0 ∪ KM0 ), KT )  Σst . 2. Also, J is a universal GDE KB-solution (UGDE) for I and M under S, if J is a GDE KB-solution, and for every K 0 ∈ Mod((I ∪{M }), Σs = ∅) there is K ∈ 0 Mod((J ∪{M }, Σt ) such that (a) KM ⊆ KM , and (b) ((KS0 ∪KM 0 ), KT )  Σst . Intuitively, in a GDE setting S, C is the sole set of target values that can capture correctly the set of source values exchanged to a target instance. There- fore, intuitively a GDE KB-solution J in S has a domain dom(J) ⊆ C ∪ Var. We define Σt as the following set of full tgds over a schema T ∪ {M, Related}, where Related is a fresh binary table: 1. For each T ∈ T ∪ {M} of arity n and 1 ≤ i ≤ n: ∀x1 · · · ∀xn (T (x1 , . . . , xi , . . . , xn ) → Related(xi , xi )). 2. ∀x∀y∀z(M(z, x) ∧ M(z, y) ∧ C(x) → Related(x, y)). 3. For each T ∈ T of arity n: Vn ∀x1 , y1 · · · ∀xn , yn (T (x1 , . . . , xn ) ∧ i=1 Related(xi , yi ) → T (y1 , . . . , yn )). In a GDE setting, we define “best” solutions formally following [4] as: Let S be a GDE setting, I be a source instance, M an st-mapping table, and J a UGDE solution for I and M under S. Then J is a minimal UGDE solution, if (1) there is no proper subset J 0 of J such that J 0 is a UGDE solution for I and M under S, and (2) there is no UGDE solution J 0 such that dom(J 0 ) ∩ ConstT is properly contained in dom(J) ∩ ConstT . Also, given a fixed GDE setting, generating UGDE solutions and minimal UGDE solutions is in Logspace. 4 Query Answering We adapt the notion of a certain answer in the usual DE setting to the GDE setting. Formally, let S be a GDE setting, I a source instance, M an st-mapping table, and Q a conjunctive query over T. The set of certain answers of Q over I and M and under S, denoted certainS ((I ∪ {M }), Q), corresponds to the set of tuples of constants that belong to the evaluation of Q over KT , for each GDE KB-solution J for I and M and K ∈ Mod((J ∪ {M}), Σt ). Finally, generating certainS ((I ∪ {M }), Q) is in Logspace. 5 Future Work An interesting extension for this work would be defining a GDE setting with a target that contains egds and tgds constraints. Also, investigating GDE in a peer-to-peer setting might add interesting challenges to the problem. References 1. Lawrence, M., Pottinger, R., Staub-French, S.: Data Coordination: Supporting Con- tingent Updates. In: VLDB (2011) 2. Philip A. Bernstein , Fausto Giunchiglia , Anastasios Kementsietsidis , John My- lopoulos , Luciano Serafini , llya Zaihrayeu : Data Management for Peer-to-Peer Computing: A Vision. pp. 89–94 (2002) 3. Arenas, M., Barceló, P. , Libkin, L.,Murlak, F.: Relational and XML Data Exchange. Morgan & Claypool Publishers, (2010). 4. Arenas, M. , Perez, J. ,Reutter, J.: Data exchange beyond complete data. In : PODS, pp. 83–94 (2011) 5. Fagin,R., Kolaitis, P. G. , Miller,R. J. , Popa, L.: Data exchange: semantics and query answering. In: Theoretical Computer Science, pp. 89–124 (2005) 31(4), pp. 761–791 (1984). 6. Kementsietsidis, A., Arenas, M., Miller, R. J. : Mapping data in peer-to-peer sys- tems: Semantics and algorithmic issues. In: SIGMOD, pp. 325–336 (2003).