GDE: General Data Exchange with Schema and
           Data Level Mappings

                               Rana Awada and Iluju Kiringa

                               University of Ottawa, SITE
                       rawad049@uottawa.ca kiringa@site.uottawa.ca


1     Introduction
Data exchange (DE) [5, 3] and data coordination [1, 2, 6] are two important set-
tings that were introduced previously in the literature to resolve the problem
of integrating information that resides in different sources. A DE setting moves
data residing in independent applications, which refer to the same object using
the same name, and accesses it through a new target schema. However, a data
coordination setting allows the access of data residing in independent sources
and that possibly belong to different sets of vocabularies, without necessarily
exchanging it and while maintaining autonomy.
    Although a data coordination setting provides users with an amalgamated
view of related information, this solution is not enough for applications that
require a view of related information using a unified set of vocabularies for
periodic reporting and decision making. We introduce a general data exchange
(GDE) setting that extends DE settings to allow collaboration at the instance
level, using a mapping table M , that specifies for each constant value in the
source, the set of related (or corresponding) constant values in the target.1
    We show in this paper that a GDE setting can be formalized using the knowl-
edge exchange framework introduced in [4]. It allows us to store a target knowl-
edge base (KB) which consists of a subset of the explicit data exchanged that
is necessary to infer the full set of exchanged information using a set Σt of FO
sentences. We identify in our work the class of “best” KBs to materialize and
we define the set of certain answers.

2     Preliminaries
A (DE) setting [5, 3] is a tuple S = (S, T, Σst ), where S is a source schema, T
is a target schema, S and T do not have predicate symbols in common, and Σst
consists of a set of source-to-target tuple-generating dependencies (st-tgds) that
establish the relationship between source and target schemas. A st-tgd is a FO-
sentence of the form: ∀x̄∀ȳ (φ(x̄, ȳ) → ∃z̄ ψ(x̄, z̄)), where φ(x̄, ȳ) and ψ(x̄, z̄) are
conjunctions of relational atoms over S and T respectively. Let Const and Var be
infinite and disjoint sets of constants and nulls, respectively. We consider in our
1
    We consider in this work a particular interpretation of related data in a mapping table; that is,
    a source element is always uniquely identified by at least one target element.
work “complete” source instances I of S, where it holds that dom(I) ⊆ Const
and do not contain missing data in the form of nulls. However, a target instance
J of T, is allowed to contain null values, and it holds that dom(J) ⊆ Const ∪ Var;
    A knowledge base [4] over a schema R is a pair (K, Σ), where K is an instance
of R (the explicit data) and Σ is a set of logical sentences over R (the implicit
data). The set of models of (K, Σ), denoted by Mod(K, Σ), is defined as the set
of instances of R that contain the explicit data in K and the implicit data in Σ;
that is, Mod(K, Σ) corresponds to the set {K 0 | K 0 is an instance of R, K ⊆ K 0
and K 0 |= Σ }. From now on, KR0 denotes the restriction of instance K to a
subset R0 of its schema R.
    Mapping tables [6] are mechanisms that establish how values from different
domains correspond. In its simplest form, given two domains D1 and D2 , not
necessarily disjoint, a mapping table over (D1 , D2 ) is a subset of D1 × D2 . Let
ConstS and ConstT be the sets of source and target constants respectively. We
consider in our work mapping tables with the following property: for each value
a ∈ ConstS ∩ dom(M ), there exists at least a single target value a0 ∈ ConstT ∩
dom(M ) such that M (a, a0 ) holds, and there does not exist a source value b ∈
ConstS ∩ dom(M ), where b is different than a and M (b, a0 ) holds. We say a0
uniquely identifies a in M . We define C as the set of values in dom(M ) ∩ ConstT
that uniquely identify source values mapped in M .

3    GDE a Knowledge Exchange System
A GDE setting S = (S, T, M, Σst ) extends a DE setting with (1) a binary
relation symbol M that appears neither in S nor in T, and that is called a
source-to-target mapping; and (2) Σst that consists of a set of mapping st-tgds,
which are FO sentences of the form: ∀x̄∀ȳ∀z̄ (φ(x̄, ȳ) ∧ µ(x̄, z̄) → ∃w̄ ψ(z̄, w̄)),
where (a) φ(x̄, ȳ) and ψ(z̄, w̄) are conjunctions of relation symbols over S and T
respectively, and (b) µ(x̄, z̄) is a conjunction of relation symbols that only use
the st-mapping relation symbol M. We denote st-mapping tables by M .
    In a GDE setting, source KBs are of the form ((I ∪ {M }), Σs = ∅), which
correspond to data in the source instance I and the st-mapping table M . On
the other hand, the target KBs are of the form ((J ∪ {M }), Σt ) where Σt is a
set of FO sentences, of type f ull tgds (which are tgds that do not use existen-
tial quantication). We formalize the notion of a (universal) GDE KB-solution,
extending the notion of knowledge exchange (universal) solution in [4] to allow
coordinating the source and target information provided by M , as follows:
 1. J is a GDE KB-solution for I and M under S, if for every K ∈ Mod((J ∪
    {M }), Σt ) there is K 0 ∈ Mod((I ∪ {M }), Σs = ∅)) such that the following
                 0
    hold: (a) KM   ⊆ KM , and (b) ((KS0 ∪ KM0
                                               ), KT )  Σst .
 2. Also, J is a universal GDE KB-solution (UGDE) for I and M under S, if J is
    a GDE KB-solution, and for every K 0 ∈ Mod((I ∪{M }), Σs = ∅) there is K ∈
                                              0
    Mod((J ∪{M }, Σt ) such that (a) KM ⊆ KM     , and (b) ((KS0 ∪KM
                                                                   0
                                                                     ), KT )  Σst .
   Intuitively, in a GDE setting S, C is the sole set of target values that can
capture correctly the set of source values exchanged to a target instance. There-
fore, intuitively a GDE KB-solution J in S has a domain dom(J) ⊆ C ∪ Var.
We define Σt as the following set of full tgds over a schema T ∪ {M, Related},
where Related is a fresh binary table:
 1. For each T ∈ T ∪ {M} of arity n and 1 ≤ i ≤ n:
    ∀x1 · · · ∀xn (T (x1 , . . . , xi , . . . , xn ) → Related(xi , xi )).
 2. ∀x∀y∀z(M(z, x) ∧ M(z, y) ∧ C(x) → Related(x, y)).
 3. For each T ∈ T of arity n:                        Vn
    ∀x1 , y1 · · · ∀xn , yn (T (x1 , . . . , xn ) ∧ i=1 Related(xi , yi ) → T (y1 , . . . , yn )).
    In a GDE setting, we define “best” solutions formally following [4] as: Let
S be a GDE setting, I be a source instance, M an st-mapping table, and J a
UGDE solution for I and M under S. Then J is a minimal UGDE solution, if
(1) there is no proper subset J 0 of J such that J 0 is a UGDE solution for I and
M under S, and (2) there is no UGDE solution J 0 such that dom(J 0 ) ∩ ConstT
is properly contained in dom(J) ∩ ConstT . Also, given a fixed GDE setting,
generating UGDE solutions and minimal UGDE solutions is in Logspace.

4    Query Answering
We adapt the notion of a certain answer in the usual DE setting to the GDE
setting. Formally, let S be a GDE setting, I a source instance, M an st-mapping
table, and Q a conjunctive query over T. The set of certain answers of Q over I
and M and under S, denoted certainS ((I ∪ {M }), Q), corresponds to the set of
tuples of constants that belong to the evaluation of Q over KT , for each GDE
KB-solution J for I and M and K ∈ Mod((J ∪ {M}), Σt ). Finally, generating
certainS ((I ∪ {M }), Q) is in Logspace.

5    Future Work
An interesting extension for this work would be defining a GDE setting with
a target that contains egds and tgds constraints. Also, investigating GDE in a
peer-to-peer setting might add interesting challenges to the problem.

References
1. Lawrence, M., Pottinger, R., Staub-French, S.: Data Coordination: Supporting Con-
   tingent Updates. In: VLDB (2011)
2. Philip A. Bernstein , Fausto Giunchiglia , Anastasios Kementsietsidis , John My-
   lopoulos , Luciano Serafini , llya Zaihrayeu : Data Management for Peer-to-Peer
   Computing: A Vision. pp. 89–94 (2002)
3. Arenas, M., Barceló, P. , Libkin, L.,Murlak, F.: Relational and XML Data Exchange.
   Morgan & Claypool Publishers, (2010).
4. Arenas, M. , Perez, J. ,Reutter, J.: Data exchange beyond complete data. In : PODS,
   pp. 83–94 (2011)
5. Fagin,R., Kolaitis, P. G. , Miller,R. J. , Popa, L.: Data exchange: semantics and
   query answering. In: Theoretical Computer Science, pp. 89–124 (2005)
   31(4), pp. 761–791 (1984).
6. Kementsietsidis, A., Arenas, M., Miller, R. J. : Mapping data in peer-to-peer sys-
   tems: Semantics and algorithmic issues. In: SIGMOD, pp. 325–336 (2003).