Expressiveness and Complexity of Bidirectional Constraints for Data Exchange Marcelo Arenas1 , Gabriel Diéguez1 , and Jorge Pérez2 1 Department of Computer Science, Pontificia Universidad Católica de Chile 2 Department of Computer Science, Universidad de Chile 1 Introduction A schema mapping is a high-level specification that describes how data from a source schema is to be mapped to a target schema. Schema mappings are of fundamental importance in data management today. In particular, they have proved to be the essential building block for several data-interoperability tasks such as data exchange [8], data integration [11] and peer data management [7]. Most of the research on schema mappings has focused on mappings specified by the so called source-to-target tuple-generating-dependencies (st-tgds), motivated by the highly influential formalization of data exchange proposed by Fagin et al. in 2003. Although natural and simple to specify, mappings given by st-tgds fail to impose enough conditions to unambiguously define what are the instances that should be materialized when exchanging data. This raises several issues, among others, what is a good semantics for data exchange with st-tgds, how to choose a good target instance when exchanging data, and how to answer target queries. For instance, a very successful line of research has been the definition of a closed world semantics for st-tgds. Although this new semantics departs from a classical First Order Logic (FO) semantics, it has proved to have good properties in terms of materialization of target instances [12]. Other lines of research include the study of alternative notions for answering target queries, in particular, non-monotone queries [10] and aggregate queries [1]. We argue that one can follow a different approach and, instead of using ad-hoc solutions for each of the issues mentioned above, one can use a mapping-specification language that imposes enough constraints over possible target instances in order to minimize the uncertainty when exchanging data. In that way one can use standard FO notions to define the semantics of mappings, and to define the possible target solutions as well as the process of answering target queries. We propose to use what we call bidirectional constraints to specify mappings. These specifications impose at the same time constraints over the source and target instances participating in a mapping, and have the potential to minimize the ambiguity in the description of the target instances that should be materialized when exchanging  data. Bidirectional constraints are formulas of the form ∀x̄ ϕ(x̄) ↔ ψ(x̄) where ϕ(x̄) is a formula over the source schema, and ψ(x̄) is a formula over the target schema. One can obtain several different languages of bidirectional constraints depending on the formulas allowed in the source and target parts. Although natural and useful in many scenarios, mappings given by bidirectional constraints have been almost disregarded in the literature on the foundations of schema mappings and data exchange. In this paper we present our ongoing work on the study of schema mappings defined by bidirectional constraints. We show that given a mapping specified by st-tgds without existential quantifiers, one can construct a set of bidirectional constraints that exactly defines the canonical universal solutions of the initial mapping. Canonical universal solutions are considered as the preferred solutions when exchanging data with st-tgds [8]. This result presents evidence in favor of our hypothesis that bidirectional constraints can be useful to unambiguously define exchange scenarios. We prove that in order to achieve this result, disjunctions and equalities are strictly necessary in the definition of bidirectional constraints. Moreover, we also study the existence-of-solutions problem for bidirectional constraints, showing that the complexity depends on the use of existential quantification in the target formulas: the problem can be solved in polynomial time if no existential quantification is allowed, while it may be intractable in the general setting. The rest of the paper is organized as follows. Section 2 presents some notation and preliminary notions. Section 3 shows how the canonical-universal-solution semantics of st-tgds can be specified with bidirectional constraints. In Section 4 we study some complexity issues regarding bidirectional constraints. Section 5 presents some concluding remarks. For the sake of space all the proofs are in the extended version available online at http://db.ing.puc.cl/˜gdieguez/papers/amw2014-ext.pdf. 2 Preliminaries We assume some familiarity with database theory and data exchange [8]. We also assume that data is represented in the relational model. A relational schema, or just schema, is a finite set {R1 , . . . , Rn } of relation symbols, each relation having a fixed arity. Given a schema R, we denote by Inst (R) the set of all instances of R. Schema mappings Schema mappings are used to define a semantic relationship between two schemas. In this paper, we use a general definition of a schema mapping; given two schemas with no relation symbol in common, S and T, a schema mapping (or just a mapping) M between S and T is a set of pairs (I, J), where I is an instance of S, and J is an instance of T. That is, a mapping M is just a subset of Inst (S) × Inst (T). Given an instance I of S, a mapping M associates to I a set of possible solutions for I, denoted by S OLM (I), given by the set S OLM (I) = {J ∈ Inst (T) | (I, J) ∈ M}. In practice, schema mappings are represented by using logical formulas. In this paper we focus on using fragments of first-order logic (FO) to specify mappings (we assume some familiarity with FO). Given two disjoint relational schemas S and T, and a set Σ of FO sentences over vocabulary S ∪ T, we say that a mapping M is specified by Σ if for every pair of instances (I, J) ∈ Inst (S) × Inst (T) it holds that (I, J) ∈ M if and only if (I, J) |= Φ for every Φ ∈ Σ, where |= denotes the standard satisfaction of FO formulas. Besides FO, the main query languages that we consider in this paper are the languages of conjunctive queries (CQ), unions of CQ (UCQ), and the languages obtained from them by adding the equality predicate (CQ= , UCQ= and FO= ). 2 Schema mappings, bidirectional constraints, and source-to-target dependencies Assume that we have two disjoint schemas S and T. In this paper, we study mappings specified by sets of formulas of the following form  ∀x̄ ϕ(x̄) ↔ ψ(x̄) , (1) where ϕ(x̄) is an FO formula over S and ψ(x̄) is an FO formula over T, both formulas with x̄ as tuple of free variables. We call this formula a bidirectional constraint. We usually drop the outermost universal quantification when specifying these constraints, and thus we only write ϕ(x̄) ↔ ψ(x̄) for formula (1). Depending on which fragments of FO we use to define formulas ϕ(x̄) and ψ(x̄), we obtain a wide range of possible fragments of bidirectional constraints. Given fragments L1 and L2 of FO, we say that a sentence Φ is an hL1 , L2 i formula between S and T, if Φ is a bidirectional constraint of the form (1) in which ϕ(x̄) is an L1 formula over S, and ψ(x̄) is an L2 formula over T. When the source and target schemas are clear from the context we will only talk about hL1 , L2 i formulas. For example, consider schemas S = {Mother(·, ·), Father(·, ·)} and T = {Parent(·, ·)}. Then the following sentence ( Father(x, y) ∨ Mother(x, y) ) ↔ Parent(x, y) (2) is an example of a hUCQ, CQi formula. An hL, CQi formula in which the target part is a CQ without existential quantifiers is called a full hL, CQi formula. In this paper we also consider the language of source-to-target dependencies, which are formulas of the form ∀x̄(ϕ(x̄) → ψ(x̄)) where ϕ(x̄) and ψ(x̄) are formulas over the source and the target schema, respectively (we also usually drop the universal quantification in this case). An L1 - TO -L2 dependency is a formula of the above form in which ϕ(x̄) is in L1 and ψ(x̄) is in L2 . Similarly, as for the case of bidirectional constraints, we call full L- TO -CQ dependencies to formulas in which the target part is a CQ without existential quantifiers. We usually refer to CQ- TO -CQ dependencies as source-to-target tuple-generating dependencies (st-tgds). A mapping defined by source- to-target dependencies is called an st-mapping. Given a mapping M specified by FO- TO -CQ dependencies and a source instance I, one can define a particular class of solutions in S OLM (I) called universal solutions. These solutions are the most general among all the possible solutions for I under M [8]. Moreover, a particular class of universal solutions, called canonical universal solutions, can be generated (in polynomial time) by means of the classical chase procedure. For the sake of space, we refer the reader to [8] for precise definitions of these notions. We call US OLM (I) and CUSM (I), the set of universal solutions and canonical universal solutions for I under M, respectively. In general, for source-to-target mappings, we have that CUSM (I) ( US OLM (I) ( S OLM (I). 3 Characterization of semantics for data exchange One of our main goals in our ongoing project is to study the relationship between bidirectional constraints and the different semantics for data exchange. Let M be an st- mapping and I a source instance. The standard FO semantics for data exchange defines S OLM (I) as the set of all possible target instances considered when materializing 3 data or answering queries after the exchange [8]. Given that S OLM (I) can have many (potentially infinite) non desired solutions, several works have proposed to consider only solutions in US OLM (I) or CUSM (I) (or even other more ad-hoc sets of solutions) for exchanging or answering queries [12,1,10]. A natural question is if one can obtain these alternative semantics by considering the standard FO semantics but for alternative mappings. More precisely, given an st-mapping M, is there another mapping M0 such that either US OLM (I) = S OLM0 (I) or CUSM (I) = S OLM0 (I), for every possible source instance I? In this section we present a positive result for the above mentioned scenario using bidirectional constraints. In particular, we show that for an st-mapping M specified by full FO- TO -CQ dependencies, one can always construct a mapping M↔ specified by bidirectional constraints such that CUSM (I) = S OLM↔ (I). The bidirectional constraints used are hUCQ= , CQi plus formulas of the form ∀x̄(⊥ ↔ P (x̄)), where ⊥ is an arbitrary contradiction. Notice that these last formulas can be specified in hFO, CQi but we have decided to leave them separated as they are used only to state that some target predicates must remain empty in the solutions. Algorithm 1 presents this procedure. In the algorithm we use a procedure Q UERY R EWRITING ATOM that given a mapping M = (S, T, Σ) specified by st-tgds and a target atom R(x̄), constructs a formula α(x̄) in UCQ= such that for every tuple ā the following holds: I |= α(ā) if and only if for every solution J ∈ S OLM (I) it holds that J |= R(ā). That is, α(x̄) is a rewriting of R(x̄) over the source [4].It was shown in [4] that Q UERY R EWRITING ATOM can be implemented in quadratic time w.r.t. the size of Σ. We also consider a restriction in the input of the algorithm. We assume that the set of dependencies given as input satisfies that each dependency in the set has a single atom in its right-hand side, since it is well known that every set of full FO- TO -CQ dependencies is equivalent to a set that satisfies this restriction. Algorithm 1 B IDIRECTIONAL M APPING F ULL (M) Input: an st-mapping M = (S, T, Σ), where Σ is a set of full FO- TO -CQ dependencies, each dependency with a single atom in its right-hand side. Output: a mapping M↔ = (S, T, ∆), where ∆ is a set of full hUCQ= , CQi dependencies plus dependencies of the form ∀x̄(⊥ ↔ P (x̄)). 1: Start with ∆ as the empty set. 2: for all dependencies in Σ of the form ϕ(x̄) → R(x̄) do 3: Assume that n is the arity of predicate R and let ȳ be an n-tuple of distinct fresh variables. 4: Use Q UERY R EWRITING ATOM(M, R(ȳ)) to compute a formula α(ȳ) in UCQ= that is a rewriting of R(ȳ) over the source. 5: Add dependency α(ȳ) ↔ R(ȳ) to ∆. 6: end for 7: for all target relation P that is not mentioned in any dependency in Σ do 8: Add dependency ⊥ ↔ P (ȳ) to ∆, where ȳ is a tuple of distinct fresh variables of the same arity as P . 9: end for 10: return M↔ = (S, T, ∆) 4 Theorem 1. Given an st-mapping M = (S, T, Σ) with Σ a set of full FO- TO -CQ de- pendencies with a single atom in its target side, B IDIRECTIONAL M APPING F ULL(M) computes in time O(kΣk2 ) a mapping M↔ = (S, T, ∆), where ∆ is a set of full hUCQ= , CQi dependencies plus formulas of the form ∀x̄(⊥ ↔ P (x̄)), such that S OLM↔ (I) = CUSM (I) for every source instance I. Algorithm 1 generates a mapping specified by hUCQ= , CQi dependencies. An important question is whether all the features used in this fragment are needed in the output of the algorithm. The following result shows that the use of equality in the left-hand side of the dependencies is needed for Theorem 1 to hold, even if the full power of FO is allowed in the source side and disjunctions and equalities are allowed in the target side. Theorem 2. There exists an st-mapping M = (S, T, Σ), where Σ is a set of full st- tgds, such that there is no mapping M0 = (S, T, ∆), with ∆ a set of hFO, UCQ= i dependencies, such that S OLM0 (I) = CUSM (I) for every source instance I. Finally the following result shows that disjunctions in the left-hand side are also necessary in the output of algorithm B IDIRECTIONAL M APPING F ULL. Theorem 3. There exists an st-mapping M = (S, T, Σ), where Σ is a set of full st-tgds, such that there is no mapping M0 = (S, T, ∆), with ∆ a set of hCQ= , FOi dependencies, such that S OLM0 (I) = CUSM (I) for every source instance I. 4 Complexity Given that the language of hUCQ= , CQi dependencies has an interesting expressive power for data exchange, it is worth studying some of its fundamental properties. In this section we study the complexity of the language; in particular, the complexity of the existence-of-solutions problem which is defined as follows. Given a fixed mapping M specified by hUCQ= , CQi dependencies, the input of the problem is a source instance I, and the question is whether there exists an instance J such that J ∈ S OLM (I) (that is, whether S OLM (I) 6= ∅). Notice that this problem is trivial for the case of st-tgds, but it becomes interesting for the case of bidirectional constraints. For example, consider a mapping M specified by the set {A(x) ↔ P (x), B(x) ↔ P (x)}, then for the instance I = {A(1)} we have that S OLM (I) = ∅. Our first result establishes the complexity of the existence-of-solutions problem in the general case. In particular, we show that the problem may be intractable. Theorem 4. 1. For every mapping M = (S, T, ∆), where ∆ is a set of hUCQ= , CQi dependencies, the existence-of-solutions problem is in NP. 2. There exists a mapping M = (S, T, ∆), where ∆ is a set of hUCQ= , CQi depen- dencies, for which the existence-of-solutions problem is NP-complete. Moreover, this even holds if ∆ is a set of hCQ, CQi dependencies. In contrast, if the dependencies do not have existential quantification in the target side, then the existence-of-solutions problem can be efficiently solved. Theorem 5. For every mapping M = (S, T, ∆), where ∆ is a set of full hUCQ= , CQi dependencies, the existence-of-solutions problem is solvable in polynomial time. 5 5 Concluding remarks In this paper we have started the formal study of bidirectional constraints, in particular, their expressive power in terms of their capacity for defining some alternative semantics for data exchange with st-tgds, and the complexity of the existence-of-solutions problem. There has been some related work driven from practical motivations, most notably [13] and [5], where bidirectional constraints based on project-select relational algebra ex- pressions are considered in an object-to-relational mapping system. Nevertheless, to the best of our knowledge this paper presents the first theoretical results on the fundamental properties of bidirectional constraints. Besides the motivation that comes from defining more strictly what are the possible target instances in data exchange, mappings specified by bidirectional constraints have several other applications. Most notably, these specifications are expressive enough to define how source constraints should be transformed into target constraints. If one has a source schema with, for example, key constraints, and creates a new schema and a data exchange setting to migrate the data, it is natural to expect the source key constraints to be somehow reflected in the new schema. This issue is closely related with the recently raised topic of exchanging (or transforming) knowledge bases [2]. Mappings specified by st-tgds are not expressive enough to describe these type of transformations. We argue that bidirectional constraints are a good formalism to study these complex transformation scenarios, and it would be very interesting to explore these possibilities in future work. Another motivation comes from schema mapping management, where the inputs for schema mapping operators such as the merge, extract and diff operators are naturally defined using bidirectional mappings [6,3]. This is also part of our future work. References 1. F. Afrati and P. Kolaitis. Answering aggregate queries in data exchange. PODS, 129-138, 2008. 2. M. Arenas, J. Pérez, and J. Reutter. Data exchange beyond complete data. JACM, 60(4), 2013. 3. M. Arenas, J. Pérez, J. L. Reutter, and C. Riveros. Foundations of schema mapping management. PODS, 227-238, 2010. 4. M. Arenas, J. Pérez, and C. Riveros. The recovery of a schema mapping: Bringing exchanged data back. TODS, 34(4), 2009. 5. P. A. Bernstein, M. Jacob, J. Pérez, G. Rull, and J. F. Terwilliger. Incremental mapping compilation in an object-to-relational mapping system. SIGMOD, 1269-1280, 2013. 6. P. A. Bernstein and S. Melnik. Model management 2.0: manipulating richer mappings. SIG- MOD, 1-12, 2007. 7. G. de Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. On reconciling data exchange, data integration, and peer data management. PODS, 133-142, 2007. 8. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. TCS, 336(1): 89-124, 2005. 9. A. Fuxman, P. Kolaitis, R. Miller, and W.-C. Tan. Peer Data Exchange. TODS, 31(4), 2006. 10. A. Hernich. Semantics for Non-Monotone Queries in Data Exchange and Data Integration. Data Exchange, Information, and Streams, 5: 161-184, 2013. 11. M. Lenzerini. Data integration: a theoretical perspective. PODS, 233-246, 2002. 12. L. Libkin. Data exchange and incomplete information. PODS, 60-69, 2006. 13. S. Melnik, A. Adya, and P. A. Bernstein. Compiling mappings to bridge applications and databases. TODS, 33(4), 2008. 6