1 Introduction

On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies ?

Ingo Feinerer

Reinhard Pichler

Emanuel Sallinger

Vadim Savenkov

savenkovg@dbai.tuwien.ac.at 0 0 Vienna University of Technology

Second-Order tuple generating dependencies (SO tgds) were introduced by Fagin et al. to capture the composition of simple schema mappings. Testing the equivalence of SO tgds would be important for applications like model management and mapping optimization. However, we prove the undecidability of the logical equivalence of SO tgds. Moreover, under weak additional assumptions, we also show the undecidability of a relaxed notion of equivalence between two SO tgds, namely the so-called conjunctive query equivalence.

1 Introduction

Schema mappings play an important role in several areas of database research, notably in data integration [ 13 ], data exchange [ 9 ], peer data management [ 14 ], and model management [ 6 ]. A schema mapping is given by two schemas, called the source schema and the target schema, as well as a set of dependencies describing the relationship between the source and target schema. The most fundamental form of schema mappings are mappings defined by a set of source-to-target tuple generating dependencies (s-t tgds): They are First-Order formulae of the form 8x('(x) ! 9y (x; y)), where the antecedent '(x) is a conjunctive query (CQ) over the source schema and the conclusion (x; y) is a CQ over the target schema. Intuitively, such an s-t tgd defines a constraint that the presence of certain tuples in the source database I (namely those in the image of some homomorphism h from '(x) to I) enforce the presence of certain tuples in the target database J (s.t. h can be extended to a homomorphism from (x; y) to J ).

Several algebraic operators [ 6, 15 ] on schema mappings have been intensively studied in recent time like computing inverses [ 8, 3, 2 ] and composing schema mappings [ 7, 12, 14, 16 ]. Our work is rather related to the composition operator. Fagin et al. proved that s-t tgds are not powerful enough to express the composition of two mappings defined by s-t tgds [ 12 ]. To remedy this defect, so-called Second-Order tuple generating dependencies (SO tgds) were introduced in [ 12 ]. SO tgds extend s-t tgds by existentially quantified function-variables and equalities of (possibly functional) terms. Details and formal definitions are given in Section 2. It was shown in [ 12 ] that SO tgds capture exactly the closure under composition of mappings defined by s-t tgds. Example 1 ([ 12 ]). Consider the following three schemas. Let S1 consist of the unary relation symbol Emp( ) of employees. Schema S2 consists of a single binary relation ? This work was supported by the Vienna Science and Technology Fund (WWTF), project

ICT08-032. Savenkov was supported by the Erasmus Mundus ECW Program of the EU. symbol Mgr0( ; ) that associates each employee with a manager. Schema S3 consists of a similar binary relation symbol Mgr( ; ) that is intended to provide a copy of Mgr0 and an additional unary relation symbol SelfMgr( ) to store employees who are their own manager. Consider the mappings M12 = (S1; S2; 12) and M23 = (S2; S3; 23) with 12 = f8e(Emp(e) ! 9m Mgr0(e; m))g and 23 = f8e; m(Mgr0(e; m) ! Mgr(e; m)), 8e(Mgr0(e; e) ! SelfMgr(e))g. We are looking for the composition of M12 and M23. It can be verified that this composition can be expressed by the SO tgd = 9f (8e (Emp(e) ! Mgr(e; f (e))) ^ 8e (Emp(e) ^ (e = f (e)) ! SelfMgr(e))):

In this paper, we want to study the equivalence of SO tgds. Note that the question of equivalence naturally arises in several scenarios. Figure 1 illustrates a model evolution scenario, where data structured under some schema S is first migrated to a database with schema T and then further transformed to meet schema U . Now suppose that there exists an alternative migration path from σ schema S via T 0 to schema U . The question

if the two migration paths yield the same re

T sult comes down to checking if the dependenΣ12 Σ23 cies and 0 (which represent the respective mapping compositions) are equivalent. ActuS U ally, Figure 1 can also be thought of as illusΣ‘12 Σ‘23 trating a peer data management scenario, where T‘ some peer with data structured according to S

provides part of its data to some other peer σ‘ with schema T (resp. T 0). The latter peer in

turn passes this data on to yet another peer with Fig. 1. Mapping compositions. schema U . Now suppose that a user may access the data only at the peer with schema U . What happens if some link in this peer data network is broken, say the one corresponding to mapping 23? Will the path of mappings from S via T 0 to U still give the user full access to the data provided by the peer with schema S? Testing the equivalence of and 0 is thus crucial for answering questions of redundancy and reliability in a peer data network.

The equivalence of mappings is also fundamental to mapping optimization. As mentioned in [ 10 ], optimizing a mapping ultimately means replacing the mapping by an “equivalent” one with better (computational) properties. This raises the question of how the “equivalence” of two mappings should be defined. Since dependencies are logical formulae, the most natural notion of equivalence is logical equivalence. In this paper, we show that logical equivalence of SO tgds is undecidable. In order to allow for more flexibility in optimizing mappings, Fagin et al. introduced relaxed notions of equivalence [ 10 ]. In particular, the potential of conjunctive query (CQ) equivalence for optimizing several kinds of mappings was studied in [ 10 ]. Intuitively, two mappings are CQ-equivalent, if conjunctive queries posed against the target database yield the same result for both mappings (for details, see Section 2). For instance, it was shown in [ 12 ] that the composition of the mappings M12 and M23 in Example 1 cannot be represented by an SO tgd without equalities in the antecedent. On the other hand, in Example 1 is CQ-equivalent to 0 = 9f (8e (Emp(e) ! Mgr(e; f (e))). Under the weak additional assumption that the source schema may have key dependencies, we shall prove the undecidability of CQ-equivalence of SO tgds.

Organization of the paper and summary of results. In Section 2, we recall some basic

notions and results. A conclusion and an outlook to future work are given in Section 5. The main results of the paper are detailed in Sections 3 and 4, namely:

Logical equivalence. In Section 3 we prove that the logical equivalence of SO tgds is undecidable. This result is easily obtained via a previous undecidability result in the context of inverse schema mappings [ 3 ].

Logical equivalence vs. CQ-equivalence. In Section 3, we also present a different proof strategy for the undecidability of the logical equivalence of two SO tgds in order to identify an important difference between logical equivalence and CQ-equivalence. More precisely, we show that logical equivalence of two SO tgds remains undecidable even if the two SO tgds are already known to be CQ-equivalent.

CQ-equivalence. In Section 4 we prove the undecidability of CQ-equivalence of SO tgds if the source schema may have key dependencies. The proof is by reduction from the domino problem. As a by-product of this proof, we also get the undecidability of logical equivalence of SO tgds without equalities.

Due to lack of space, proofs are sketched. Details will be provided in the full version. 2

Preliminaries

Schemas and instances. A schema R = fR1; : : : ; Rng is a set of relation symbols Ri each of a fixed arity. An instance I over a schema R consists of a relation for each relation symbol in R, s.t. both have the same arity. For a relation symbol R, we write RI to denote the relation of R in I. We only consider finite instances here. Schema mappings. Let S = fS1; : : : ; Sng and T = fT1; : : : ; Tmg be schemas with no relation symbols in common. A schema mapping is given by a triple M = (S; T; ) where S is the source schema, T is the target schema, and is a set of logical formulae called dependencies expressing the relationship between S and T.

Instances over S (resp. T) are called source (resp. target) instances. We write hS; Ti to denote the schema fS1; : : : ; Sn; T1; : : : ; Tmg. If I is a source instance and J a target instance, then hI; J i is an instance of the schema hS; Ti.

Given a (ground) source instance I, a target instance J is called a solution for I under M if hI; J i j= . The set of all solutions for I under M is denoted by Sol (I; M). Dependencies. Source-to-target tuple generating dependencies (as already defined in the introduction) are logical formulae of the form 8x('(x) ! 9y (x; y)). We write x for a tuple (x1; : : : ; xn). However, by slight abuse of notation, we also refer to the set fx1; : : : ; xng as x. Hence, we may use expressions like xi 2 x or x X, etc.

A key dependency over schema R is of the form 8x(R(u; y; v) ^ R(u; z; w) ! y = z) where R denotes a relation symbol in R with u; v; w x and y; z 2 x.

A second-order tuple generating dependency (SO tgd) is a logical formula of the form 9f ((8x1('1 ! 1)) ^ ^ (8xn('n ! n))) where (1) each member of f is a function symbol, (2) each 'i is a conjunction of atomic formulas of the form S(y1; : : : ; yk) (with S 2 S and yj 2 xi), and equalities of the form t = t0 (with t and t0 terms based on xi and f ), (3) each i is a conjunction of atomic formulas of the form T (t1; : : : ; t`) (with T 2 T where t1, . . . , t` are terms based on xi and f ), and (4) each variable in xi appears in some atomic formula of 'i.

When dealing with instances hI; J i in the context of SO tgds the domain of source instances I is allowed to consist only of constants (i.e., grounded) whereas target instances J may contain functional terms which can be treated like fresh variables, also called labelled nulls. [ 12 ] Homomorphisms. Let I, I0 be instances. A homomorphism h: I ! I0 is a mapping s.t. (1) whenever R(x) 2 I, then R(h(x)) 2 I0, and (2) for every constant c, h(c) = c. If such h exists, we write I ! I0. Moreover, if I $ I0 then we say that I and I0 are homomorphically equivalent.

If a homomorphism h: I ! I0 is invertible, s.t. h 1 is also a homomorphism from I0 to I, then h is called an isomorphism, denoted I = I0. An endomorphism is a homomorphism I ! I. An endomorphism is proper if it is not surjective (for finite instances, this is equivalent to being not injective), i.e., if it reduces the domain of I.

If I is an instance, and I0 I is such that I ! I0 holds but for no other I00 I0: I ! I00 (that is, I0 cannot be further “shrunk” by a proper endomorphism), then I0 is called a core of I. The core is unique up to isomorphism. Hence, we may speak about the core of I. Cores have the following important property: for arbitrary instances J and J 0, J $ J 0 iff core(J ) = core(J 0).

If J 2 Sol (I; M) is such that J ! J 0 holds for any other solution J 0 2 Sol (I; M), then J is called a universal solution. Since the universal solutions for a source instance I are homomorphically equivalent, their core is unique up to isomorphism [ 11 ]. We write UnivSol (I; M) to denote the set of universal solutions for I under mapping M and we write core(I; M) to denote the core of the universal solutions.

A substitution is a mapping which sends variables to other domain elements (i.e., variables or constants). We write = fx1 a1; : : : ; xn ang if maps each xi to ai and is the identity outside fx1; : : : ; xng. The application of a substitution is usually denoted in postfix notation, e.g.: x denotes the image of x under . For an expression '(x) (e.g., a conjunctive query with variables in x ), we write '(x ) to denote the result of replacing every occurrence of every variable x 2 x by x .

Chase. A target instance for a given source instance and a schema mapping can be computed by the chase [ 4 ]. For SO tgds, the original chase procedure of [ 4 ] has to be modified since homomorphisms now have to take the equalities in the antecedents into account. Formally, a mapping h from a conjunct Ci = 8x('i ! i) of an SO tgd to an instance I is a homomorphism if for every relational atom S(y1; : : : ; yk) in 'i the tuple (h(y1); : : : ; h(yk)) is in SI and for every equality t = t0 we have h(t) = h(t0).

Let hI; J i be an instance of a schema hS; Ti and let Ci = 8x('i ! i) be an SO conjunct. If there is a homomorphism h of 'i into the instance I, then the conjunct Ci can be applied to hI; J1i with homomorphism h. I.e., for every target atom T (t1; : : : ; t`) of i the tuple (h(t1); : : : ; h(t`)) is added to the T relation and the union with J1 is taken yielding a new instance J2. The instance hI; J2i is now called the result of applying Ci to hI; J1i with h. The actual application is called a chase step.

A chase sequence is a finite sequence of chase steps. The chase of hI; ;i with an SO tgd is a chase sequence until no conjuncts can be applied anymore. The result of this chase is denoted by chase(I; ). Chasing hI; ;i with SO tgds can be done in polynomial time w.r.t. the size of the source instance and results in a universal solution [ 12 ]. Equivalence of schema mappings. Let M = (S; T; ) and M0 = (S, T; 0) be two schema mappings.

M and M0 are logically equivalent (denoted as M M0) if, for every source instance I and target instance J , the equivalence hI; J i j= , hI; J i j= 0 holds. This is the case iff Sol (I; M) = Sol (I; M0) holds for every source instance I.

M and M0 are conjunctive-query equivalent (denoted as M CQ M0) if, for every source instance I, either Sol (I; M) = ; = Sol (I; M0) or core(I; M) = core(I; M0). Note that the original definition of M CQ M0 is that, for every source instance I, any conjunctive query posed against the target schema yields the same certain answers for the mappings M and M0. The above characterization via the core, which is more convenient for our purposes here, was proved to be equivalent in [ 10 ]. 3

Undecidability of Logical Equivalence

The undecidability of logical equivalence of two SO tgds follows easily from a previous undecidability result in the context of inverse schema mappings [ 3 ].

Theorem 1. Let and 0 be two SO tgds. It is undecidable whether is logically equivalent to 0. 0, i.e. that Proof. (Sketch).1 In [ 8 ], a schema mapping M2 is defined to be an inverse of another mapping M1 if their composition M1 M2 is logically equivalent to the identity mapping, i.e., a set of s-t tgds of the form (8x) P (x) ! P^(x) for every source relation P . In [3, Corollary 9.5] the authors show that the following problem is undecidable: Given mappings M1 and M2 specified by s-t tgds, check whether M2 is an inverse of M1.

In other words, checking if the composition M1 M2 is equivalent to the identity mapping is undecidable. Clearly, a set of s-t tgds can be represented as an SO tgd. Let denote the SO tgd corresponding to the set of s-t tgds of the identity mapping. Likewise, M1 M2 can be represented as an SO tgd, say 0. Then 0 holds iff M2 is an inverse of M1. Hence, the logical equivalence of two SO tgds is undecidable.

We now want to strengthen the above result by relating logical equivalence to CQequivalence. Below, we show that logical equivalence of two SO tgds remains undecidable even if the two SO tgds are already known to be CQ-equivalent. To this end, we adopt a different proof strategy which goes by reduction from the following variant of the Halting problem: Given a Turing machine TM, does TM halt when run on empty input? Given an arbitrary instance TM of the Halting problem, we have to construct two SO tgds and 0, s.t. the following equivalence holds: TM halts on empty input , 6 0. A major challenge of this reduction is that SO tgds do not directly provide us with recursion or with a means to enforce equalities. 1 We thank the anonymous referee of AMW 2011 who pointed out this proof. Schemas. Let TM be a Turing machine (Q; A; ; q0) with set of states Q = f0; : : : ; mg, tape alphabet A = f0; : : : ; ng, transition function : Q A ! Q A f ; !g, and initial state q0. W.l.o.g., assume that q0 = 0 and let 1 denote the halting state. Moreover, let 1 denote the tape starting symbol . and let 0 denote the blank symbol t.

We represent the state, tape and cursor of a Turing machine configuration at time t and tape location l by the function symbols stateq(t), tapea(t; l), and cursor(t; l) (where q 2 Q, a 2 A). Each of these functions is intended to have Boolean range, e.g. tapea(t; l) = 1 indicates that at time t, the symbol a is stored on the tape at location l, and tapea(t; l) = 0 otherwise. We will ensure this Boolean range using our SO tgds.

As source schema, we use root( ) and chain( ; ), which is intended to describe a linear order starting at root(r) and continued by chain( ; ) atoms, i.e. source instance I should have the form I = froot(0); chain(0; 1); chain(1; 2); : : :g. Still, we must make sure through our SO tgds that what is described by root and chain does not deviate from a linear order in a way that adversely affects our simulation.

We define our target schema to consist of the relation symbol range( ), which is intended to describe the range of the functions we defined above. Furthermore, we use the target relation symbol halt to indicate a halting computation as well as to signify an error condition. That is, an intended target instance should have the form J = frange(0); range(1)g and possibly include halt().

Dependencies. Given an arbitrary Turing machine TM, we define the SO tgds and 0 as a “big” conjunction consisting of several groups of conjuncts (i.e., implications). As we will see, it is possible to construct an SO tgd that simulates a computation of TM.

We start by describing initialization conjuncts which are used to encode the initial configuration of the TM on the empty input tape. Since we cannot enforce equalities on the right-hand side of implications, all the implications are formulated in such a way that taking the wrong value will lead to an error condition indicated by halt(). E.g., we want to enforce the initial state state0(r) = 1. Therefore, in the case that state0(r) = 0 holds, we enforce halt() in the target instance to indicate this error condition: root(r) ^ state0(r) = 0 ! halt() Note that this does not yet cover the case that state0(r) 62 f0; 1g. We construct other implications which define the rest of the initial configuration analogously.

We now define transition conjuncts, which implement the transition function of TM. In what follows, we write dom(z) as a short-hand for a predicate that is satisfied by instantiating z to an arbitrary constant occurring in the source instance, i.e., dom(z) has to be replaced by root(z), chain(z; ), or chain( ; z). Such a predicate is needed to fulfil the safety condition of SO tgds. We first define a common part of the following implications, matching the current state q and the current tape symbol a.

' := stateq(t) = 1 ^ cursor(t; l) = 1 ^ tapea(t; p) = 1 ^ chain(t; t0) ^ dom(l) Suppose that the transition function contains the transition (q; a) = (q0; a0; the SO tgds and 0 contain the following conjuncts: ). Then ' ^ stateq0 (t0) = 0 ! halt() ' ^ tapea0 (t0; l) = 0 ! halt() ' ^ chain(l0; l) ^ cursor(t0; l0) = 0 ! halt()

While the transition conjuncts define what changes, we now construct inertia conjuncts which describe the function values of what remains the same. In particular, we can construct (but for space reasons omit here) conjuncts, which define that the cursor is not located at positions that are (a) more than one move away from the previous cursor location, (b) at the same location, or (c) at the location opposite to the cursor movement.

Furthermore, we need to state that the tape symbol remains the same wherever the cursor is not located. That is, for each q 2 Q, a 2 A: stateq(t) = 1 ^ cursor(t; l) = 0 ^ tapea(t; l) = 1 ^ dom(l) ^

chain(t; t0) ^ tapea(t0; l) = 0 ! halt() Finally, we also have to construct uniqueness conjuncts, ensuring that at each time t, there is exactly one stateq(t) and at each time t and location l exactly one tapea(t; l) whose function value is 1.

Guarding conjuncts. It is clear that, since the domain is in general non-Boolean, we cannot prevent functions stateq, tapea and cursor to take a range outside of f0; 1g for arbitrary interpretations. However, the structure of our reduction will show that we have to either, for a given interpretation of the function symbols, define the interpretation of the relation symbols, or for a given interpretation of the relation symbols, define the interpretation of the function symbols.

So if we define the functions, we can guarantee Boolean domain ourselves. If we define the relations, and in particular the target relations, we can use function guarding conjuncts to materialize the range in the target:

dom(x) ^ dom(y) ^ state (x) = y ! range(y) Analogous conjuncts are introduced for tape and cursor. This ensures the Boolean range of our functions, given appropriately constructed target relations (rangeJ = f0; 1g).

Now that we have sufficient control over the functions, we still need to gain some control over the source instance, since our simulation is intended to work on a linear order described by root and chain. We therefore define source guarding conjuncts. What we can directly enforce is that the chain relation includes no self-loops and never returns to any root using the following conjuncts: chain(x; x) ! halt() root(r) ^ chain( ; r) ! halt()

However, what we cannot directly exclude is that some element x of the chain has two distinct predecessors px 6= px0 with chain(px; x) and chain(px0; x). In particular, we do not have equality on the right-hand side, which would allow us to require px = px0. Nor can we use a derivation of halt() when there are two predecessors chain(px; x) and chain(px0; x) because we cannot guarantee that px 6= px0 on the left-hand side.

But what we can do is control the effect of such multiple predecessors (or successors). Notably, if these multiple predecessors or successors actually have no effect on our simulation, then we have no problem with these discrepancies from linear order, since the simulation is doing the correct thing anyway. Altogether, we cannot detect all deviations from a linear order, e.g., two distinct root(r) atoms cannot be ruled out. But we can gain just enough control to ensure the correctness of our reduction. Halting detection. We now construct the two SO tgds for our reduction, namely such that TM halts , 6 0. The first one, , simply consists of all the conjuncts we constructed up to this point, that is simulation conjuncts and guarding conjuncts. The function symbols tape , state and cursor are in the scope of an existential quantifier over the entire conjunction. We now define the halting detection conjunct: dom(t) ^ state1(t) = 1 ! halt() and construct 0 to be the same as plus this halting detection conjunct. So overall, both and 0 simulate the computation of Turing machine TM, but only 0 indicates a halting state by enforcing halt().

Theorem 2. Let and 0 be two SO tgds. It is undecidable whether 0 (i.e. is logically equivalent to 0), even when CQ 0 (i.e. is CQ-equivalent to 0). Proof. (Sketch). We constructed 0 as with an additional conjunct, therefore it is sufficient to show that TM halts , 6 0. We show both directions separately. The challenging part of the reduction is that, in one direction we have control over the functions, but not over the relations, and the converse for the other direction.

Assume that TM halts. Then we have to find an instance K = (I; J ) such that K but K 6 0. Since TM halts, there is a computation of length n that ends in the halting state. We construct the instance K = (I; J ) as

I = froot(1); chain(1; 2); : : : ; chain(n 1; n)g J = frange(0); range(1)g Then K is clearly the case. What needs to be shown is that K 6 0. In particular, K 6 0 means that for arbitrary f , we have (K; f ) 6 '0. That is, we constructed K, but must deal with arbitrary f for '0. We proceed by case distinction, assuming vt;l;a is the correct value of tapea(t; l) according to the transition function of TM: 1. tapea(t; l) = vt;l;a for all t; l; a, that is, it correctly represents the TM computation 2. tapea(t; l) = :vt;l;a for some t; l; a, that is, the simulation violates TM 3. For some tapea(t; p) = y we have that y 62 f0; 1g, i.e., non-Boolean range and in each case show K; f 6 0. This also holds for state and cursor. Assume that TM does not halt. Given an arbitrary instance K = (I; J ), we have to show that K implies K 0. That is, we deal with arbitrary K, but only have to find a single f for '0. Assume that TM does not halt. Then let K = (I; J ) be an arbitrary instance. We proceed by case distinction based on K: 1. J already contains an atom halt() 2. Source instance I deviates from a linear order significantly for our simulation 3. J contains some range(c) atom with c 62 f0; 1g, i.e., allows non-Boolean range 4. All range(c) atoms in J have c 2 f0; 1g and in each case show K; f 0.

Finally, note that the SO tgds and 0 in the above problem reduction are CQ-equivalent. The claim of the theorem thus follows.

Equalities are fundamental in SO tgds in order to define the composition of two schema mappings [12, Theorem 5.4]. Equalities also play a central role in our undecidability proof of logical equivalence for SO tgds in Theorem 2 above. However, if we consider CQ-equivalence rather than logical equivalence, then the equalities are not strictly necessary anymore and can be removed: In [ 2 ] it is proved that every SO tgd is CQ-equivalent to a plain SO tgd (a class of SO tgds which do not contain equalities). As a consequence the previous proof idea cannot be reused and we need a set of new proof techniques to show undecidability also for CQ-equivalence.

Undecidability of Conjunctive Query Equivalence

We now show the undecidability of CQ-equivalence of SO tgds by a reduction from the Domino Problem (DP), which was first shown undecidable by Berger [ 5 ]. We use the following, slightly modified version of DP [1, p.414].

Definition 1. A domino system (DS) is given by a set D of domino types defined as follows. Let C be a set of colours. Then, each domino type is a quadruple hl; t; r; bi of values from C, corresponding respectively to the left, top, right and bottom colour of a square domino piece. Let d:side denote an accessor function D ! C returning the colour of d, corresponding to side. Then, the Domino Problem (DP) given by D asks if for each m; n 2 N a consistent tiling of an m n grid with the domino types from D exists. That is, if there is a function tm;n: f1 : : : mg f1 : : : ng ! D, such that the equality tm;n(i; j):right = tm;n(i + 1; j):left holds for 1 i < m and 1 j n and tm;n(k; l):bottom = tm;n(k; l + 1):top for 1 k m and 1 l < n.

Chains, proper instances, and coordinates. We model a grid using the source schema

S consisting of two binary relations Ch and Cv intended to define a horizontal and vertical successor relation: For an instance I of S, ChI and CvI denote the relations of I. Their intended form is f(a0; a1); (a1; a2) : : : (an 1; an)g, with ai 6= aj , for every 0 i < j n. We define a < b w.r.t. an order C, if (a; b) 2 C or 9x; C(x; b) ^ a < x. Taking the tuples of C as edges of a directed graph (which we call the dual graph), the order defined above corresponds to an acyclic chain. If each relation of an instance I over S contains a single acyclic chain, I is called proper. The size (m; n) of such a proper instance I is defined as (jChj; jCvj).

Domino tilings are modelled by the target relation D=4. Each of its four arguments is populated by a functional term cs(x; y) where c corresponds to the colour, s to the side of the domino piece, and the pair (x; y) defines the position in the grid. To save notation, we will omit parentheses when writing the terms in the relation D (e.g. the above term will be written as csxy). Given a fact d 2 D, we define the accessor function d:n, returning the term in the nth place of d, 1 n 4.

Definition 2. Let D be a DS with types d1 : : : dn. A simulation mapping D for D over the source schema S = fCh( ; ); Cv( ; )g and the target schema T = fD( ; ; ; )g consists of an SO tgd D and four source key dependencies: 1: Ch(x; x1) ^ Ch(x; x2) ! x1 = x2 3: Cv(x; x1) ^ Cv(x; x2) ! x1 = x2 2: Ch(x1; x) ^ Ch(x2; x) ! x1 = x2 4: Cv(x1; x) ^ Cv(x2; x) ! x1 = x2

The SO tgd D has the form 9c9ph9pv9qh9qv( d1 ^ : : : ^ dn ^ 1 ^ 2), where d1 : : : dn encode the corresponding domino types:

di : Ch(x1; x2) ^ Cv(y1; y2) ! D(lh x1 y2; tv x2 y1; rh x2 y2; bv x2 y2); l; t; r; b denoting respectively the left, top, right and bottom colours of di. Superscripts distinguish the horizontal and vertical dimensions, so that there are at most two distinct functions in c for each colour in D. Moreover, c contains neither phjv nor qhjv.

The conjuncts 1 and 2 are called guards and have a form similar to di : h: Ch(x0; x1) ^ Ch(x1; x2) ^ Cv(y1; y2) ! D(ph x1 y2; pv x2 y1; ph x2 y2; pv x2 y2) v: Ch(x1; x2) ^ Cv(y0; y1) ^ Cv(y1; y2) ! D(qh x1 y2; qv x2 y1; qh x2 y2; qv x2 y2) wv1 0 Informally, given a proper source instance of size (m; n), the conjuncts d1 : : : djDj create a stack of jDj domino tiles at each position in the m n grid. Additionally, the guards h and v create two more tiles with unique colours p and q on each position (i; j) with i; j > 1, or a single tile if i = 1 or j = 1. The guards tackle source instances with cycles instead of the linear order, as will be explained later.

For a source instance I, a D-fact d enforced by the SO tgd in a simulation mapping (that is, d 2 chase(I; D)) has a form D(lhxp y; tvxyp; rhxy; bvxy), such that (xp; x) 2 ChI , (yp; y) 2 CvI .

Definition 3. Let D be a DS and I a proper instance I of size (m; n). A fact d = D(lhxpy; tvxyp; rhxy; bvxy) 2 chase(I; D) is associated with the domino type hl; t; r; bi, and with the pair (x; y) of I-values, occurring as arguments in its two last terms. Moreover, let i be the ordinal of x w.r.t. the order defined by ChI and j be the ordinal of y w.r.t. CvI . We call the pair (i; j) the coordinates of d defining its grid position. Property 1. Let I be a proper source instance, and D the simulation mapping of a DS D. Moreover, let d1 and d2 be two facts in chase(I; D), d1 associated with the pair of I-values (x1; y1), and d2 with (x2; y2). Then, the following properties hold true: 1. d1:3 = d2:1 ) (x1; x2) 2 CI

h ^ y1 = y2 ^ 8i 6= 3 8j 6= 1 d1:i 6= d2:j. 2. d1:2 = d2:4 ) (y1; y2) 2 CvI ^ x1 = x2 ^ 8 6 i = 4 8j 6= 2 d1:i 6= d2:j. 3. 9i 2 f1:::4g d1:i = d2:i ) x1 = x2 ^ y1 = y2. 4. d1:1 6= d2:2 ^ d1:3 6= d2:4 ^ d1:1 6= d2:4.

Claim 1 stipulates, that whenever a term is shared between the 3rd place of d1 and the 1st place of d2, the facts correspond to horizontally adjacent grid positions (i.e., have first coordinates i and i + 1 and the same second coordinate) and can have no further terms in common. Claim 2 is an analogue of Claim 1 for the equality d1:4 = d2:2 and vertical adjacency: d1 and d2 must have coordinates (i; j) and (i; j + 1). Claim 3 states that terms at respective positions can be only shared by facts with the same coordinates. Finally, Claim 4 restricts term equalities to the above cases.

Example 2. Consider a DS D containing the four types a; b; c; d in Figure 2. In the simulation mapping D, the first type is modelled by the following implicational conjunct in the SO tgd a: Ch(x1; x2)^Cv(y1; y2) ! D(wh x1 y2; wv x2 y1; gh x2 y2; gv x2 y2). Here, w and g stand for white and gray colours, respectively.

The 3 2 grid is represented by a proper source instance I (see top of Figure 2). Six dotted rectangles represent a consistent tiling of this grid with D. Each rectangle bears the four terms of the corresponding D-fact in chase(I; D): E.g., the topmost leftmost tile (the one at position (1,1), that is) is encoded by the fact D(wh01; wv10; gh11; gv11). Problem reduction. Let D be a DS and D its simulation mapping. Furthermore, consider a singleton colour set Cb = fbg (where b stands for “black”) and a corresponding domino set B = fhb; b; b; big with the simulation mapping B. We claim that D has a soluTtihoen ifSfO DtgdC Qin B Bholhdas.s the following form: B: 9bh9bv9ph9pv9qh9qv

Ch(x1; x2) ^ Cv(y1; y2) ! D(bh x1 y2; bv x2 y1; bh x2 y2; bv x2 y2) ^ h ^ v . It is easy to see that the conjuncts h and v are redundant in B. Hence, in the following we assume B to contain a single implicational conjunct and no guards. Moreover, the following relationship holds between the chase result of B and D: Property 2. Let D be a DS with simulation mapping D, let (i1; j1) and (i2; j2) be two distinct grid positions, and let d1; d2 2 chase(I; D) be facts at positions (i1; j1) resp. (i2; j2). Then also chase(I; B) contains two facts d01; d02 at these grid positions. Moreover, for every 1 k; l 4, if d1:k = d2:l then d01:k = d02:l, i.e., if two terms in d1 and d2 are equal, then the terms at these places in d01 and d02 are also equal.

This property uses an observation, that every implicational conjunct di encoding a domino type di in the SO tgd of a simulation mapping produces exactly one D-fact for each grid position, and that B 2 B uses the minimal set of functional symbols.

For DPs that have a solution (of which B is obviously an example) the following lemma, which is crucial for our construction, holds: Lemma 1. Let I be a proper instance of size (m; n) and let D be a DS. Then, there is a consistent tiling of an m n grid with D iff core(I; D) = core(I; B) holds. Example 3. It is easy to check, that the six D-facts in Figure 2 indeed form core(I; D). For I, B yields a solution with the isomorphic core: each fact in core(I; B) can be obtained from some fact of core(I; D) by replacing leading symbols w and g with b.

To lift the equivalence between solvability of D and existence of an isomorphism core(I; D) = core(I; B) to unrestricted source instances I, we will first show, that whenever ChI or CvI contains a single cycle (that is, its dual graph is connected, each vertex having a single incoming and a single outgoing edge), then any simulation mapping yields an instance with the same core as chase(I; B).

Lemma 2. Let I contain a single cycle in Ch or in Cv. Then, for arbitrary DS D, core(I; D) = core(I; B) holds.

Proof (Sketch). W.l.o.g. assume that the single cycle is contained in Ch. For such I, the guard h in the SO tgd D 2 D is indistinguishable from B in B. Indeed, the conclusions of these implications coincide up to renaming of functional terms. The antecedent 'h of h has an extra atom C(x0; x1), that prevents it from being satisfied by a Ch-fact at the beginning of the chain. However, if ChI is a cycle, the relations q1j2 defined as q1(x1; x2) $ Ch(x1; x2) and q2(x1; x2) $ 9x0 Ch(x0; x1) ^ Ch(x1; x2), coincide, and so do B and h. Hence, J = chase(I; D) contains a subinstance JB isomorphic to chase(I; B). Using Property 2, J ! JB can be shown.

Next, we address the source instances I containing multiple connected components iGnKChhIajvs. tThheefaccotnsnoefctKednaesssveirstidceefis,necdonansefcotleldowbsy: aGnivuenndairnecintesdtaendcgeeKw,htheneefvaecrt cgorarrpehsponding facts have a common term. The facts f1; f2 2 K are connected, if there is a path between the corresponding vertices in GK . The following property can be shown by induction on the length of the path between two D-facts: Lemma 3. Let D be a simulation mapping, I a source instance over S, and let d1; d2 be facts in J = chase(I; D) containing respectively terms (x1; y1) and (x2; y2). Assume that x1 is a term in a fact c0h 2 ChI , and x2 in c0h0 2 ChI . Likewise, let y1 and y2 occur, respectively, in the facts c0v; c0v0 2 CvI . Then, d1 and d2 are connected in J only if the facts c0h and c0h0 are connected in ChI , and so are c0v and c0v0 in CvI .

The above lemma implies, that if the source I contains lh and lv connected components in ChI and CvI respectively, then for any simulation mapping , chase(I; ) can be decomposed into lhlv pairwise disconnected instances. Furthermore, if dual graphs of the connected components of CI

hjv are acyclic chains or cycles, then by applying one of the Lemmas 1, 2 in each of the lhlv cases, the reduction established by these lemmas can be lifted to the the case when relations CI

Finally, we observe that union of chahinjvs caonndtaciyncmleuslitsiptlheecoonnlnyectytepde coofmdupaolngenratsp.hs which is relevant. Indeed, if the dual graph of a source instance I contains a V-formed subgraph fC(x; y1); C(x; y2)g or fC(y1; x); C(y2; x)g with y1 6= y2, then the key dependencies ensure that for any simulation mapping (including B) the set Sol (I; ) is empty. This concludes our reduction from the domino problem to CQ-equivalence of SO tgds. We may thus formulate the main result of this section.

Theorem 3. The CQ-equivalence of mappings consisting of SO tgds and source key dependencies is undecidable.

For the simulation mappings D and

easily proved that D CQ B holds iff

B in the above problem reduction, it can be D B holds. We thus immediately get

Theorem 4. The logical equivalence of mappings consisting of SO tgds without equalities in the antecedents and source key dependencies is undecidable. 5

Conclusion

Testing the equivalence of schema mappings is a fundamental task. For mappings defined by s-t tgds and target dependencies in the form of tgds and equality generating dependencies (egds) this problem has been well studied. In the general case, the (logical) equivalence of such mappings is undecidable. If the target tgds admit a finite chase, then the problem becomes decidable [ 4 ]. In [ 10 ], two relaxed notions of equivalence were introduced, namely the CQ-equivalence studied here and data exchange equivalence (DE-equivalence). Moreover, in [ 10 ], the undecidability of CQ-equivalence was proved for mappings consisting of s-t tgds and target tgds. The undecidability was extended in [ 17 ] in two directions, namely to mappings with target egds and from CQ-equivalence to DE-equivalence. The equivalence of SO tgds has been unexplored so far.

In this paper, we have proved the undecidability of logical equivalence. For the case of source schemas with key dependencies, we proved the undecidability of CQequivalence. The status of CQ-equivalence without any assumptions on the source schema has been left open. Closing this gap is on top of the agenda for future work. Moreover, the search for decidable fragments should be initiated. Finally, we also want to study the DE-equivalence [ 10 ] mentioned above. We conjecture that DE-equivalence of SO tgds is also undecidable. However, this has to be proved yet.

Allauzen and

Durand . The classical decision problem, by E . Bo¨rger, E. Gra¨del and Yu . Gurevich, chapter “Appendix A: Tiling Problems”, pages 407 - 420 . Springer, 2001 .

Arenas , J. Pe´rez,

J. L.

Reutter , and

Riveros . Composition and inversion of schema mappings . SIGMOD Record , 38 ( 3 ): 17 - 28 , 2009 .

Arenas , J. Pe´rez, and

Riveros . The recovery of a schema mapping: Bringing exchanged data back . ACM Trans. Database Syst ., 34 ( 4 ), 2009 .

Beeri and

M. Y.

Vardi . A proof procedure for data dependencies . J. ACM , 31 ( 4 ): 718 - 741 , 1984 .

Berger . The undecidability of the domino problem . Amer Mathematical Society , 1966 .

P. A.

Bernstein . Applying model management to classical meta data problems . In Proc. CIDR'03 , 2003 . available from http://www-db.cs.wisc.edu/cidr/cidr2003/program/p19.pdf.

P. A.

Bernstein ,

T. J.

Green ,

Melnik , and

Nash . Implementing mapping composition . VLDB J ., 17 ( 2 ): 333 - 353 , 2008 .

Fagin . Inverting schema mappings . In Proc. PODS'06 , pages 50 - 59 . ACM, 2006 .

Fagin ,

P. G.

Kolaitis ,

R. J.

Miller , and

Popa . Data exchange: semantics and query answering . Theor. Comput. Sci. , 336 ( 1 ): 89 - 124 , 2005 .

10.

Fagin ,

P. G.

Kolaitis ,

Nash , and

Popa . Towards a theory of schema-mapping optimization . In Proc. PODS'08 , pages 33 - 42 . ACM, 2008 .

11.

Fagin ,

P. G.

Kolaitis , and

Popa . Data exchange: getting to the core . ACM Trans. Database Syst ., 30 ( 1 ): 174 - 210 , 2005 .

12.

Fagin ,

P. G.

Kolaitis ,

Popa , and

W. C.

Tan . Composing schema mappings: Second-order dependencies to the rescue . ACM Trans. Database Syst ., 30 ( 4 ): 994 - 1055 , 2005 .

13. A. Y. Halevy , A.

Rajaraman , and J. J.

Ordille . Data integration: The teenage years . In Proc. VLDB'06 , pages 9 - 16 . ACM, 2006 .

14.

Madhavan and

A. Y.

Halevy . Composing mappings among data sources . In Proc. VLDB'03 , pages 572 - 583 , 2003 .

15.

Melnik . Generic Model Management: Concepts and Algorithms , volume 2967 of Lecture Notes in Computer Science. Springer, 2004 .

16.

Nash ,

P. A.

Bernstein , and

Melnik . Composition of mappings given by embedded dependencies . ACM Trans. Database Syst ., 32 ( 1 ): 4 , 2007 .

17.

Pichler , E. Sallinger, and

Savenkov . Relaxed notions of schema mapping equivalence revisited . In Proc. ICDT'11 , pages 90 - 101 . ACM, 2011 .