On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies

On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies IngoFeinerer Vienna University of Technology ReinhardPichler Vienna University of Technology EmanuelSallinger Vienna University of Technology VadimSavenkov Vienna University of Technology On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies 79093C65250598CA46EDF8706A0490C2 GROBID - A machine learning software for extracting information from scholarly documents

Second-Order tuple generating dependencies (SO tgds) were introduced by Fagin et al. to capture the composition of simple schema mappings. Testing the equivalence of SO tgds would be important for applications like model management and mapping optimization. However, we prove the undecidability of the logical equivalence of SO tgds. Moreover, under weak additional assumptions, we also show the undecidability of a relaxed notion of equivalence between two SO tgds, namely the so-called conjunctive query equivalence.

Introduction

Schema mappings play an important role in several areas of database research, notably in data integration [13], data exchange [9], peer data management [14], and model management [6]. A schema mapping is given by two schemas, called the source schema and the target schema, as well as a set of dependencies describing the relationship between the source and target schema. The most fundamental form of schema mappings are mappings defined by a set of source-to-target tuple generating dependencies (s-t tgds): They are First-Order formulae of the form ∀x(ϕ(x) → ∃yψ(x, y)), where the antecedent ϕ(x) is a conjunctive query (CQ) over the source schema and the conclusion ψ(x, y) is a CQ over the target schema. Intuitively, such an s-t tgd defines a constraint that the presence of certain tuples in the source database I (namely those in the image of some homomorphism h from ϕ(x) to I) enforce the presence of certain tuples in the target database J (s.t. h can be extended to a homomorphism from ψ(x, y) to J).

Several algebraic operators [6,15] on schema mappings have been intensively studied in recent time like computing inverses [8,3,2] and composing schema mappings [7,12,14,16]. Our work is rather related to the composition operator. Fagin et al. proved that s-t tgds are not powerful enough to express the composition of two mappings defined by s-t tgds [12]. To remedy this defect, so-called Second-Order tuple generating dependencies (SO tgds) were introduced in [12]. SO tgds extend s-t tgds by existentially quantified function-variables and equalities of (possibly functional) terms. Details and formal definitions are given in Section 2. It was shown in [12] that SO tgds capture exactly the closure under composition of mappings defined by s-t tgds.

Example 1 ([12]). Consider the following three schemas. Let S 1 consist of the unary relation symbol Emp(•) of employees. Schema S 2 consists of a single binary relation symbol Mgr (•, •) that associates each employee with a manager. Schema S 3 consists of a similar binary relation symbol Mgr(•, •) that is intended to provide a copy of Mgr and an additional unary relation symbol SelfMgr(•) to store employees who are their own manager. Consider the mappings M 12 = (S 1 , S 2 , Σ 12 ) and M 23 = (S 2 , S 3 , Σ 23 ) with Σ 12 = {∀e(Emp(e) → ∃m Mgr (e, m))} and Σ 23 = {∀e, m(Mgr (e, m) → Mgr(e, m)), ∀e(Mgr (e, e) → SelfMgr(e))}. We are looking for the composition of M 12 and M 23 . It can be verified that this composition can be expressed by the SO tgd σ = ∃f (∀e (Emp(e) → Mgr(e, f (e))) ∧ ∀e (Emp(e) ∧ (e = f (e)) → SelfMgr(e))).

In this paper, we want to study the equivalence of SO tgds. Note that the question of equivalence naturally arises in several scenarios. Figure 1 illustrates a model evolution scenario, where data structured under some schema S is first migrated to a database with schema T and then further transformed to meet schema U . Now suppose that there exists an alternative migration path from schema S via T to schema U . The question if the two migration paths yield the same result comes down to checking if the dependencies σ and σ (which represent the respective mapping compositions) are equivalent. Actually, Figure 1 can also be thought of as illustrating a peer data management scenario, where some peer with data structured according to S provides part of its data to some other peer with schema T (resp. T ). The latter peer in turn passes this data on to yet another peer with schema U . Now suppose that a user may access the data only at the peer with schema U . What happens if some link in this peer data network is broken, say the one corresponding to mapping Σ 23 ? Will the path of mappings from S via T to U still give the user full access to the data provided by the peer with schema S? Testing the equivalence of σ and σ is thus crucial for answering questions of redundancy and reliability in a peer data network.

The equivalence of mappings is also fundamental to mapping optimization. As mentioned in [10], optimizing a mapping ultimately means replacing the mapping by an "equivalent" one with better (computational) properties. This raises the question of how the "equivalence" of two mappings should be defined. Since dependencies are logical formulae, the most natural notion of equivalence is logical equivalence. In this paper, we show that logical equivalence of SO tgds is undecidable. In order to allow for more flexibility in optimizing mappings, Fagin et al. introduced relaxed notions of equivalence [10]. In particular, the potential of conjunctive query (CQ) equivalence for optimizing several kinds of mappings was studied in [10]. Intuitively, two mappings are CQ-equivalent, if conjunctive queries posed against the target database yield the same result for both mappings (for details, see Section 2). For instance, it was shown in [12] that the composition of the mappings M 12 and M 23 in Example 1 cannot be represented by an SO tgd without equalities in the antecedent. On the other hand, σ in Example 1 is CQ-equivalent to σ = ∃f (∀e (Emp(e) → Mgr(e, f (e))). Under the weak additional assumption that the source schema may have key dependencies, we shall prove the undecidability of CQ-equivalence of SO tgds.

Organization of the paper and summary of results. In Section 2, we recall some basic notions and results. A conclusion and an outlook to future work are given in Section 5. The main results of the paper are detailed in Sections 3 and 4, namely: • Logical equivalence. In Section 3 we prove that the logical equivalence of SO tgds is undecidable. This result is easily obtained via a previous undecidability result in the context of inverse schema mappings [3].

• Logical equivalence vs. CQ-equivalence. In Section 3, we also present a different proof strategy for the undecidability of the logical equivalence of two SO tgds in order to identify an important difference between logical equivalence and CQ-equivalence. More precisely, we show that logical equivalence of two SO tgds remains undecidable even if the two SO tgds are already known to be CQ-equivalent.

• CQ-equivalence. In Section 4 we prove the undecidability of CQ-equivalence of SO tgds if the source schema may have key dependencies. The proof is by reduction from the domino problem. As a by-product of this proof, we also get the undecidability of logical equivalence of SO tgds without equalities.

Due to lack of space, proofs are sketched. Details will be provided in the full version.

Preliminaries

Schemas and instances. A schema R = {R 1 , . . . , R n } is a set of relation symbols R i each of a fixed arity. An instance I over a schema R consists of a relation for each relation symbol in R, s.t. both have the same arity. For a relation symbol R, we write R I to denote the relation of R in I. We only consider finite instances here.

Schema mappings. Let S = {S 1 , . . . , S n } and T = {T 1 , . . . , T m } be schemas with no relation symbols in common. A schema mapping is given by a triple M = (S, T, Σ) where S is the source schema, T is the target schema, and Σ is a set of logical formulae called dependencies expressing the relationship between S and T.

Instances over S (resp. T) are called source (resp. target) instances. We write S, T to denote the schema {S 1 , . . . , S n , T 1 , . . . , T m }. If I is a source instance and J a target instance, then I, J is an instance of the schema S, T .

Given a (ground) source instance I, a target instance J is called a solution for I under M if I, J |= Σ. The set of all solutions for I under M is denoted by Sol (I, M).

Dependencies. Source-to-target tuple generating dependencies (as already defined in the introduction) are logical formulae of the form ∀x(ϕ(x) → ∃yψ(x, y)). We write x for a tuple (x 1 , . . . , x n ). However, by slight abuse of notation, we also refer to the set {x 1 , . . . , x n } as x. Hence, we may use expressions like x i ∈ x or x ⊆ X, etc.

A key dependency over schema R is of the form ∀x(R(u, y, v) ∧ R(u, z, w) → y = z) where R denotes a relation symbol in R with u, v, w ⊆ x and y, z ∈ x.

A second-order tuple generating dependency (SO tgd) is a logical formula of the form ∃f ((∀x

1 (ϕ 1 → ψ 1 )) ∧ • • • ∧ (∀x n (ϕ n → ψ n )))

where (1) each member of f is a function symbol, (2) each ϕ i is a conjunction of atomic formulas of the form S(y 1 , . . . , y k ) (with S ∈ S and y j ∈ x i ), and equalities of the form t = t (with t and t terms based on x i and f ), (3) each ψ i is a conjunction of atomic formulas of the form T (t 1 , . . . , t ) (with T ∈ T where t 1 , . . . , t are terms based on x i and f ), and (4) each variable in x i appears in some atomic formula of ϕ i .

When dealing with instances I, J in the context of SO tgds the domain of source instances I is allowed to consist only of constants (i.e., grounded) whereas target instances J may contain functional terms which can be treated like fresh variables, also called labelled nulls. [12] Homomorphisms. Let I, I be instances. A homomorphism h: I → I is a mapping s.t.

(1) whenever R(x) ∈ I, then R(h(x)) ∈ I , and (2) for every constant c, h(c) = c. If such h exists, we write I → I . Moreover, if I ↔ I then we say that I and I are homomorphically equivalent.

If a homomorphism h: I → I is invertible, s.t. h −1 is also a homomorphism from I to I, then h is called an isomorphism, denoted I ∼ = I . An endomorphism is a homomorphism I → I. An endomorphism is proper if it is not surjective (for finite instances, this is equivalent to being not injective), i.e., if it reduces the domain of I.

If I is an instance, and I ⊆ I is such that I → I holds but for no other I ⊂ I : I → I (that is, I cannot be further "shrunk" by a proper endomorphism), then I is called a core of I. The core is unique up to isomorphism. Hence, we may speak about the core of I. Cores have the following important property: for arbitrary instances J and J , J ↔ J iff core(J) ∼ = core(J ).

If J ∈ Sol (I, M) is such that J → J holds for any other solution J ∈ Sol (I, M), then J is called a universal solution. Since the universal solutions for a source instance I are homomorphically equivalent, their core is unique up to isomorphism [11]. We write UnivSol (I, M) to denote the set of universal solutions for I under mapping M and we write core(I, M) to denote the core of the universal solutions.

A substitution λ is a mapping which sends variables to other domain elements (i.e., variables or constants). We write λ = {x 1 ← a 1 , . . . , x n ← a n } if λ maps each x i to a i and λ is the identity outside {x 1 , . . . , x n }. The application of a substitution is usually denoted in postfix notation, e.g.: xλ denotes the image of x under λ. For an expression ϕ(x) (e.g., a conjunctive query with variables in x ), we write ϕ(xλ) to denote the result of replacing every occurrence of every variable x ∈ x by xλ.

Chase. A target instance for a given source instance and a schema mapping can be computed by the chase [4]. For SO tgds, the original chase procedure of [4] has to be modified since homomorphisms now have to take the equalities in the antecedents into account. Formally, a mapping h from a conjunct C i = ∀x(ϕ i → ψ i ) of an SO tgd to an instance I is a homomorphism if for every relational atom S(y 1 , . . . , y k ) in ϕ i the tuple (h(y 1 ), . . . , h(y k )) is in S I and for every equality t = t we have h(t) = h(t ).

Let I, J be an instance of a schema S, T and let C i = ∀x(ϕ i → ψ i ) be an SO conjunct. If there is a homomorphism h of ϕ i into the instance I, then the conjunct C i can be applied to I, J 1 with homomorphism h. I.e., for every target atom T (t 1 , . . . , t ) of ψ i the tuple (h(t 1 ), . . . , h(t )) is added to the T relation and the union with J 1 is taken yielding a new instance J 2 . The instance I, J 2 is now called the result of applying C i to I, J 1 with h. The actual application is called a chase step.

A chase sequence is a finite sequence of chase steps. The chase of I, ∅ with an SO tgd Σ is a chase sequence until no conjuncts can be applied anymore. The result of this chase is denoted by chase(I, Σ). Chasing I, ∅ with SO tgds can be done in polynomial time w.r.t. the size of the source instance and results in a universal solution [12].

Equivalence of schema mappings. Let M = (S, T, Σ) and M = (S, T, Σ ) be two schema mappings.

M and M are logically equivalent (denoted as M ≡ M ) if, for every source instance I and target instance J, the equivalence I, J |= Σ ⇔ I, J |= Σ holds. This is the case iff Sol (I, M) = Sol (I, M ) holds for every source instance I.

M and M are conjunctive-query equivalent (denoted as M ≡ CQ M ) if, for every source instance I, either Sol (I, M) = ∅ = Sol (I, M ) or core(I, M) = core(I, M ). Note that the original definition of M ≡ CQ M is that, for every source instance I, any conjunctive query posed against the target schema yields the same certain answers for the mappings M and M . The above characterization via the core, which is more convenient for our purposes here, was proved to be equivalent in [10].

Undecidability of Logical Equivalence

The undecidability of logical equivalence of two SO tgds follows easily from a previous undecidability result in the context of inverse schema mappings [3].

Theorem 1. Let τ and τ be two SO tgds. It is undecidable whether τ ≡ τ , i.e. that τ is logically equivalent to τ .

Proof. (Sketch). 1 In [8], a schema mapping M 2 is defined to be an inverse of another mapping M 1 if their composition M 1 • M 2 is logically equivalent to the identity mapping, i.e., a set of s-t tgds of the form (∀x) P (x) → P (x) for every source relation P . In [3, Corollary 9.5] the authors show that the following problem is undecidable: Given mappings M 1 and M 2 specified by s-t tgds, check whether M 2 is an inverse of M 1 .

In other words, checking if the composition M 1 • M 2 is equivalent to the identity mapping is undecidable. Clearly, a set of s-t tgds can be represented as an SO tgd. Let τ denote the SO tgd corresponding to the set of s-t tgds of the identity mapping. Likewise, M 1 • M 2 can be represented as an SO tgd, say τ . Then τ ≡ τ holds iff M 2 is an inverse of M 1 . Hence, the logical equivalence of two SO tgds is undecidable.

We now want to strengthen the above result by relating logical equivalence to CQequivalence. Below, we show that logical equivalence of two SO tgds remains undecidable even if the two SO tgds are already known to be CQ-equivalent. To this end, we adopt a different proof strategy which goes by reduction from the following variant of the Halting problem: Given a Turing machine TM, does TM halt when run on empty input? Given an arbitrary instance TM of the Halting problem, we have to construct two SO tgds τ and τ , s.t. the following equivalence holds: TM halts on empty input ⇔ τ ≡ τ . A major challenge of this reduction is that SO tgds do not directly provide us with recursion or with a means to enforce equalities.

Schemas. Let TM be a Turing machine (Q, A, δ, q 0 ) with set of states Q = {0, . . . , m}, tape alphabet A = {0, . . . , n}, transition function δ : Q × A → Q × A × {←, →}, and initial state q 0 . W.l.o.g., assume that q 0 = 0 and let 1 denote the halting state. Moreover, let 1 denote the tape starting symbol and let 0 denote the blank symbol .

We represent the state, tape and cursor of a Turing machine configuration at time t and tape location l by the function symbols state q (t), tape a (t, l), and cursor(t, l) (where q ∈ Q, a ∈ A). Each of these functions is intended to have Boolean range, e.g. tape a (t, l) = 1 indicates that at time t, the symbol a is stored on the tape at location l, and tape a (t, l) = 0 otherwise. We will ensure this Boolean range using our SO tgds.

As source schema, we use root(•) and chain(•, •), which is intended to describe a linear order starting at root(r) and continued by chain(•, •) atoms, i.e. source instance I should have the form I = {root(0), chain(0, 1), chain(1, 2), . . .}. Still, we must make sure through our SO tgds that what is described by root and chain does not deviate from a linear order in a way that adversely affects our simulation.

We define our target schema to consist of the relation symbol range(•), which is intended to describe the range of the functions we defined above. Furthermore, we use the target relation symbol halt to indicate a halting computation as well as to signify an error condition. That is, an intended target instance should have the form J = {range(0), range(1)} and possibly include halt().

Dependencies. Given an arbitrary Turing machine TM, we define the SO tgds τ and τ as a "big" conjunction consisting of several groups of conjuncts (i.e., implications). As we will see, it is possible to construct an SO tgd that simulates a computation of TM.

We start by describing initialization conjuncts which are used to encode the initial configuration of the TM on the empty input tape. Since we cannot enforce equalities on the right-hand side of implications, all the implications are formulated in such a way that taking the wrong value will lead to an error condition indicated by halt(). E.g., we want to enforce the initial state state 0 (r) = 1. Therefore, in the case that state 0 (r) = 0 holds, we enforce halt() in the target instance to indicate this error condition:

• root(r) ∧ state 0 (r) = 0 → halt() Note that this does not yet cover the case that state 0 (r) ∈ {0, 1}. We construct other implications which define the rest of the initial configuration analogously.

We now define transition conjuncts, which implement the transition function of TM. In what follows, we write dom(z) as a short-hand for a predicate that is satisfied by instantiating z to an arbitrary constant occurring in the source instance, i.e., dom(z) has to be replaced by root(z), chain(z, ), or chain( , z). Such a predicate is needed to fulfil the safety condition of SO tgds. We first define a common part of the following implications, matching the current state q and the current tape symbol a.

ϕ := state q (t) = 1 ∧ cursor(t, l) = 1 ∧ tape a (t, p) = 1 ∧ chain(t, t ) ∧ dom(l)

Suppose that the transition function δ contains the transition δ(q, a) = (q , a , ←). Then the SO tgds τ and τ contain the following conjuncts:

• ϕ ∧ state q (t ) = 0 → halt() • ϕ ∧ tape a (t , l) = 0 → halt() • ϕ ∧ chain(l , l) ∧ cursor(t , l ) = 0 → halt()

While the transition conjuncts define what changes, we now construct inertia conjuncts which describe the function values of what remains the same. In particular, we can construct (but for space reasons omit here) conjuncts, which define that the cursor is not located at positions that are (a) more than one move away from the previous cursor location, (b) at the same location, or (c) at the location opposite to the cursor movement.

Furthermore, we need to state that the tape symbol remains the same wherever the cursor is not located. That is, for each q ∈ Q, a ∈ A:

• state q (t) = 1 ∧ cursor(t, l) = 0 ∧ tape a (t, l) = 1 ∧ dom(l) ∧ chain(t, t ) ∧ tape a (t , l) = 0 → halt() Finally, we also have to construct uniqueness conjuncts, ensuring that at each time t, there is exactly one state q (t) and at each time t and location l exactly one tape a (t, l) whose function value is 1. Guarding conjuncts. It is clear that, since the domain is in general non-Boolean, we cannot prevent functions state q , tape a and cursor to take a range outside of {0, 1} for arbitrary interpretations. However, the structure of our reduction will show that we have to either, for a given interpretation of the function symbols, define the interpretation of the relation symbols, or for a given interpretation of the relation symbols, define the interpretation of the function symbols.

So if we define the functions, we can guarantee Boolean domain ourselves. If we define the relations, and in particular the target relations, we can use function guarding conjuncts to materialize the range in the target:

• dom(x) ∧ dom(y) ∧ state (x) = y → range(y) Analogous conjuncts are introduced for tape and cursor. This ensures the Boolean range of our functions, given appropriately constructed target relations (range J = {0, 1}). Now that we have sufficient control over the functions, we still need to gain some control over the source instance, since our simulation is intended to work on a linear order described by root and chain. We therefore define source guarding conjuncts. What we can directly enforce is that the chain relation includes no self-loops and never returns to any root using the following conjuncts:

• chain(x, x) → halt()

• root(r) ∧ chain( , r) → halt()

However, what we cannot directly exclude is that some element x of the chain has two distinct predecessors px = px with chain(px, x) and chain(px , x). In particular, we do not have equality on the right-hand side, which would allow us to require px = px . Nor can we use a derivation of halt() when there are two predecessors chain(px, x) and chain(px , x) because we cannot guarantee that px = px on the left-hand side.

But what we can do is control the effect of such multiple predecessors (or successors). Notably, if these multiple predecessors or successors actually have no effect on our simulation, then we have no problem with these discrepancies from linear order, since the simulation is doing the correct thing anyway. Altogether, we cannot detect all deviations from a linear order, e.g., two distinct root(r) atoms cannot be ruled out. But we can gain just enough control to ensure the correctness of our reduction.

Halting detection. We now construct the two SO tgds for our reduction, namely such that TM halts ⇔ τ ≡ τ . The first one, τ , simply consists of all the conjuncts we constructed up to this point, that is simulation conjuncts and guarding conjuncts. The function symbols tape , state and cursor are in the scope of an existential quantifier over the entire conjunction. We now define the halting detection conjunct:

• dom(t) ∧ state 1 (t) = 1 → halt() and construct τ to be the same as τ plus this halting detection conjunct. So overall, both τ and τ simulate the computation of Turing machine TM, but only τ indicates a halting state by enforcing halt().

Theorem 2. Let τ and τ be two SO tgds. It is undecidable whether τ ≡ τ (i.e. τ is logically equivalent to τ ), even when τ ≡ CQ τ (i.e. τ is CQ-equivalent to τ ).

Proof. (Sketch). We constructed τ as τ with an additional conjunct, therefore it is sufficient to show that TM halts ⇔ τ τ . We show both directions separately. The challenging part of the reduction is that, in one direction we have control over the functions, but not over the relations, and the converse for the other direction.

Assume that TM halts. Then we have to find an instance K = (I, J) such that K τ but K τ . Since TM halts, there is a computation of length n that ends in the halting state. We construct the instance K = (I, J) as

I = {root(1), chain(1, 2), . . . , chain(n − 1, n)} J = {range(0), range(1)

} Then K τ is clearly the case. What needs to be shown is that K τ . In particular, K τ means that for arbitrary f , we have (K, f ) ϕ . That is, we constructed K, but must deal with arbitrary f for ϕ . We proceed by case distinction, assuming v t,l,a is the correct value of tape a (t, l) according to the transition function of TM:

1. tape a (t, l) = v t,l,a for all t, l, a, that is, it correctly represents the TM computation 2. tape a (t, l) = ¬v t,l,a for some t, l, a, that is, the simulation violates TM 3. For some tape a (t, p) = y we have that y ∈ {0, 1}, i.e., non-Boolean range and in each case show K, f τ . This also holds for state and cursor. Assume that TM does not halt. Given an arbitrary instance K = (I, J), we have to show that K τ implies K τ . That is, we deal with arbitrary K, but only have to find a single f for ϕ . Assume that TM does not halt. Then let K = (I, J) be an arbitrary instance. We proceed by case distinction based on K:

1. J already contains an atom halt() 2. Source instance I deviates from a linear order significantly for our simulation 3. J contains some range(c) atom with c ∈ {0, 1}, i.e., allows non-Boolean range 4. All range(c) atoms in J have c ∈ {0, 1} and in each case show K, f τ . Finally, note that the SO tgds τ and τ in the above problem reduction are CQ-equivalent. The claim of the theorem thus follows.

Equalities are fundamental in SO tgds in order to define the composition of two schema mappings [12,Theorem 5.4]. Equalities also play a central role in our undecidability proof of logical equivalence for SO tgds in Theorem 2 above. However, if we consider CQ-equivalence rather than logical equivalence, then the equalities are not strictly necessary anymore and can be removed: In [2] it is proved that every SO tgd is CQ-equivalent to a plain SO tgd (a class of SO tgds which do not contain equalities). As a consequence the previous proof idea cannot be reused and we need a set of new proof techniques to show undecidability also for CQ-equivalence.

Undecidability of Conjunctive Query Equivalence

We now show the undecidability of CQ-equivalence of SO tgds by a reduction from the Domino Problem (DP), which was first shown undecidable by Berger [5]. We use the following, slightly modified version of DP [1, p.414]. Definition 1. A domino system (DS) is given by a set D of domino types defined as follows. Let C be a set of colours. Then, each domino type is a quadruple l, t, r, b of values from C, corresponding respectively to the left, top, right and bottom colour of a square domino piece. Let d.side denote an accessor function D → C returning the colour of d, corresponding to side. Then, the Domino Problem (DP) given by D asks if for each m, n ∈ N a consistent tiling of an m × n grid with the domino types from D exists. That is, if there is a function t m,n : {1 . . . m} × {1 . . . n} → D, such that the equality t m,n (i, j).right = t m,n (i + 1, j).left holds for 1 ≤ i < m and 1 ≤ j ≤ n and t m,n (k, l).bottom = t m,n (k, l + 1).top for 1 ≤ k ≤ m and 1 ≤ l < n.

Chains, proper instances, and coordinates. We model a grid using the source schema S consisting of two binary relations C h and C v intended to define a horizontal and vertical successor relation: For an instance I of S, C I h and C I v denote the relations of I. Their intended form is {(a 0 , a 1 ), (a 1 , a 2 ) . . . (a n−1 , a n )}, with a i = a j , for every

0 ≤ i < j ≤ n. We define a < b w.r.t. an order C, if (a, b) ∈ C or ∃x, C(x, b) ∧ a < x.

Taking the tuples of C as edges of a directed graph (which we call the dual graph), the order defined above corresponds to an acyclic chain. If each relation of an instance I over S contains a single acyclic chain, I is called proper. The size (m, n) of such a proper instance I is defined as

(|C h |, |C v |).

Domino tilings are modelled by the target relation D/4. Each of its four arguments is populated by a functional term c s (x, y) where c corresponds to the colour, s to the side of the domino piece, and the pair (x, y) defines the position in the grid. To save notation, we will omit parentheses when writing the terms in the relation D (e.g. the above term will be written as c s xy). Given a fact d ∈ D, we define the accessor function d.n, returning the term in the n th place of d, 1 ≤ n ≤ 4.

1 : C h (x, x 1 ) ∧ C h (x, x 2 ) → x 1 = x 2 2 : C h (x 1 , x) ∧ C h (x 2 , x) → x 1 = x 2 3 : C v (x, x 1 ) ∧ C v (x, x 2 ) → x 1 = x 2 4 : C v (x 1 , x) ∧ C v (x 2 , x) → x 1 = x 2

The SO tgd τ D has the form

∃c∃p h ∃p v ∃q h ∃q v (τ d1 ∧ . . . ∧ τ dn ∧ γ 1 ∧ γ 2 )

, where τ d1 . . . τ dn encode the corresponding domino types:

τ di : C h (x 1 , x 2 ) ∧ C v (y 1 , y 2 ) → D(l h x 1 y 2 , t v x 2 y 1 , r h x 2 y 2 , b v x 2 y 2 )

, l, t, r, b denoting respectively the left, top, right and bottom colours of d i . Superscripts distinguish the horizontal and vertical dimensions, so that there are at most two distinct functions in c for each colour in D. Moreover, c contains neither p h|v nor q h|v . The conjuncts γ 1 and γ 2 are called guards and have a form similar to τ di : 2. Dotted rectangles are facts in core(I, ΣD), corresponding to a consistent tiling Informally, given a proper source instance of size (m, n), the conjuncts τ d1 . . . τ d |D| create a stack of |D| domino tiles at each position in the m × n grid. Additionally, the guards γ h and γ v create two more tiles with unique colours p and q on each position (i, j) with i, j > 1, or a single tile if i = 1 or j = 1. The guards tackle source instances with cycles instead of the linear order, as will be explained later. 1.

γ h : C h (x 0 , x 1 ) ∧ C h (x 1 , x 2 ) ∧ C v (y 1 , y 2 ) → D(p h x 1 y 2 , p v x 2 y 1 , p h x 2 y 2 , p v x 2 y 2 ) γ v : C h (x 1 , x 2 ) ∧ C v (y 0 , y 1 ) ∧ C v (y 1 , y 2 ) → D(q h x 1 y 2 , q v x 2 y 1 , q h x 2 y 2 , q v x 2 y 2 ) D I = {C h (0, 1), C h (1, 2), C h (2, 3), Cv(0, 1), Cv(1, 2)} a b c d w h 0 1 w v 1 0 g h 1 1 g v 1 1 a w h 0 2 g v 1 1 w h 1 2 w v 1 2 b g h 1 1 g v 2 0 g h 2 1 w v 2 1 c w h 1 2 w v 2 1 g h 2 2 g v 2 2 a g h 2 1 w v 3 0 w h 3 1 w v 3 1 d g h 2 2 w v 3 1 w h 3 2 w v 3 2 d Fig.d 1 .3 = d 2 .1 ⇒ (x 1 , x 2 ) ∈ C I h ∧ y 1 = y 2 ∧ ∀i = 3 ∀j = 1 d 1 .i = d 2 .j. 2. d 1 .2 = d 2 .4 ⇒ (y 1 , y 2 ) ∈ C I v ∧ x 1 = x 2 ∧ ∀i = 4 ∀j = 2 d 1 .i = d 2 .j. 3. ∃i ∈ {1...4} d 1 .i = d 2 .i ⇒ x 1 = x 2 ∧ y 1 = y 2 . 4. d 1 .1 = d 2 .2 ∧ d 1 .3 = d 2 .4 ∧ d 1 .1 = d 2 .4.

Claim 1 stipulates, that whenever a term is shared between the 3 rd place of d 1 and the 1 st place of d 2 , the facts correspond to horizontally adjacent grid positions (i.e., have first coordinates i and i + 1 and the same second coordinate) and can have no further terms in common. Claim 2 is an analogue of Claim 1 for the equality d 1 .4 = d 2 .2 and vertical adjacency: d 1 and d 2 must have coordinates (i, j) and (i, j + 1). Claim 3 states that terms at respective positions can be only shared by facts with the same coordinates. Finally, Claim 4 restricts term equalities to the above cases.

a : C h (x 1 , x 2 )∧C v (y 1 , y 2 ) → D(w h x 1 y 2 , w v x 2 y 1 , g h x 2 y 2 , g v x 2 y 2 ).

Here, w and g stand for white and gray colours, respectively. The 3 × 2 grid is represented by a proper source instance I (see top of Figure 2). Six dotted rectangles represent a consistent tiling of this grid with D. Each rectangle bears the four terms of the corresponding D-fact in chase(I, Σ D ): E.g., the topmost leftmost tile (the one at position (1,1), that is) is encoded by the fact D(w h 01, w v 10, g h 11, g v 11). The SO tgd in Σ B has the following form:

τ B : ∃b h ∃b v ∃p h ∃p v ∃q h ∃q v C h (x 1 , x 2 ) ∧ C v (y 1 , y 2 ) → D(b h x 1 y 2 , b v x 2 y 1 , b h x 2 y 2 , b v x 2 y 2 ) ∧ γ h ∧ γ v .

It is easy to see that the conjuncts γ h and γ v are redundant in τ B . Hence, in the following we assume τ B to contain a single implicational conjunct and no guards. Moreover, the following relationship holds between the chase result of τ B and τ D : Property 2. Let D be a DS with simulation mapping Σ D , let (i 1 , j 1 ) and (i 2 , j 2 ) be two distinct grid positions, and let d 1 , d 2 ∈ chase(I, Σ D ) be facts at positions (i 1 , j 1 ) resp. (i 2 , j 2 ). Then also chase(I, Σ B ) contains two facts d 1 , d 2 at these grid positions. Moreover, for every

1 ≤ k, l ≤ 4, if d 1 .k = d 2 .l then d 1 .k = d 2 .

l, i.e., if two terms in d 1 and d 2 are equal, then the terms at these places in d 1 and d 2 are also This property uses an observation, that every implicational conjunct τ di encoding a domino type d i in the SO tgd of a simulation mapping produces exactly one D-fact for each grid position, and that τ B ∈ Σ B uses the minimal set of functional symbols.

For DPs that have a solution (of which B is obviously an example) the following lemma, which is crucial for our construction, holds: Lemma 1. Let I be a proper instance of size (m, n) and let D be a DS. Then, there is a consistent tiling of an m × n grid with D iff core(I, τ D ) ∼ = core(I, Σ B ) holds.

Example 3. It is easy to check, that the six D-facts in Figure 2 indeed form core(I, Σ D ). For I, Σ B yields a solution with the isomorphic core: each fact in core(I, Σ B ) can be obtained from some fact of core(I, Σ D ) by replacing leading symbols w and g with b.

To lift the equivalence between solvability of D and existence of an isomorphism core(I, Σ D ) ∼ = core(I, Σ B ) to unrestricted source instances I, we will first show, that whenever C I h or C I v contains a single cycle (that is, its dual graph is connected, each vertex having a single incoming and a single outgoing edge), then any simulation mapping yields an instance with the same core as chase(I, Σ B ). Proof (Sketch). W.l.o.g. assume that the single cycle is contained in C h . For such I, the guard γ h in the SO tgd τ D ∈ Σ D is indistinguishable from τ B in Σ B . Indeed, the conclusions of these implications coincide up to renaming of functional terms. The antecedent ϕ h of γ h has an extra atom C(x 0 , x 1 ), that prevents it from being satisfied by a C h -fact at the beginning of the chain. However, if C I h is a cycle, the relations q 1|2 defined as q 1 (x 1 , x 2 ) ↔ C h (x 1 , x 2 ) and q 2 (x 1 , x 2 ) ↔ ∃x 0 C h (x 0 , x 1 ) ∧ C h (x 1 , x 2 ), coincide, and so do τ B and γ h . Hence, J = chase(I, Σ D ) contains a subinstance J B isomorphic to chase(I, Σ B ). Using Property 2, J → J B can be shown.

Next, we address the source instances I containing multiple connected components in C I h|v . The connectedness is defined as follows: Given an instance K, the fact graph G K has the facts of K as vertices, connected by an undirected edge whenever corresponding facts have a common term. The facts f 1 , f 2 ∈ K are connected, if there is a path between the corresponding vertices in G K . The following property can be shown by induction on the length of the path between two D-facts: The above lemma implies, that if the source I contains l h and l v connected components in C I h and C I v respectively, then for any simulation mapping Σ, chase(I, Σ) can be decomposed into l h l v pairwise disconnected instances. Furthermore, if dual graphs of the connected components of C I h|v are acyclic chains or cycles, then by applying one of the Lemmas 1, 2 in each of the l h l v cases, the reduction established by these lemmas can be lifted to the the case when relations C I h|v contain multiple connected components. Finally, we observe that union of chains and cycles is the only type of dual graphs which is relevant. Indeed, if the dual graph of a source instance I contains a V-formed subgraph {C(x, y 1 ), C(x, y 2 )} or {C(y 1 , x), C(y 2 , x)} with y 1 = y 2 , then the key dependencies ensure that for any simulation mapping Σ (including Σ B ) the set Sol (I, Σ) is empty. This concludes our reduction from the domino problem to CQ-equivalence of SO tgds. We may thus formulate the main result of this section.

Theorem 3. The CQ-equivalence of mappings consisting of SO tgds and source key dependencies is undecidable.

For the simulation mappings Σ D and Σ B in the above problem reduction, it can be easily proved that Σ D ≡ CQ Σ B holds iff Σ D ≡ Σ B holds. We thus immediately get Theorem 4. The logical equivalence of mappings consisting of SO tgds without equalities in the antecedents and source key dependencies is undecidable.

Conclusion

Testing the equivalence of schema mappings is a fundamental task. For mappings defined by s-t tgds and target dependencies in the form of tgds and equality generating dependencies (egds) this problem has been well studied. In the general case, the (logical) equivalence of such mappings is undecidable. If the target tgds admit a finite chase, then the problem becomes decidable [4]. In [10], two relaxed notions of equivalence were introduced, namely the CQ-equivalence studied here and data exchange equivalence (DE-equivalence). Moreover, in [10], the undecidability of CQ-equivalence was proved for mappings consisting of s-t tgds and target tgds. The undecidability was extended in [17] in two directions, namely to mappings with target egds and from CQ-equivalence to DE-equivalence. The equivalence of SO tgds has been unexplored so far.

In this paper, we have proved the undecidability of logical equivalence. For the case of source schemas with key dependencies, we proved the undecidability of CQequivalence. The status of CQ-equivalence without any assumptions on the source schema has been left open. Closing this gap is on top of the agenda for future work. Moreover, the search for decidable fragments should be initiated. Finally, we also want to study the DE-equivalence [10] mentioned above. We conjecture that DE-equivalence of SO tgds is also undecidable. However, this has to be proved yet.

Fig. 1 .1Fig. 1. Mapping compositions.

Definition 2 .2Let D be a DS with types d 1 . . . d n . A simulation mapping Σ D for D over the source schema S = {C h (•, •), C v (•, •)} and the target schema T = {D(•, •, •, •)} consists of an SO tgd τ D and four source key dependencies:

Property 1 .1For a source instance I, a D-fact d enforced by the SO tgd in a simulation mapping (that is, d ∈ chase(I, Σ D )) has a form D(l h x p y, t v xy p , r h xy, b v xy), such that (x p , x) ∈ C I h , (y p , y) ∈ C I v . Definition 3. Let D be a DS and I a proper instance I of size (m, n). A fact d = D(l h x p y, t v xy p , r h xy, b v xy) ∈ chase(I, Σ D ) is associated with the domino type l, t, r, b , and with the pair (x, y) of I-values, occurring as arguments in its two last terms. Moreover, let i be the ordinal of x w.r.t. the order defined by C I h and j be the ordinal of y w.r.t. C I v . We call the pair (i, j) the coordinates of d defining its grid position. Let I be a proper source instance, and Σ D the simulation mapping of a DS D. Moreover, let d 1 and d 2 be two facts in chase(I, τ D ), d 1 associated with the pair of I-values (x 1 , y 1 ), and d 2 with (x 2 , y 2 ). Then, the following properties hold true:

Example 2 .2Consider a DS D containing the four types a, b, c, d in Figure 2. In the simulation mapping Σ D , the first type is modelled by the following implicational conjunct in the SO tgd τ

Problem reduction. Let D be a DS and Σ D its simulation mapping. Furthermore, consider a singleton colour set C b = {b} (where b stands for "black") and a corresponding domino set B = { b, b, b, b } with the simulation mapping Σ B . We claim that D has a solution iff Σ D ≡ CQ Σ B holds.

Lemma 2 .2Let I contain a single cycle in C h or in C v . Then, for arbitrary DS D, core(I, Σ D ) ∼ = core(I, Σ B ) holds.

Lemma 3 .3Let Σ D be a simulation mapping, I a source instance over S, and let d 1 , d 2 be facts in J = chase(I, Σ D ) containing respectively terms α(x 1 , y 1 ) and β(x 2 , y 2 ). Assume that x 1 is a term in a fact c h ∈ C I h , and x 2 in c h ∈ C I h . Likewise, let y 1 and y 2 occur, respectively, in the facts c v , c v ∈ C I v . Then, d 1 and d 2 are connected in J only if the facts c h and c h are connected in C I h , and so are c v and c v in C I v . We thank the anonymous referee of AMW who pointed out this proof.

This work was supported by the Vienna Science and Technology Fund (WWTF), project ICT08-032. Savenkov was supported by the Erasmus Mundus ECW Program of the EU.

The classical decision problem CAllauzen BDurand chapter "Appendix A: Tiling Problems EBörger EGrädel YuGurevich Springer 2001 Composition and inversion of schema mappings MArenas JPérez JLReutter CRiveros SIGMOD Record 38 3 2009 The recovery of a schema mapping: Bringing exchanged data back MArenas JPérez CRiveros ACM Trans. Database Syst 34 4 2009 A proof procedure for data dependencies CBeeri MYVardi J. ACM 31 4 1984 The undecidability of the domino problem RBerger Amer Mathematical Society 1966 Applying model management to classical meta data problems PABernstein Proc. CIDR'03 CIDR'03 2003 Implementing mapping composition PABernstein TJGreen SMelnik ANash VLDB J 17 2 2008 Inverting schema mappings RFagin Proc. PODS'06 PODS'06 ACM 2006 Data exchange: semantics and query answering RFagin PGKolaitis RJMiller LPopa Theor. Comput. Sci 336 1 2005 Towards a theory of schema-mapping optimization RFagin PGKolaitis ANash LPopa Proc. PODS'08 PODS'08 ACM 2008 Data exchange: getting to the core RFagin PGKolaitis LPopa ACM Trans. Database Syst 30 1 2005 Composing schema mappings: Second-order dependencies to the rescue RFagin PGKolaitis LPopa WCTan ACM Trans. Database Syst 30 4 2005 Data integration: The teenage years AYHalevy ARajaraman JJOrdille Proc. VLDB'06 VLDB'06 ACM 2006 Composing mappings among data sources JMadhavan AYHalevy Proc. VLDB'03 VLDB'03 2003 Generic Model Management: Concepts and Algorithms SMelnik Lecture Notes in Computer Science 2967. 2004 Springer Composition of mappings given by embedded dependencies ANash PABernstein SMelnik ACM Trans. Database Syst 32 1 4 2007 Relaxed notions of schema mapping equivalence revisited RPichler ESallinger VSavenkov Proc. ICDT'11 ICDT'11 ACM 2011