=Paper=
{{Paper
|id=None
|storemode=property
|title=On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies
|pdfUrl=https://ceur-ws.org/Vol-749/paper5.pdf
|volume=Vol-749
|dblpUrl=https://dblp.org/rec/conf/amw/FeinererPSS11
}}
==On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies==
On the Undecidability of the Equivalence of
Second-Order Tuple Generating Dependencies ?
Ingo Feinerer, Reinhard Pichler, Emanuel Sallinger, and Vadim Savenkov
Vienna University of Technology
{feinerer|pichler|sallinger|savenkov}@dbai.tuwien.ac.at
Abstract. Second-Order tuple generating dependencies (SO tgds) were intro-
duced by Fagin et al. to capture the composition of simple schema mappings.
Testing the equivalence of SO tgds would be important for applications like
model management and mapping optimization. However, we prove the undecid-
ability of the logical equivalence of SO tgds. Moreover, under weak additional
assumptions, we also show the undecidability of a relaxed notion of equivalence
between two SO tgds, namely the so-called conjunctive query equivalence.
1 Introduction
Schema mappings play an important role in several areas of database research, notably
in data integration [13], data exchange [9], peer data management [14], and model man-
agement [6]. A schema mapping is given by two schemas, called the source schema and
the target schema, as well as a set of dependencies describing the relationship between
the source and target schema. The most fundamental form of schema mappings are
mappings defined by a set of source-to-target tuple generating dependencies (s-t tgds):
They are First-Order formulae of the form ∀x(ϕ(x) → ∃yψ(x, y)), where the an-
tecedent ϕ(x) is a conjunctive query (CQ) over the source schema and the conclusion
ψ(x, y) is a CQ over the target schema. Intuitively, such an s-t tgd defines a constraint
that the presence of certain tuples in the source database I (namely those in the image
of some homomorphism h from ϕ(x) to I) enforce the presence of certain tuples in the
target database J (s.t. h can be extended to a homomorphism from ψ(x, y) to J).
Several algebraic operators [6, 15] on schema mappings have been intensively stud-
ied in recent time like computing inverses [8, 3, 2] and composing schema mappings [7,
12, 14, 16]. Our work is rather related to the composition operator. Fagin et al. proved
that s-t tgds are not powerful enough to express the composition of two mappings de-
fined by s-t tgds [12]. To remedy this defect, so-called Second-Order tuple generating
dependencies (SO tgds) were introduced in [12]. SO tgds extend s-t tgds by existentially
quantified function-variables and equalities of (possibly functional) terms. Details and
formal definitions are given in Section 2. It was shown in [12] that SO tgds capture
exactly the closure under composition of mappings defined by s-t tgds.
Example 1 ([12]). Consider the following three schemas. Let S1 consist of the unary
relation symbol Emp(·) of employees. Schema S2 consists of a single binary relation
?
This work was supported by the Vienna Science and Technology Fund (WWTF), project
ICT08-032. Savenkov was supported by the Erasmus Mundus ECW Program of the EU.
symbol Mgr0 (·, ·) that associates each employee with a manager. Schema S3 consists
of a similar binary relation symbol Mgr(·, ·) that is intended to provide a copy of Mgr0
and an additional unary relation symbol SelfMgr(·) to store employees who are their
own manager. Consider the mappings M12 = (S1 , S2 , Σ12 ) and M23 = (S2 , S3 , Σ23 )
with Σ12 = {∀e(Emp(e) → ∃m Mgr0 (e, m))} and Σ23 = {∀e, m(Mgr0 (e, m) →
Mgr(e, m)), ∀e(Mgr0 (e, e) → SelfMgr(e))}. We are looking for the composition of
M12 and M23 . It can be verified that this composition can be expressed by the SO tgd
σ = ∃f (∀e (Emp(e) → Mgr(e, f (e))) ∧ ∀e (Emp(e) ∧ (e = f (e)) → SelfMgr(e))).
In this paper, we want to study the equivalence of SO tgds. Note that the question of
equivalence naturally arises in several scenarios. Figure 1 illustrates a model evolution
scenario, where data structured under some schema S is first migrated to a database
with schema T and then further transformed to meet schema U . Now suppose that there
exists an alternative migration path from
σ
schema S via T 0 to schema U . The question
if the two migration paths yield the same re-
T sult comes down to checking if the dependen-
Σ12 Σ23 cies σ and σ 0 (which represent the respective
mapping compositions) are equivalent. Actu-
S U
ally, Figure 1 can also be thought of as illus-
Σ‘12 Σ‘23
trating a peer data management scenario, where
T‘ some peer with data structured according to S
provides part of its data to some other peer
σ‘ with schema T (resp. T 0 ). The latter peer in
turn passes this data on to yet another peer with
Fig. 1. Mapping compositions. schema U . Now suppose that a user may access
the data only at the peer with schema U . What
happens if some link in this peer data network is broken, say the one corresponding to
mapping Σ23 ? Will the path of mappings from S via T 0 to U still give the user full
access to the data provided by the peer with schema S? Testing the equivalence of σ
and σ 0 is thus crucial for answering questions of redundancy and reliability in a peer
data network.
The equivalence of mappings is also fundamental to mapping optimization. As men-
tioned in [10], optimizing a mapping ultimately means replacing the mapping by an
“equivalent” one with better (computational) properties. This raises the question of
how the “equivalence” of two mappings should be defined. Since dependencies are
logical formulae, the most natural notion of equivalence is logical equivalence. In this
paper, we show that logical equivalence of SO tgds is undecidable. In order to allow
for more flexibility in optimizing mappings, Fagin et al. introduced relaxed notions of
equivalence [10]. In particular, the potential of conjunctive query (CQ) equivalence for
optimizing several kinds of mappings was studied in [10]. Intuitively, two mappings
are CQ-equivalent, if conjunctive queries posed against the target database yield the
same result for both mappings (for details, see Section 2). For instance, it was shown
in [12] that the composition of the mappings M12 and M23 in Example 1 cannot be
represented by an SO tgd without equalities in the antecedent. On the other hand, σ
in Example 1 is CQ-equivalent to σ 0 = ∃f (∀e (Emp(e) → Mgr(e, f (e))). Under the
2
weak additional assumption that the source schema may have key dependencies, we
shall prove the undecidability of CQ-equivalence of SO tgds.
Organization of the paper and summary of results. In Section 2, we recall some basic
notions and results. A conclusion and an outlook to future work are given in Section 5.
The main results of the paper are detailed in Sections 3 and 4, namely:
• Logical equivalence. In Section 3 we prove that the logical equivalence of SO tgds
is undecidable. This result is easily obtained via a previous undecidability result in the
context of inverse schema mappings [3].
• Logical equivalence vs. CQ-equivalence. In Section 3, we also present a different
proof strategy for the undecidability of the logical equivalence of two SO tgds in order
to identify an important difference between logical equivalence and CQ-equivalence.
More precisely, we show that logical equivalence of two SO tgds remains undecidable
even if the two SO tgds are already known to be CQ-equivalent.
• CQ-equivalence. In Section 4 we prove the undecidability of CQ-equivalence of SO
tgds if the source schema may have key dependencies. The proof is by reduction from
the domino problem. As a by-product of this proof, we also get the undecidability of
logical equivalence of SO tgds without equalities.
Due to lack of space, proofs are sketched. Details will be provided in the full version.
2 Preliminaries
Schemas and instances. A schema R = {R1 , . . . , Rn } is a set of relation symbols
Ri each of a fixed arity. An instance I over a schema R consists of a relation for each
relation symbol in R, s.t. both have the same arity. For a relation symbol R, we write
RI to denote the relation of R in I. We only consider finite instances here.
Schema mappings. Let S = {S1 , . . . , Sn } and T = {T1 , . . . , Tm } be schemas with no
relation symbols in common. A schema mapping is given by a triple M = (S, T, Σ)
where S is the source schema, T is the target schema, and Σ is a set of logical formulae
called dependencies expressing the relationship between S and T.
Instances over S (resp. T) are called source (resp. target) instances. We write hS, Ti
to denote the schema {S1 , . . . , Sn , T1 , . . . , Tm }. If I is a source instance and J a target
instance, then hI, Ji is an instance of the schema hS, Ti.
Given a (ground) source instance I, a target instance J is called a solution for I un-
der M if hI, Ji |= Σ. The set of all solutions for I under M is denoted by Sol (I, M).
Dependencies. Source-to-target tuple generating dependencies (as already defined in
the introduction) are logical formulae of the form ∀x(ϕ(x) → ∃yψ(x, y)). We write
x for a tuple (x1 , . . . , xn ). However, by slight abuse of notation, we also refer to the set
{x1 , . . . , xn } as x. Hence, we may use expressions like xi ∈ x or x ⊆ X, etc.
A key dependency over schema R is of the form ∀x(R(u, y, v) ∧ R(u, z, w) →
y = z) where R denotes a relation symbol in R with u, v, w ⊆ x and y, z ∈ x.
A second-order tuple generating dependency (SO tgd) is a logical formula of the
form ∃f ((∀x1 (ϕ1 → ψ1 )) ∧ · · · ∧ (∀xn (ϕn → ψn ))) where (1) each member of
f is a function symbol, (2) each ϕi is a conjunction of atomic formulas of the form
3
S(y1 , . . . , yk ) (with S ∈ S and yj ∈ xi ), and equalities of the form t = t0 (with t and t0
terms based on xi and f ), (3) each ψi is a conjunction of atomic formulas of the form
T (t1 , . . . , t` ) (with T ∈ T where t1 , . . . , t` are terms based on xi and f ), and (4) each
variable in xi appears in some atomic formula of ϕi .
When dealing with instances hI, Ji in the context of SO tgds the domain of source
instances I is allowed to consist only of constants (i.e., grounded) whereas target in-
stances J may contain functional terms which can be treated like fresh variables, also
called labelled nulls. [12]
Homomorphisms. Let I, I 0 be instances. A homomorphism h: I → I 0 is a mapping s.t.
(1) whenever R(x) ∈ I, then R(h(x)) ∈ I 0 , and (2) for every constant c, h(c) = c.
If such h exists, we write I → I 0 . Moreover, if I ↔ I 0 then we say that I and I 0 are
homomorphically equivalent.
If a homomorphism h: I → I 0 is invertible, s.t. h−1 is also a homomorphism from
I to I, then h is called an isomorphism, denoted I ∼
0
= I 0 . An endomorphism is a homo-
morphism I → I. An endomorphism is proper if it is not surjective (for finite instances,
this is equivalent to being not injective), i.e., if it reduces the domain of I.
If I is an instance, and I 0 ⊆ I is such that I → I 0 holds but for no other I 00 ⊂
I : I → I 00 (that is, I 0 cannot be further “shrunk” by a proper endomorphism), then I 0
0
is called a core of I. The core is unique up to isomorphism. Hence, we may speak about
the core of I. Cores have the following important property: for arbitrary instances J
and J 0 , J ↔ J 0 iff core(J) ∼ = core(J 0 ).
If J ∈ Sol (I, M) is such that J → J 0 holds for any other solution J 0 ∈ Sol (I, M),
then J is called a universal solution. Since the universal solutions for a source instance
I are homomorphically equivalent, their core is unique up to isomorphism [11]. We
write UnivSol (I, M) to denote the set of universal solutions for I under mapping M
and we write core(I, M) to denote the core of the universal solutions.
A substitution λ is a mapping which sends variables to other domain elements (i.e.,
variables or constants). We write λ = {x1 ← a1 , . . . , xn ← an } if λ maps each xi to ai
and λ is the identity outside {x1 , . . . , xn }. The application of a substitution is usually
denoted in postfix notation, e.g.: xλ denotes the image of x under λ. For an expression
ϕ(x) (e.g., a conjunctive query with variables in x ), we write ϕ(xλ) to denote the
result of replacing every occurrence of every variable x ∈ x by xλ.
Chase. A target instance for a given source instance and a schema mapping can be
computed by the chase [4]. For SO tgds, the original chase procedure of [4] has to be
modified since homomorphisms now have to take the equalities in the antecedents into
account. Formally, a mapping h from a conjunct Ci = ∀x(ϕi → ψi ) of an SO tgd to
an instance I is a homomorphism if for every relational atom S(y1 , . . . , yk ) in ϕi the
tuple (h(y1 ), . . . , h(yk )) is in S I and for every equality t = t0 we have h(t) = h(t0 ).
Let hI, Ji be an instance of a schema hS, Ti and let Ci = ∀x(ϕi → ψi ) be an SO
conjunct. If there is a homomorphism h of ϕi into the instance I, then the conjunct Ci
can be applied to hI, J1 i with homomorphism h. I.e., for every target atom T (t1 , . . . , t` )
of ψi the tuple (h(t1 ), . . . , h(t` )) is added to the T relation and the union with J1
is taken yielding a new instance J2 . The instance hI, J2 i is now called the result of
applying Ci to hI, J1 i with h. The actual application is called a chase step.
4
A chase sequence is a finite sequence of chase steps. The chase of hI, ∅i with an SO
tgd Σ is a chase sequence until no conjuncts can be applied anymore. The result of this
chase is denoted by chase(I, Σ). Chasing hI, ∅i with SO tgds can be done in polyno-
mial time w.r.t. the size of the source instance and results in a universal solution [12].
Equivalence of schema mappings. Let M = (S, T, Σ) and M0 = (S, T, Σ 0 ) be two
schema mappings.
M and M0 are logically equivalent (denoted as M ≡ M0 ) if, for every source
instance I and target instance J, the equivalence hI, Ji |= Σ ⇔ hI, Ji |= Σ 0 holds.
This is the case iff Sol (I, M) = Sol (I, M0 ) holds for every source instance I.
M and M0 are conjunctive-query equivalent (denoted as M ≡CQ M0 ) if, for every
source instance I, either Sol (I, M) = ∅ = Sol (I, M0 ) or core(I, M) = core(I, M0 ).
Note that the original definition of M ≡CQ M0 is that, for every source instance I,
any conjunctive query posed against the target schema yields the same certain answers
for the mappings M and M0 . The above characterization via the core, which is more
convenient for our purposes here, was proved to be equivalent in [10].
3 Undecidability of Logical Equivalence
The undecidability of logical equivalence of two SO tgds follows easily from a previous
undecidability result in the context of inverse schema mappings [3].
Theorem 1. Let τ and τ 0 be two SO tgds. It is undecidable whether τ ≡ τ 0 , i.e. that τ
is logically equivalent to τ 0 .
Proof. (Sketch).1 In [8], a schema mapping M2 is defined to be an inverse of another
mapping M1 if their composition M1 ◦ M2 is logically equivalent
to the identity map-
ping, i.e., a set of s-t tgds of the form (∀x) P (x) → P̂ (x) for every source relation P .
In [3, Corollary 9.5] the authors show that the following problem is undecidable: Given
mappings M1 and M2 specified by s-t tgds, check whether M2 is an inverse of M1 .
In other words, checking if the composition M1 ◦ M2 is equivalent to the identity
mapping is undecidable. Clearly, a set of s-t tgds can be represented as an SO tgd. Let τ
denote the SO tgd corresponding to the set of s-t tgds of the identity mapping. Likewise,
M1 ◦ M2 can be represented as an SO tgd, say τ 0 . Then τ ≡ τ 0 holds iff M2 is an
inverse of M1 . Hence, the logical equivalence of two SO tgds is undecidable.
We now want to strengthen the above result by relating logical equivalence to CQ-
equivalence. Below, we show that logical equivalence of two SO tgds remains undecid-
able even if the two SO tgds are already known to be CQ-equivalent. To this end, we
adopt a different proof strategy which goes by reduction from the following variant of
the Halting problem: Given a Turing machine TM, does TM halt when run on empty
input? Given an arbitrary instance TM of the Halting problem, we have to construct
two SO tgds τ and τ 0 , s.t. the following equivalence holds: TM halts on empty input ⇔
τ 6≡ τ 0 . A major challenge of this reduction is that SO tgds do not directly provide us
with recursion or with a means to enforce equalities.
1
We thank the anonymous referee of AMW 2011 who pointed out this proof.
5
Schemas. Let TM be a Turing machine (Q, A, δ, q0 ) with set of states Q = {0, . . . , m},
tape alphabet A = {0, . . . , n}, transition function δ : Q × A → Q × A × {←, →}, and
initial state q0 . W.l.o.g., assume that q0 = 0 and let 1 denote the halting state. Moreover,
let 1 denote the tape starting symbol . and let 0 denote the blank symbol t.
We represent the state, tape and cursor of a Turing machine configuration at time
t and tape location l by the function symbols stateq (t), tapea (t, l), and cursor(t, l)
(where q ∈ Q, a ∈ A). Each of these functions is intended to have Boolean range, e.g.
tapea (t, l) = 1 indicates that at time t, the symbol a is stored on the tape at location l,
and tapea (t, l) = 0 otherwise. We will ensure this Boolean range using our SO tgds.
As source schema, we use root(·) and chain(·, ·), which is intended to describe a
linear order starting at root(r) and continued by chain(·, ·) atoms, i.e. source instance I
should have the form I = {root(0), chain(0, 1), chain(1, 2), . . .}. Still, we must make
sure through our SO tgds that what is described by root and chain does not deviate from
a linear order in a way that adversely affects our simulation.
We define our target schema to consist of the relation symbol range(·), which
is intended to describe the range of the functions we defined above. Furthermore, we
use the target relation symbol halt to indicate a halting computation as well as to
signify an error condition. That is, an intended target instance should have the form
J = {range(0), range(1)} and possibly include halt().
Dependencies. Given an arbitrary Turing machine TM, we define the SO tgds τ and τ 0
as a “big” conjunction consisting of several groups of conjuncts (i.e., implications). As
we will see, it is possible to construct an SO tgd that simulates a computation of TM.
We start by describing initialization conjuncts which are used to encode the initial
configuration of the TM on the empty input tape. Since we cannot enforce equalities on
the right-hand side of implications, all the implications are formulated in such a way
that taking the wrong value will lead to an error condition indicated by halt(). E.g., we
want to enforce the initial state state0 (r) = 1. Therefore, in the case that state0 (r) = 0
holds, we enforce halt() in the target instance to indicate this error condition:
• root(r) ∧ state0 (r) = 0 → halt()
Note that this does not yet cover the case that state0 (r) 6∈ {0, 1}. We construct other
implications which define the rest of the initial configuration analogously.
We now define transition conjuncts, which implement the transition function of
TM. In what follows, we write dom(z) as a short-hand for a predicate that is satisfied
by instantiating z to an arbitrary constant occurring in the source instance, i.e., dom(z)
has to be replaced by root(z), chain(z, ), or chain( , z). Such a predicate is needed to
fulfil the safety condition of SO tgds. We first define a common part of the following
implications, matching the current state q and the current tape symbol a.
ϕ := stateq (t) = 1 ∧ cursor(t, l) = 1 ∧ tapea (t, p) = 1 ∧ chain(t, t0 ) ∧ dom(l)
Suppose that the transition function δ contains the transition δ(q, a) = (q 0 , a0 , ←). Then
the SO tgds τ and τ 0 contain the following conjuncts:
• ϕ ∧ stateq0 (t0 ) = 0 → halt()
• ϕ ∧ tapea0 (t0 , l) = 0 → halt()
• ϕ ∧ chain(l0 , l) ∧ cursor(t0 , l0 ) = 0 → halt()
6
While the transition conjuncts define what changes, we now construct inertia con-
juncts which describe the function values of what remains the same. In particular, we
can construct (but for space reasons omit here) conjuncts, which define that the cursor is
not located at positions that are (a) more than one move away from the previous cursor
location, (b) at the same location, or (c) at the location opposite to the cursor movement.
Furthermore, we need to state that the tape symbol remains the same wherever the
cursor is not located. That is, for each q ∈ Q, a ∈ A:
• stateq (t) = 1 ∧ cursor(t, l) = 0 ∧ tapea (t, l) = 1 ∧ dom(l) ∧
chain(t, t0 ) ∧ tapea (t0 , l) = 0 → halt()
Finally, we also have to construct uniqueness conjuncts, ensuring that at each time t,
there is exactly one stateq (t) and at each time t and location l exactly one tapea (t, l)
whose function value is 1.
Guarding conjuncts. It is clear that, since the domain is in general non-Boolean, we
cannot prevent functions stateq , tapea and cursor to take a range outside of {0, 1} for
arbitrary interpretations. However, the structure of our reduction will show that we have
to either, for a given interpretation of the function symbols, define the interpretation of
the relation symbols, or for a given interpretation of the relation symbols, define the
interpretation of the function symbols.
So if we define the functions, we can guarantee Boolean domain ourselves. If we
define the relations, and in particular the target relations, we can use function guarding
conjuncts to materialize the range in the target:
• dom(x) ∧ dom(y) ∧ state (x) = y → range(y)
Analogous conjuncts are introduced for tape and cursor. This ensures the Boolean range
of our functions, given appropriately constructed target relations (rangeJ = {0, 1}).
Now that we have sufficient control over the functions, we still need to gain some
control over the source instance, since our simulation is intended to work on a linear
order described by root and chain. We therefore define source guarding conjuncts.
What we can directly enforce is that the chain relation includes no self-loops and never
returns to any root using the following conjuncts:
• chain(x, x) → halt()
• root(r) ∧ chain( , r) → halt()
However, what we cannot directly exclude is that some element x of the chain has
two distinct predecessors px 6= px0 with chain(px, x) and chain(px0 , x). In particular,
we do not have equality on the right-hand side, which would allow us to require px =
px0 . Nor can we use a derivation of halt() when there are two predecessors chain(px, x)
and chain(px0 , x) because we cannot guarantee that px 6= px0 on the left-hand side.
But what we can do is control the effect of such multiple predecessors (or succes-
sors). Notably, if these multiple predecessors or successors actually have no effect on
our simulation, then we have no problem with these discrepancies from linear order,
since the simulation is doing the correct thing anyway. Altogether, we cannot detect all
deviations from a linear order, e.g., two distinct root(r) atoms cannot be ruled out. But
we can gain just enough control to ensure the correctness of our reduction.
Halting detection. We now construct the two SO tgds for our reduction, namely such
that TM halts ⇔ τ 6≡ τ 0 . The first one, τ , simply consists of all the conjuncts we
7
constructed up to this point, that is simulation conjuncts and guarding conjuncts. The
function symbols tape , state and cursor are in the scope of an existential quantifier
over the entire conjunction. We now define the halting detection conjunct:
• dom(t) ∧ state1 (t) = 1 → halt()
and construct τ 0 to be the same as τ plus this halting detection conjunct. So overall,
both τ and τ 0 simulate the computation of Turing machine TM, but only τ 0 indicates a
halting state by enforcing halt().
Theorem 2. Let τ and τ 0 be two SO tgds. It is undecidable whether τ ≡ τ 0 (i.e. τ is
logically equivalent to τ 0 ), even when τ ≡CQ τ 0 (i.e. τ is CQ-equivalent to τ 0 ).
Proof. (Sketch). We constructed τ 0 as τ with an additional conjunct, therefore it is suf-
ficient to show that TM halts ⇔ τ 6 τ 0 . We show both directions separately. The chal-
lenging part of the reduction is that, in one direction we have control over the functions,
but not over the relations, and the converse for the other direction.
Assume that TM halts. Then we have to find an instance K = (I, J) such that K τ
but K 6 τ 0 . Since TM halts, there is a computation of length n that ends in the halting
state. We construct the instance K = (I, J) as
I = {root(1), chain(1, 2), . . . , chain(n − 1, n)} J = {range(0), range(1)}
Then K τ is clearly the case. What needs to be shown is that K 6 τ 0 . In particular,
K 6 τ 0 means that for arbitrary f , we have (K, f ) 6 ϕ0 . That is, we constructed K,
but must deal with arbitrary f for ϕ0 . We proceed by case distinction, assuming vt,l,a is
the correct value of tapea (t, l) according to the transition function of TM:
1. tapea (t, l) = vt,l,a for all t, l, a, that is, it correctly represents the TM computation
2. tapea (t, l) = ¬vt,l,a for some t, l, a, that is, the simulation violates TM
3. For some tapea (t, p) = y we have that y 6∈ {0, 1}, i.e., non-Boolean range
and in each case show K, f 6 τ 0 . This also holds for state and cursor.
Assume that TM does not halt. Given an arbitrary instance K = (I, J), we have to
show that K τ implies K τ 0 . That is, we deal with arbitrary K, but only have
to find a single f for ϕ0 . Assume that TM does not halt. Then let K = (I, J) be an
arbitrary instance. We proceed by case distinction based on K:
1. J already contains an atom halt()
2. Source instance I deviates from a linear order significantly for our simulation
3. J contains some range(c) atom with c 6∈ {0, 1}, i.e., allows non-Boolean range
4. All range(c) atoms in J have c ∈ {0, 1}
and in each case show K, f τ 0 .
Finally, note that the SO tgds τ and τ 0 in the above problem reduction are CQ-equivalent.
The claim of the theorem thus follows.
Equalities are fundamental in SO tgds in order to define the composition of two
schema mappings [12, Theorem 5.4]. Equalities also play a central role in our unde-
cidability proof of logical equivalence for SO tgds in Theorem 2 above. However, if
we consider CQ-equivalence rather than logical equivalence, then the equalities are not
strictly necessary anymore and can be removed: In [2] it is proved that every SO tgd is
CQ-equivalent to a plain SO tgd (a class of SO tgds which do not contain equalities).
As a consequence the previous proof idea cannot be reused and we need a set of new
proof techniques to show undecidability also for CQ-equivalence.
8
4 Undecidability of Conjunctive Query Equivalence
We now show the undecidability of CQ-equivalence of SO tgds by a reduction from the
Domino Problem (DP), which was first shown undecidable by Berger [5]. We use the
following, slightly modified version of DP [1, p.414].
Definition 1. A domino system (DS) is given by a set D of domino types defined as
follows. Let C be a set of colours. Then, each domino type is a quadruple hl, t, r, bi of
values from C, corresponding respectively to the left, top, right and bottom colour of
a square domino piece. Let d.side denote an accessor function D → C returning the
colour of d, corresponding to side. Then, the Domino Problem (DP) given by D asks
if for each m, n ∈ N a consistent tiling of an m × n grid with the domino types from
D exists. That is, if there is a function tm,n : {1 . . . m} × {1 . . . n} → D, such that the
equality tm,n (i, j).right = tm,n (i + 1, j).left holds for 1 ≤ i < m and 1 ≤ j ≤ n
and tm,n (k, l).bottom = tm,n (k, l + 1).top for 1 ≤ k ≤ m and 1 ≤ l < n.
Chains, proper instances, and coordinates. We model a grid using the source schema
S consisting of two binary relations Ch and Cv intended to define a horizontal and
vertical successor relation: For an instance I of S, ChI and CvI denote the relations of
I. Their intended form is {(a0 , a1 ), (a1 , a2 ) . . . (an−1 , an )}, with ai 6= aj , for every
0 ≤ i < j ≤ n. We define a < b w.r.t. an order C, if (a, b) ∈ C or ∃x, C(x, b) ∧ a < x.
Taking the tuples of C as edges of a directed graph (which we call the dual graph),
the order defined above corresponds to an acyclic chain. If each relation of an instance
I over S contains a single acyclic chain, I is called proper. The size (m, n) of such a
proper instance I is defined as (|Ch |, |Cv |).
Domino tilings are modelled by the target relation D/4. Each of its four arguments
is populated by a functional term cs (x, y) where c corresponds to the colour, s to the
side of the domino piece, and the pair (x, y) defines the position in the grid. To save
notation, we will omit parentheses when writing the terms in the relation D (e.g. the
above term will be written as cs xy). Given a fact d ∈ D, we define the accessor function
d.n, returning the term in the nth place of d, 1 ≤ n ≤ 4.
Definition 2. Let D be a DS with types d1 . . . dn . A simulation mapping ΣD for D over
the source schema S = {Ch (·, ·), Cv (·, ·)} and the target schema T = {D(·, ·, ·, ·)}
consists of an SO tgd τD and four source key dependencies:
1 : Ch (x, x1 ) ∧ Ch (x, x2 ) → x1 = x2 2 : Ch (x1 , x) ∧ Ch (x2 , x) → x1 = x2
3 : Cv (x, x1 ) ∧ Cv (x, x2 ) → x1 = x2 4 : Cv (x1 , x) ∧ Cv (x2 , x) → x1 = x2
The SO tgd τD has the form ∃c∃ph ∃pv ∃q h ∃q v (τd1 ∧ . . . ∧ τdn ∧ γ1 ∧ γ2 ), where
τd1 . . . τdn encode the corresponding domino types:
τdi : Ch (x1 , x2 ) ∧ Cv (y1 , y2 ) → D(lh x1 y2 , tv x2 y1 , rh x2 y2 , bv x2 y2 ),
l, t, r, b denoting respectively the left, top, right and bottom colours of di . Superscripts
distinguish the horizontal and vertical dimensions, so that there are at most two distinct
functions in c for each colour in D. Moreover, c contains neither ph|v nor q h|v .
The conjuncts γ1 and γ2 are called guards and have a form similar to τdi :
γh : Ch (x0 , x1 ) ∧ Ch (x1 , x2 ) ∧ Cv (y1 , y2 ) → D(ph x1 y2 , pv x2 y1 , ph x2 y2 , pv x2 y2 )
γv : Ch (x1 , x2 ) ∧ Cv (y0 , y1 ) ∧ Cv (y1 , y2 ) → D(q h x1 y2 , q v x2 y1 , q h x2 y2 , q v x2 y2 )
9
D I = {Ch (0, 1), Ch (1, 2), Ch (2, 3), Cv (0, 1), Cv (1, 2)}
wv 1 0 gv 2 0 wv 3 0
a
wh 0 1 a gh 1 1 gh 1 1 c gh 2 1 gh 2 1 d wh 3 1
b
gv 1 1 wv 2 1 wv 3 1
v
c g 11 wv 2 1 wv 3 1
wh 0 2 b wh 1 2 wh 1 2 a gh 2 2 gh 2 2 d wh 3 2
d
wv 1 2 gv 2 2 wv 3 2
Fig. 2. Dotted rectangles are facts in core(I, ΣD ), corresponding to a consistent tiling
Informally, given a proper source instance of size (m, n), the conjuncts τd1 . . . τd|D|
create a stack of |D| domino tiles at each position in the m × n grid. Additionally, the
guards γh and γv create two more tiles with unique colours p and q on each position
(i, j) with i, j > 1, or a single tile if i = 1 or j = 1. The guards tackle source instances
with cycles instead of the linear order, as will be explained later.
For a source instance I, a D-fact d enforced by the SO tgd in a simulation map-
ping (that is, d ∈ chase(I, ΣD )) has a form D(lh xp y, tv xyp , rh xy, bv xy), such that
(xp , x) ∈ ChI , (yp , y) ∈ CvI .
Definition 3. Let D be a DS and I a proper instance I of size (m, n). A fact d =
D(lh xp y, tv xyp , rh xy, bv xy) ∈ chase(I, ΣD ) is associated with the domino type
hl, t, r, bi, and with the pair (x, y) of I-values, occurring as arguments in its two last
terms. Moreover, let i be the ordinal of x w.r.t. the order defined by ChI and j be the or-
dinal of y w.r.t. CvI . We call the pair (i, j) the coordinates of d defining its grid position.
Property 1. Let I be a proper source instance, and ΣD the simulation mapping of a DS
D. Moreover, let d1 and d2 be two facts in chase(I, τD ), d1 associated with the pair of
I-values (x1 , y1 ), and d2 with (x2 , y2 ). Then, the following properties hold true:
1. d1 .3 = d2 .1 ⇒ (x1 , x2 ) ∈ ChI ∧ y1 = y2 ∧ ∀i 6= 3 ∀j 6= 1 d1 .i 6= d2 .j.
2. d1 .2 = d2 .4 ⇒ (y1 , y2 ) ∈ CvI ∧ x1 = x2 ∧ ∀i 6= 4 ∀j 6= 2 d1 .i 6= d2 .j.
3. ∃i ∈ {1...4} d1 .i = d2 .i ⇒ x1 = x2 ∧ y1 = y2 .
4. d1 .1 6= d2 .2 ∧ d1 .3 6= d2 .4 ∧ d1 .1 6= d2 .4.
Claim 1 stipulates, that whenever a term is shared between the 3rd place of d1 and
the 1st place of d2 , the facts correspond to horizontally adjacent grid positions (i.e., have
first coordinates i and i + 1 and the same second coordinate) and can have no further
terms in common. Claim 2 is an analogue of Claim 1 for the equality d1 .4 = d2 .2 and
vertical adjacency: d1 and d2 must have coordinates (i, j) and (i, j + 1). Claim 3 states
that terms at respective positions can be only shared by facts with the same coordinates.
Finally, Claim 4 restricts term equalities to the above cases.
Example 2. Consider a DS D containing the four types a, b, c, d in Figure 2. In the sim-
ulation mapping ΣD , the first type is modelled by the following implicational conjunct
in the SO tgd τa : Ch (x1 , x2 )∧Cv (y1 , y2 ) → D(wh x1 y2 , wv x2 y1 , g h x2 y2 , g v x2 y2 ).
Here, w and g stand for white and gray colours, respectively.
10
The 3 × 2 grid is represented by a proper source instance I (see top of Figure 2). Six
dotted rectangles represent a consistent tiling of this grid with D. Each rectangle bears
the four terms of the corresponding D-fact in chase(I, ΣD ): E.g., the topmost leftmost
tile (the one at position (1,1), that is) is encoded by the fact D(wh 01, wv 10, g h 11, g v 11).
Problem reduction. Let D be a DS and ΣD its simulation mapping. Furthermore, con-
sider a singleton colour set Cb = {b} (where b stands for “black”) and a corresponding
domino set B = {hb, b, b, bi} with the simulation mapping ΣB . We claim that D has a
solution iff ΣD ≡CQ ΣB holds.
The SO tgd in ΣB has the following form: τB : ∃bh ∃bv∃ph ∃pv ∃q h∃q v
Ch (x1 , x2 ) ∧ Cv (y1 , y2 ) → D(bh x1 y2 , bv x2 y1 , bh x2 y2 , bv x2 y2 ) ∧ γh ∧ γv .
It is easy to see that the conjuncts γh and γv are redundant in τB . Hence, in the follow-
ing we assume τB to contain a single implicational conjunct and no guards. Moreover,
the following relationship holds between the chase result of τB and τD :
Property 2. Let D be a DS with simulation mapping ΣD , let (i1 , j1 ) and (i2 , j2 ) be
two distinct grid positions, and let d1 , d2 ∈ chase(I, ΣD ) be facts at positions (i1 , j1 )
resp. (i2 , j2 ). Then also chase(I, ΣB ) contains two facts d01 , d02 at these grid positions.
Moreover, for every 1 ≤ k, l ≤ 4, if d1 .k = d2 .l then d01 .k = d02 .l, i.e., if two terms in
d1 and d2 are equal, then the terms at these places in d01 and d02 are also equal.
This property uses an observation, that every implicational conjunct τdi encoding a
domino type di in the SO tgd of a simulation mapping produces exactly one D-fact for
each grid position, and that τB ∈ ΣB uses the minimal set of functional symbols.
For DPs that have a solution (of which B is obviously an example) the following
lemma, which is crucial for our construction, holds:
Lemma 1. Let I be a proper instance of size (m, n) and let D be a DS. Then, there is
a consistent tiling of an m × n grid with D iff core(I, τD ) ∼
= core(I, ΣB ) holds.
Example 3. It is easy to check, that the six D-facts in Figure 2 indeed form core(I, ΣD ).
For I, ΣB yields a solution with the isomorphic core: each fact in core(I, ΣB ) can be
obtained from some fact of core(I, ΣD ) by replacing leading symbols w and g with b.
To lift the equivalence between solvability of D and existence of an isomorphism
core(I, ΣD ) ∼ = core(I, ΣB ) to unrestricted source instances I, we will first show, that
whenever ChI or CvI contains a single cycle (that is, its dual graph is connected, each ver-
tex having a single incoming and a single outgoing edge), then any simulation mapping
yields an instance with the same core as chase(I, ΣB ).
Lemma 2. Let I contain a single cycle in Ch or in Cv . Then, for arbitrary DS D,
core(I, ΣD ) ∼
= core(I, ΣB ) holds.
Proof (Sketch). W.l.o.g. assume that the single cycle is contained in Ch . For such I,
the guard γh in the SO tgd τD ∈ ΣD is indistinguishable from τB in ΣB . Indeed,
the conclusions of these implications coincide up to renaming of functional terms. The
antecedent ϕh of γh has an extra atom C(x0 , x1 ), that prevents it from being satisfied
by a Ch -fact at the beginning of the chain. However, if ChI is a cycle, the relations q1|2
11
defined as q1 (x1 , x2 ) ↔ Ch (x1 , x2 ) and q2 (x1 , x2 ) ↔ ∃x0 Ch (x0 , x1 ) ∧ Ch (x1 , x2 ),
coincide, and so do τB and γh . Hence, J = chase(I, ΣD ) contains a subinstance JB
isomorphic to chase(I, ΣB ). Using Property 2, J → JB can be shown.
Next, we address the source instances I containing multiple connected components
I
in Ch|v . The connectedness is defined as follows: Given an instance K, the fact graph
GK has the facts of K as vertices, connected by an undirected edge whenever corre-
sponding facts have a common term. The facts f1 , f2 ∈ K are connected, if there is a
path between the corresponding vertices in GK . The following property can be shown
by induction on the length of the path between two D-facts:
Lemma 3. Let ΣD be a simulation mapping, I a source instance over S, and let d1 , d2
be facts in J = chase(I, ΣD ) containing respectively terms α(x1 , y1 ) and β(x2 , y2 ).
Assume that x1 is a term in a fact c0h ∈ ChI , and x2 in c00h ∈ ChI . Likewise, let y1 and y2
occur, respectively, in the facts c0v , c00v ∈ CvI . Then, d1 and d2 are connected in J only if
the facts c0h and c00h are connected in ChI , and so are c0v and c00v in CvI .
The above lemma implies, that if the source I contains lh and lv connected components
in ChI and CvI respectively, then for any simulation mapping Σ, chase(I, Σ) can be
decomposed into lh lv pairwise disconnected instances. Furthermore, if dual graphs of
I
the connected components of Ch|v are acyclic chains or cycles, then by applying one of
the Lemmas 1, 2 in each of the lh lv cases, the reduction established by these lemmas can
I
be lifted to the the case when relations Ch|v contain multiple connected components.
Finally, we observe that union of chains and cycles is the only type of dual graphs
which is relevant. Indeed, if the dual graph of a source instance I contains a V-formed
subgraph {C(x, y1 ), C(x, y2 )} or {C(y1 , x), C(y2 , x)} with y1 6= y2 , then the key de-
pendencies ensure that for any simulation mapping Σ (including ΣB ) the set Sol (I, Σ)
is empty. This concludes our reduction from the domino problem to CQ-equivalence of
SO tgds. We may thus formulate the main result of this section.
Theorem 3. The CQ-equivalence of mappings consisting of SO tgds and source key
dependencies is undecidable.
For the simulation mappings ΣD and ΣB in the above problem reduction, it can be
easily proved that ΣD ≡CQ ΣB holds iff ΣD ≡ ΣB holds. We thus immediately get
Theorem 4. The logical equivalence of mappings consisting of SO tgds without equal-
ities in the antecedents and source key dependencies is undecidable.
5 Conclusion
Testing the equivalence of schema mappings is a fundamental task. For mappings de-
fined by s-t tgds and target dependencies in the form of tgds and equality generating de-
pendencies (egds) this problem has been well studied. In the general case, the (logical)
equivalence of such mappings is undecidable. If the target tgds admit a finite chase, then
the problem becomes decidable [4]. In [10], two relaxed notions of equivalence were
introduced, namely the CQ-equivalence studied here and data exchange equivalence
12
(DE-equivalence). Moreover, in [10], the undecidability of CQ-equivalence was proved
for mappings consisting of s-t tgds and target tgds. The undecidability was extended in
[17] in two directions, namely to mappings with target egds and from CQ-equivalence
to DE-equivalence. The equivalence of SO tgds has been unexplored so far.
In this paper, we have proved the undecidability of logical equivalence. For the
case of source schemas with key dependencies, we proved the undecidability of CQ-
equivalence. The status of CQ-equivalence without any assumptions on the source
schema has been left open. Closing this gap is on top of the agenda for future work.
Moreover, the search for decidable fragments should be initiated. Finally, we also want
to study the DE-equivalence [10] mentioned above. We conjecture that DE-equivalence
of SO tgds is also undecidable. However, this has to be proved yet.
References
1. C. Allauzen and B. Durand. The classical decision problem, by E. Börger, E. Grädel and Yu.
Gurevich, chapter “Appendix A: Tiling Problems”, pages 407–420. Springer, 2001.
2. M. Arenas, J. Pérez, J. L. Reutter, and C. Riveros. Composition and inversion of schema
mappings. SIGMOD Record, 38(3):17–28, 2009.
3. M. Arenas, J. Pérez, and C. Riveros. The recovery of a schema mapping: Bringing exchanged
data back. ACM Trans. Database Syst., 34(4), 2009.
4. C. Beeri and M. Y. Vardi. A proof procedure for data dependencies. J. ACM, 31(4):718–741,
1984.
5. R. Berger. The undecidability of the domino problem. Amer Mathematical Society, 1966.
6. P. A. Bernstein. Applying model management to classical meta data problems. In Proc.
CIDR’03, 2003. available from http://www-db.cs.wisc.edu/cidr/cidr2003/program/p19.pdf.
7. P. A. Bernstein, T. J. Green, S. Melnik, and A. Nash. Implementing mapping composition.
VLDB J., 17(2):333–353, 2008.
8. R. Fagin. Inverting schema mappings. In Proc. PODS’06, pages 50–59. ACM, 2006.
9. R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query
answering. Theor. Comput. Sci., 336(1):89–124, 2005.
10. R. Fagin, P. G. Kolaitis, A. Nash, and L. Popa. Towards a theory of schema-mapping opti-
mization. In Proc. PODS’08, pages 33–42. ACM, 2008.
11. R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: getting to the core. ACM Trans.
Database Syst., 30(1):174–210, 2005.
12. R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Composing schema mappings: Second-order
dependencies to the rescue. ACM Trans. Database Syst., 30(4):994–1055, 2005.
13. A. Y. Halevy, A. Rajaraman, and J. J. Ordille. Data integration: The teenage years. In Proc.
VLDB’06, pages 9–16. ACM, 2006.
14. J. Madhavan and A. Y. Halevy. Composing mappings among data sources. In Proc.
VLDB’03, pages 572–583, 2003.
15. S. Melnik. Generic Model Management: Concepts and Algorithms, volume 2967 of Lecture
Notes in Computer Science. Springer, 2004.
16. A. Nash, P. A. Bernstein, and S. Melnik. Composition of mappings given by embedded
dependencies. ACM Trans. Database Syst., 32(1):4, 2007.
17. R. Pichler, E. Sallinger, and V. Savenkov. Relaxed notions of schema mapping equivalence
revisited. In Proc. ICDT’11, pages 90–101. ACM, 2011.
13