=Paper= {{Paper |id=Vol-1912/paper28 |storemode=property |title=Three Easy Pieces on Schema Mappings for Tree-structured Data |pdfUrl=https://ceur-ws.org/Vol-1912/paper28.pdf |volume=Vol-1912 |authors=Claire David,Filip Murlak |dblpUrl=https://dblp.org/rec/conf/amw/DavidM17 }} ==Three Easy Pieces on Schema Mappings for Tree-structured Data== https://ceur-ws.org/Vol-1912/paper28.pdf
      Three easy pieces on schema mappings for
                tree-structured data?

                             Claire David1 and Filip Murlak2
                         1
                             Université Paris-Est Marne-la-Vallée
                                  2
                                    University of Warsaw



        Abstract. Schema mappings specify how data organized under a source
        schema should be reorganized under a target schema. For tree-structured
        data, the interplay between these specifications and the complex struc-
        tural conditions imposed by the schemas, makes schema mappings a very
        rich formalism. Classical data management tasks, well-understood in the
        relational model, once again become challenging theoretical problems.
        Recent results on non-mixing integrity constraints help to push the decid-
        ability frontier further for three such problems: consistency of mappings,
        membership in the composition of mappings, and query answering.


1     Introduction
Schema mappings specify how data organized under a source schema should be
reorganized under a target schema. In the relational setting they are expressed
as sets of tuple generating dependencies (tgds) of the form ϕ(x̄) ⇒ ψ(x̄, ȳ), where
ϕ(x̄) and ψ(x̄, ȳ) are conjunctions of atoms over the source schema S and the
target schema T , respectively. A schema mapping M defines a relation [[M]]
between instances of S and instances of T : it contains pairs (S, T ) such that
for each dependency of the form above, for each tuple ū, if S |= ϕ(ū) then
T |= ψ(ū, v̄) for some v̄ (we assume a common data domain). Such mappings
were intensively studied in the context of data exchange, data integration, and
metadata management [8, 9], and finally made their way to commercial tools.
    In the peak of the popularity of the XML data format, schema mappings
were ported to the setting of tree-structured data, in the hope of developing an
equally manageable formalism to deal with data exchange and integration [2].
    A convenient model for tree-structured data are data trees: ordered unranked
trees, whose nodes carry a label from a finite alphabet A and a data value from
an infinite data domain D. For simplicity we shall assume that schemas are
DTDs. A DTD describes trees by specifying the labelling of children based on
the label of their parent. It consists of rules of the form a → ω, where a ∈ A
and ω is a regular expression over A. Such rule means that for each node with
label a the sequence of labels of its children, read from left to right, gives a word
generated by ω. Labels with no rule may be used in leaves only. The root has a
fixed label r ∈ A. A (data) tree conforms to the DTD if it satisfies all these rules.
?
    Supported by Poland’s National Science Center grant 2013/11/D/ST6/03075.
For instance, trees conforming to the DTD with rules r → ss + ε, s → rr + ε,
are binary trees in which odd-level nodes have label r and even-level nodes have
label s. Notice that DTDs do not talk about data values at all.
    Schema mappings are defined using queries selecting tuples of data values
from data trees. We focus on queries expressed with tree patterns. Let Var be a
countably infinite set of variables. The syntax of tree patterns is defined recur-
sively: for ` ∈ A ∪ { } and x ∈ Var, expressions ` and `(x) are tree patterns, and
if ϕ1 , . . . , ϕk are tree patterns, then

                  //ϕ1 ,   `(x)[ϕ1 , . . . , ϕk ] ,   and `[ϕ1 , . . . , ϕk ]

are tree patterns as well. We write ϕ(x̄) to indicate that ϕ uses only variables
from x̄. For ū ∈ D, the pattern ϕ(ū) is obtained from ϕ(x̄) by substituting
variables x̄ with elements of the tuple ū. For a ∈ A, the pattern a can be
matched at each tree node with label a; for , the label can be arbitrary. In case
of a(d) and (d), the node additionally must store the data value d. Pattern //ϕ
can be matched at each node that has a descendent at which ϕ can be matched.
Finally, ρ[ϕ1 , . . . , ϕk ] with ρ ∈ {a, , a(d), (d)} can be matched at each node at
which ρ can be matched, and which has children (not necessarily distinct) at
which ϕ1 , . . . , ϕk can be matched. We write T |= ϕ(ū) if ϕ(ū) can be matched
at some node of the data tree T . Mappings are defined with dependencies of the
form
                              ϕ(x̄) ∧ α(x̄) ⇒ ψ(x̄, ȳ) ∧ β(x̄, ȳ) ,

where ϕ(x̄), ψ(x̄, ȳ) are tree patterns and α(x̄), β(x̄, ȳ) are conjunctions of equal-
ities and inequalities among variables. We make the usual safety assumption
that ϕ(x̄) uses each variable in x̄ and ψ(x̄, ȳ) uses each variable in ȳ. To simplify
descriptions of fragments with restricted use of equalities and inequalities, we
make a proviso that ϕ(x̄) uses each variable in x̄ only once, and ψ(x̄, ȳ) uses
each variable in ȳ only once; variables from x̄ can be repeated in ψ(x̄, ȳ).
    As mappings define relations, they can be composed in the semantic sense:
[[M]] · [[M0 ]] is just the usual (left) composition of relations [[M]] and [[M0 ]]. It
has been shown that to represent the resulting relation as a single mapping, the
setting must be extended with function symbols (and some restrictions on M and
M0 must be made) [1, 9]. This is done by allowing in α(x̄), ψ(x̄, ȳ), and β(x̄, ȳ)
not just variables, but also arbitrary terms over variables x̄, and a countably
infinite set Fun of functions symbols. These symbols are implicitly quantified
existentially; that is, a pair of trees satisfies the set of dependencies constituting
a mapping if there is an interpretation of function symbols as functions Dn → D
for appropriate values of n, that makes each dependency satisfied when terms
are evaluated according to this interpretation. For instance, the dependency

                        a[b(x), c(y)] ⇒ a[b(x), c(y), b0 (f (x))]

invents a value to be stored in the b0 -labelled node together with the copied pair
(x, y), but the invented value is the same for all pairs sharing the value of x.
    Despite the apparent simplicity of DTDs and tree patterns, the interplay be-
tween them makes schema mappings for tree-structured data a very rich formal-
ism. Classical data management tasks, well-understood in the relational model,
once again become challenging computational problems, for which even decid-
ability is hard to obtain [1, 5]. Here we show how recent results on non-mixing
integrity constraints [6] help to push the decidability frontier further for three
such problems: consistency of mappings (Section 2), membership in the compo-
sition of mappings (Section 3), and query answering (Section 4). Each of these
advances raises further open questions, formulated at the end of the sections.
    All the upper bounds we obtain also hold if we allow tree automata as
schemas, and replace tree patterns with conjunctive queries (with node variables
and data value variables) over the signature consisting of label tests, vertical and
horizontal axes, and a predicate relating nodes with stored data values.

2   Consistency
A classic static analysis problem for schema mappings is the following consistency
problem first considered by Arenas and Libkin [2]:

            Problem: Cons
               Input: mapping M
            Question: Are there trees S, T such that (S, T ) ∈ [[M]]?

It is essentially the satisfiability problem for a formalism defining relations over
structures, rather than classes of structures.
    This problem is undecidable as soon as either = or 6= is allowed on both
sides of the mappings [1], but there are no reasons—other than mathematical
elegance—to consider only dependencies with symmetric use of data compar-
isons. In fact, = on the source side of the mappings is often not essential. For
instance, while syntactic closure under composition requires = on the target, it
does not require = on the source [1]. Forbidding = on the source side of course
means that we cannot have proper joins when querying the source instance, but
it might be a price worth paying for decidable consistency. Similarly, 6= on the
source may not be always necessary: for instance, key constraints on the source
can be expressed using 6= exclusively on the target side [6]. Here we take a closer
look at dependencies with asymmetric use of data comparisons. In this new con-
text, we tighten the results of [1] as follows. We use superscripts = and 6= to
indicate that a formula uses, respectively, only equalities, and only inequalities.
Proposition 1. Cons is undecidable already for mappings using dependencies
of exactly one of the following forms:
            ϕ(x̄) ∧ α= (x̄) ⇒ ψ(x̄) ,     ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α6= (x̄, ȳ) ,
and ExpTime-complete for mappings with function symbols and dependencies
of the form:
                  ϕ(x̄) ∧ α6= (x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) .
Proof. The proof relies on the fact that consistency is undecidable already when
dependencies are of the form ϕ(x̄) ∧ α= (x̄) ⇒ β = (x̄), as proved in [1].
   We first reduce consistency of such mappings to consistency of mappings
with dependencies of the first form given in the statement of the proposition.
Let M be a mapping with dependencies of the form

           ϕ(x1 , x2 . . . , xn ) ∧ α= (x1 , x2 . . . , xn ) ⇒ β = (x1 , x2 , . . . , xn ) .

To obtain the new mapping M0 , in each dependency we replace β = (x1 , x2 , . . . , xn )
with the pattern                                                
                     b (x1 )[c1 ], (x2 )[c2 ], . . . , (xn )[cn ] ,
where b and c1 , c2 , . . . , cn are fresh labels specific to this dependency. Let us
partition the set of variables x1 , x2 , . . . , xn into nonempty subsets Z1 , Z2 . . . , Zm
for some m ≤ n, in such a way that two variables are in the same subset if and
only if there is a sequence of equalities in β = (x1 , x2 , . . . , xn ) that connects them.
Let b1 , b2 , . . . , bm be fresh labels and let Tb be a tree whose root has label b and
each Zi is represented by a bi -labelled child of the root that has a child with
label cj if and only if xj ∈ Zi . The target schema of M0 allows all trees obtained
by combining under one root with label r arbitrary numbers of copies of trees
Tb corresponding to all dependencies of M; note that it uses only finitely many
labels. The source schema of M0 is inherited from M. By construction, for each
source tree S there exists T such that (S, T ) ∈ [[M]] if and only if there exists
T 0 such that (S, T 0 ) |= [[M0 ]]. This completes the proof of the first claim.
    For dependencies of the second form we also begin with the mapping M
above. Each of its dependencies is of the form

                      ϕ(x̄) ∧ xi1 = xj1 ∧ · · · ∧ xi` = xj` ⇒ β = (x̄) .

This is equivalent to

                      ϕ(x̄) ⇒ xi1 6= xj1 ∨ · · · ∨ xi` 6= xj` ∨ β = (x̄) ,

which can be further rewritten as

      ϕ(x̄) ⇒ y1 6= xj1 ∧ · · · ∧ y` 6= xj` ∧ y1 = xi1 ∨ · · · ∨ y` = xi` ∨ β = (x̄) ,
                                                                                    

using fresh variables ȳ. Thus, we can assume that M has dependencies of the
form
           ϕ(x̄) ⇒ α6= (x̄, ȳ) ∧ α0= (x̄, ȳ) ∨ α1= (x̄, ȳ) ∨ · · · ∨ α`= (x̄, ȳ) ,
                                                                                    

where αi= are conjunctions of equalities and α6= is a conjunction of inequalities.
In M00 such dependency is replaced with

            ϕ(x̄) ⇒ b (z1 )[c1 ], (z2 )[c2 ], . . . , (zm )[cm ] ∧ α6= (x̄, ȳ) ,
                                                               

where b is a fresh letter specific to this dependency only, and z1 , z2 , . . . , zm is an
enumeration of x̄ and ȳ. The target schema of M00 is defined like before, except
that subtrees with label b in the root have ` + 1 variants, corresponding to
α0= , α1= , . . . , α`= , and multiple copies of multiple variants are allowed. The source
schema is inherited from M. It is easy to check that the reduction is correct.
     For the decidability claim, let M be a mapping using neither = over the
source, not 6= over the target. Take (S, T ) ∈ [[M]]. Replace all data values with
the same fixed data value ⊥. The obtained pair of trees (S 0 , T 0 ) is also in [[M]],
and in the witnessing valuation we can assume that all functions are constantly
equal to ⊥. Hence, if M is consistent, it can be witnessed with the set of data
values restricted to a single value ⊥. Consequently, we can drop all dependencies
containing an inequality on the source (their left hand side will never be satisfied
if ⊥ is the only available data value), and all references to data values in the
remaining dependencies (it is always the same value ⊥, so they make no differ-
ence). This gives a mapping that does not use variables at all; for such mappings
Cons amounts to exponentially many nonemptiness tests for tree automata of
exponential size [1, 2]. The lower bound holds already for such mappings [2]. t         u
    Surprisingly, Cons becomes decidable for mappings using only dependencies
of the second form in Proposition 1, ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α6= (x̄, ȳ), as soon as we
forbid introducing new variables on the target side (as in full tgds from the
relational setting) [6]. Here we show that the argument in fact works for a much
more general class of mappings.
    The approach relies on a strong decidability result proved in [6]. A non-mixing
constraint is a dependency of the form
                               ϕ(x̄) ⇒ η = (x̄) ∧ η 6= (x̄) ,
where ϕ(x̄) is a pattern possibly using constants from D, η = (x̄) is a positive
boolean combination of equalities over x̄ and constants from D, and η 6= (x̄) is a
positive boolean combination of inequalities over x̄ and constants from D, as well
as equalities between a variable from x̄ and a constant from D. A tree T satisfies
such a constraint if for each tuple ū, T |= ϕ(ū) implies that η = (ū) ∧ η 6= (ū) holds.
Theorem 1 ([6]). It is decidable if there exists a tree conforming to a given
schema, and satisfying a given set of non-mixing constraints as well as a given
                                                                     poly(`)
set of patterns without variables. The algorithm runs in time 2n             , where n
is the total size of the input, and ` is the maximal size of involved patterns. The
problem is in fact 2ExpTime-complete.
We note that the original setting of [6] models schemas as tree automata and
allows full conjunctive queries, but disallows equalities in η 6= (x̄). Because of that,
both the upper bound and the lower bound require some routine adjustments.
    The second ingredient is a useful normal form for sets of dependencies, im-
plicit in [7], resulting from the decomposition of schemas into so-called kinds [5].
Lemma 1. For each mapping M one can compute in doubly exponential time
sets Σ1 , Σ2 , . . . , Σm of dependencies, each obtained from the set of dependencies
of M by replacing all target-side patterns ψ(x̄, ȳ) with exponential-size positive
boolean combinations η = (x̄, ȳ) of equalities over x̄, ȳ and constants from D, such
that for each data tree S, (S, T ) ∈ [[M]] for some data tree T if and only if
S |= Σi for some i ∈ {1, . . . , m}.
Proposition 2. Cons is decidable ( 2ExpTime-complete) if the input mapping
has only dependencies of the following two forms:

            ϕ(x̄) ∧ α6= (x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) ,     ϕ(x̄) ⇒ α6= (x̄) .

Proof. By the assumption on the form of dependencies in the input mapping,
using Lemma 1 we obtain sets Σi of dependencies of the forms

           ϕ(x̄) ∧ α6= (x̄) ⇒ η = (x̄, ȳ) ∧ α= (x̄, ȳ) ,   ϕ(x̄) ⇒ α6= (x̄) .

Dependencies of the second form are already non-mixing constraints. Dependen-
cies of the first form can be rewritten as

                      ϕ(x̄) ⇒ ¬α6= (x̄) ∨ η = (x̄, ȳ) ∧ α= (x̄, ȳ) ,
                                                                    

and because ¬α6= (x̄) is equivalent to a positive Boolean combination of equalities,
we can assume they are of the form

                                    ϕ(x̄) ⇒ η = (x̄, ȳ) .

Finally, variables ȳ can be eliminated from η = (x̄, ȳ) by rewriting the positive
boolean combination in the disjunctive normal form, completing each disjunct
with equalities following by transitivity, and then dropping all equalities involv-
ing ȳ. Because η = (x̄, ȳ) has exponential size, it uses only exponentially many
constants. Consequently, there can be only exponentially many different dis-
juncts that do not contain an equality between two different constants; disjuncts
containing such equalities can be dropped. After these modifications each Σi is
a set of non-mixing constraints and by Theorem 1 it can be tested if either of
them is satisfiable in a tree conforming to the source schema. The total size of
the obtained non-mixing constraints is exponential, because of Lemma 1, but
the size of patterns has not increased. Hence, the problem is in 2ExpTime. The
lower bound follows from the proof of Theorem 1.                                  t
                                                                                  u
    Note that dependencies of the form ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) ∧ α6= (x̄) can
be replaced with ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) and ϕ(x̄) ⇒ α6= (x̄); hence, they are
also covered by Proposition 2.
Problem 1. It is an interesting question if Proposition 2 can be extended to map-
pings with function symbols. This would correspond to allowing appropriately
limited use of function symbols in non-mixing constraints. As function symbols
can simulate fresh variables on the target, this will require quite some care.


3    Composition membership
Composition membership is the following decision problem:

                      Problem: Comp
                         Input: mappings M, M0 , trees S, T
                      Question: (S, T ) ∈ [[M]] · [[M0 ]] ?
While it is not a static analysis problem—data are clearly there—it does resemble
one, because it asks about the existence of an instance I such that (S, I) ∈ [[M]]
and (I, T ) ∈ [[M0 ]]. Consequently, despite the simplicity of the logical formalism
involved, the problem sits astride the decidability frontier: it is undecidable if
either = or 6= is allowed, but becomes decidable if both are forbidden [1]. The
argument for the latter makes an additional assumption that function symbols
are forbidden. Most often, this is just a simplifying assumption. Surprisingly, in
this case it is necessary, as the following result shows.

Proposition 3. Comp is undecidable for mappings with function symbols. In
fact, even existence of a source tree for a given target tree with respect to a fixed
mapping is undecidable.

Proof. To show that the latter problem is undecidable, we reduce the following
tiling problem: given a set of tiles X , an initial tile I ∈ X , a final tile F ∈ X ,
and relations H, V ⊆ X × X , decide if for some N there exists an N × N tiling;
that is, a function f : {1, 2, . . . , N }2 → X such that f (1, 1) = I, f (N, N ) = F ,
(f (i, j), f (i, j + 1)) ∈ V , and (f (j, i), f (j + 1, i)) ∈ H for all i ≤ N and j < N .
    The mapping does not depend on the instance of tiling: the source schema
is r → a, a → b, b → b + c, the target schema is r → IF V ∗ H ∗ , V → V1 V2 ,
H → H1 H2 , and the dependencies are
                                                                      
                        r a(x), //c(y) −→ r I(f (x, x)), F (f (y, y)) ,
                 r // (x), // (y)[ (y 0 )] −→ V V1 (f (x, y)), V2 (f (x, y 0 )) ,
                                                                            

                 r // (x)[ (x0 )], // (y) −→ H H1 (f (x, y)), H2 (f (x0 , y)) .
                                                                              

    The target instance T is the standard tree encoding of a relational instance
with relations I, F , H, V storing an instance of the tiling problem. Notice that
the set X is not represented explicitly, but it is assumed to be the set of values
occurring in the relations H, V .
    The source instance is intended to represent a finite linear order with the
least element in the a node, and the largest element in the c node.
    Let us see that the reduction is correct. If there exists an N × N tiling,
there is a tree satisfying the dependencies: the data values in this tree are just
successive numbers from 1 to N , and the function symbol f is interpreted as the
tiling function. Conversely, if there is a source tree satisfying the dependencies,
there is also one where all data values in the leaves are different:
 – no new rule will be fired, because the dependencies do not have inequality
   on the source side;
 – no rule fired previously will be violated, because no data value is used directly
   on the target side, and the function f on new values can be defined just like
   on the values they replace.
Let us call these data values d1 , d2 , . . . , dN , according to their order of appear-
ance in the single branch of the source tree. The interpretation of the function
symbol f , witnessing that the dependencies are satisfied, gives an N × N -tiling.
    It follows that Comp is undecidable as well, because a source tree exists for
T with respect to M if and only if (T0 , T ) ∈ [[M0 ]] · [[M]], where T0 is some fixed
tree and M0 is the mapping from the trivial schema allowing only T0 to the
source schema of M, whose set of dependencies is empty.                              t
                                                                                     u

   Thus, to regain decidability we need to restrict the use of =, 6=, and function
symbols. But do we have to forbid them entirely, as in [1]? Below we show that
the restriction can be much weaker than that. The resulting class of dependencies
extends that of Proposition 2 in two ways: the second form allows equalities on
the source side and unrestricted use of fresh variables on the target side.

Proposition 4. Comp is decidable (2ExpTime-complete) if the second map-
ping uses no function symbols and has dependencies of the following two forms:

   ϕ(x̄) ∧ α6= (x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) ,     ϕ(x̄) ∧ α= (x̄) ⇒ ψ(x̄, ȳ) ∧ α6= (x̄, ȳ) .

Proof. Let M, M0 , S, T an instance of Comp. Rather than using Lemma 1,
as in Proposition 2, we note that we always evaluate dependencies of M0 over
pairs of trees of the form (I, T ). Hence, in each dependency we can replace the
pattern ψ(x̄, ȳ) with a formula γ = (x̄, ȳ) that is a disjunction of exponentially
many conjunctions of equalities between variables and data values used in T ,
corresponding to all ways of matching ψ(x̄, ȳ) in T . Moreover, by the safety
assumption we know that in each conjunction all variables ȳ occur in at least
one equality. Moving all data comparisons to the target side, we get

ϕ(x̄) ⇒ ¬α6= (x̄)∨ γ = (x̄, ȳ)∧α= (x̄, ȳ) , ϕ(x̄) ⇒ ¬α= (x̄)∨ γ = (x̄, ȳ)∧α6= (x̄, ȳ) .
                                                                                        


Dependencies of the first form can be turned into non-mixing constraints as in the
proof of Proposition 2. In dependencies of the second form we can also eliminate
variables ȳ. Recall that γ = (x̄, ȳ) = γ1= (x̄, ȳ) ∨ · · · ∨ γm
                                                                 =
                                                                   (x̄, ȳ), and in each γi= (x̄, ȳ)
each of the variables ȳ appears in an equality with a data value; we can assume
that it is exactly one equality, because otherwise γi= (x̄, ȳ) is not satisfiable and
can be dropped. Hence, we can replace γ = (x̄, ȳ) ∧ α6= (x̄, ȳ) with the disjunction
of formulas γi= (x̄, ȳ) ∧ α6= (x̄, ȳ) for i = 1, . . . , m. In each γi= (x̄, ȳ) ∧ α6= (x̄, ȳ) we
eliminate ȳ independently, replacing all their occurrences with the data values
assigned to them by γi= (x̄, ȳ). Because ¬α= (x̄) is equivalent to a positive boolean
combination of inequalities, we thus transform dependencies of the second form
into non-mixing constraints. Let Σ 0 be the resulting set of constraints.
    Since M can use function symbols, we can assume that its dependencies do
not introduce new variables on the target side. Let ∆ be the set of all terms
occurring in M with variables substituted with data values used in S. Restrict
D to data values used in S or T , and |∆| fresh data values. Iterate through
all consistent valuations of terms in ∆. For each valuation consider formulas
ψ(ū) ∧ β(ū) for all dependencies ϕ(x̄)∧α(x̄) ⇒ ψ(x̄)∧β(x̄) of M and all ū ∈ D
such that S |= ϕ(ū) ∧ α(ū). It one of β(ū) does not hold, discard this valuation of
∆. Otherwise, let Ψ be the set of all such ψ(ū). Now it remains to check that there
exists a tree I conforming to the target schema of M and the source schema of
M0 , such that I |= Ψ and I |= Σ 0 . This is essentially an instance of the problem
from Theorem 1, except that we have an intersection of two schemas. This is
not problematic, because the setting in [6] abstracts schemas as automata, which
are closed under intersection and the resulting automaton is quadratic. The total
size of patterns from Ψ and constraints from Σ 0 is exponential, but the size of
involved patterns has not increased. Hence, the problem is in 2ExpTime. The
lower bound holds already for mappings without data comparisons [1].              t
                                                                                  u

Problem 2. Unrestricted use of either = or 6= on both sides of the second mapping
immediately leads to undecidability of Comp [1], but it would be very interesting
to see if one can allow dependencies of the form

                     ϕ0 (x̄) ⇒ ψ 0 (x̄, ȳ) ∧ α= (x̄, ȳ) ∧ α6= (x̄, ȳ) ,

which is not excluded by the existing undecidability results.

Problem 3. When the mappings are fixed, the algorithm from Proposition 4 runs
in exponential time. One easily transfers the NP lower bound from the relational
case [9], but no tighter bound is known. This leaves a substantial gap.


4   Certain answers
The central problem of data exchange is answering queries over the target. Given
that the solution to a given source instance is typically not unique, the adopted
semantics is that of certain answers: we only keep those tuples that are returned
over each solution to the given source instance. That is, query answering in the
context of data exchange boils down to the following problem:

       Problem: Cert
          Input: mapping M, tree pattern query q(x̄), tree S, tuple ā
       Question: Does T |= q(ā) hold for each solution T for S wrt. M?

    In [1] it is shown that as soon as one forbids inequality in queries, the problem
becomes decidable. This proof goes by reduction to a more fundamental problem
of certain answers in incomplete information scenarios [3], which in turn can be
seen as an instance of query containment for unions of conjunctive queries over
trees in the presence of a schema [4]. In [3] the problem is solved by resorting
to first order types of fixed rank, which gives data complexity in coNP, but
apparently very high combined complexity, not even determined in the paper.
Here we show how to use the results on non-mixing constraints to pinpoint the
exact complexity, and also to extend decidability to a larger class of queries.

Proposition 5. Cert is decidable ( 2ExpTime-complete) for unions of queries
of the form ∃ȳ ϕ(x̄, ȳ) ∧ α= (x̄, ȳ) and ∃ȳ ϕ(x̄, ȳ) ∧ α6= (x̄, ȳ).

Proof. Let M, S, q(x̄), ū ∈ D be an instance of Cert as in the statement.
Construct a set Σ 0 of non-mixing constraints equivalent to the negation of q(ū):
for disjuncts ∃ȳ ϕ(ū, ȳ) ∧ α= (ū, ȳ) and ∃ȳ ϕ(ū, ȳ) ∧ α6= (ū, ȳ), take constraints
ϕ(ū, ȳ) ⇒ ¬α= (ū, ȳ) and ϕ(ū, ȳ) ⇒ ¬α6= (ū, ȳ) with negation pushed down to
the atoms. Then proceed exactly like in Proposition 4 to reduce to the problem
in Theorem 1 and get the 2ExpTime upper bound. The lower bound follows
from the proof of Theorem 1.                                                              t
                                                                                          u
    The upper bound transfers to the certain answers problem in the incomplete
information scenario, as the latter can be easily seen as a special case of Cert;
the bound is also tight, by the same argument as above. In fact, for such queries
even validity with respect to a schema is 2ExpTime-hard.
Problem 4. The lower bound in Proposition 5 requires both = and 6=. For queries
using only =, the exact complexity remains an open question, particularly in-
triguing for the incomplete information scenario.
Problem 5. The upper bound above relies on an algorithm whose data complex-
ity is exponential. We also know that there is, at least for queries using only =, a
nondeterministic algorithm with polynomial data complexity, but undetermined
combined complexity. Can we have an algorithm with both data complexity and
combined complexity optimized? What does that even mean in a situation when
data complexity is pinpointed in the nondeterministic model, and combined com-
plexity in the deterministic model?

Acknowledgments. We thank Wojtek Czerwiński and Pawel Parys for fruitful
discussions. A large part of this work was done while the second author was
visiting researcher at Université Paris-Est Marne-la-Vallée.

References
1. Shun’ichi Amano, Claire David, Leonid Libkin, and Filip Murlak. XML schema
   mappings: Data exchange and metadata management. J. ACM, 61(2):12:1–12:48,
   2014.
2. Marcelo Arenas and Leonid Libkin. XML data exchange: Consistency and query
   answering. J. ACM, 55(2), 2008.
3. Pablo Barceló, Leonid Libkin, Antonella Poggi, and Cristina Sirangelo. XML with
   incomplete information. J. ACM, 58(1):4, 2010.
4. Henrik Björklund, Wim Martens, and Thomas Schwentick. Optimizing conjunctive
   queries over trees using schema information. In Proc. MFCS, pages 132–143, 2008.
5. Mikolaj Bojanczyk, Leszek Aleksander Kolodziejczyk, and Filip Murlak. Solutions
   in XML data exchange. J. Comput. Syst. Sci., 79(6):785–815, 2013.
6. Wojciech Czerwiński, Claire David, Filip Murlak, and Pawel Parys. Reasoning
   about integrity constraints for tree-structured data. In Proc. ICDT, pages 20:1–
   20:18, 2016.
7. Claire David, Piotr Hofman, Filip Murlak, and Michal Pilipczuk. Synthesizing
   transformations from XML schema mappings. In Proc. ICDT, pages 61–71, 2014.
8. Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data ex-
   change: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005.
9. Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and Wang Chiew Tan. Composing
   schema mappings: Second-order dependencies to the rescue. ACM Trans. Database
   Syst., 30(4):994–1055, 2005.