Three easy pieces on schema mappings for tree-structured data? Claire David1 and Filip Murlak2 1 Université Paris-Est Marne-la-Vallée 2 University of Warsaw Abstract. Schema mappings specify how data organized under a source schema should be reorganized under a target schema. For tree-structured data, the interplay between these specifications and the complex struc- tural conditions imposed by the schemas, makes schema mappings a very rich formalism. Classical data management tasks, well-understood in the relational model, once again become challenging theoretical problems. Recent results on non-mixing integrity constraints help to push the decid- ability frontier further for three such problems: consistency of mappings, membership in the composition of mappings, and query answering. 1 Introduction Schema mappings specify how data organized under a source schema should be reorganized under a target schema. In the relational setting they are expressed as sets of tuple generating dependencies (tgds) of the form ϕ(x̄) ⇒ ψ(x̄, ȳ), where ϕ(x̄) and ψ(x̄, ȳ) are conjunctions of atoms over the source schema S and the target schema T , respectively. A schema mapping M defines a relation [[M]] between instances of S and instances of T : it contains pairs (S, T ) such that for each dependency of the form above, for each tuple ū, if S |= ϕ(ū) then T |= ψ(ū, v̄) for some v̄ (we assume a common data domain). Such mappings were intensively studied in the context of data exchange, data integration, and metadata management [8, 9], and finally made their way to commercial tools. In the peak of the popularity of the XML data format, schema mappings were ported to the setting of tree-structured data, in the hope of developing an equally manageable formalism to deal with data exchange and integration [2]. A convenient model for tree-structured data are data trees: ordered unranked trees, whose nodes carry a label from a finite alphabet A and a data value from an infinite data domain D. For simplicity we shall assume that schemas are DTDs. A DTD describes trees by specifying the labelling of children based on the label of their parent. It consists of rules of the form a → ω, where a ∈ A and ω is a regular expression over A. Such rule means that for each node with label a the sequence of labels of its children, read from left to right, gives a word generated by ω. Labels with no rule may be used in leaves only. The root has a fixed label r ∈ A. A (data) tree conforms to the DTD if it satisfies all these rules. ? Supported by Poland’s National Science Center grant 2013/11/D/ST6/03075. For instance, trees conforming to the DTD with rules r → ss + ε, s → rr + ε, are binary trees in which odd-level nodes have label r and even-level nodes have label s. Notice that DTDs do not talk about data values at all. Schema mappings are defined using queries selecting tuples of data values from data trees. We focus on queries expressed with tree patterns. Let Var be a countably infinite set of variables. The syntax of tree patterns is defined recur- sively: for ` ∈ A ∪ { } and x ∈ Var, expressions ` and `(x) are tree patterns, and if ϕ1 , . . . , ϕk are tree patterns, then //ϕ1 , `(x)[ϕ1 , . . . , ϕk ] , and `[ϕ1 , . . . , ϕk ] are tree patterns as well. We write ϕ(x̄) to indicate that ϕ uses only variables from x̄. For ū ∈ D, the pattern ϕ(ū) is obtained from ϕ(x̄) by substituting variables x̄ with elements of the tuple ū. For a ∈ A, the pattern a can be matched at each tree node with label a; for , the label can be arbitrary. In case of a(d) and (d), the node additionally must store the data value d. Pattern //ϕ can be matched at each node that has a descendent at which ϕ can be matched. Finally, ρ[ϕ1 , . . . , ϕk ] with ρ ∈ {a, , a(d), (d)} can be matched at each node at which ρ can be matched, and which has children (not necessarily distinct) at which ϕ1 , . . . , ϕk can be matched. We write T |= ϕ(ū) if ϕ(ū) can be matched at some node of the data tree T . Mappings are defined with dependencies of the form ϕ(x̄) ∧ α(x̄) ⇒ ψ(x̄, ȳ) ∧ β(x̄, ȳ) , where ϕ(x̄), ψ(x̄, ȳ) are tree patterns and α(x̄), β(x̄, ȳ) are conjunctions of equal- ities and inequalities among variables. We make the usual safety assumption that ϕ(x̄) uses each variable in x̄ and ψ(x̄, ȳ) uses each variable in ȳ. To simplify descriptions of fragments with restricted use of equalities and inequalities, we make a proviso that ϕ(x̄) uses each variable in x̄ only once, and ψ(x̄, ȳ) uses each variable in ȳ only once; variables from x̄ can be repeated in ψ(x̄, ȳ). As mappings define relations, they can be composed in the semantic sense: [[M]] · [[M0 ]] is just the usual (left) composition of relations [[M]] and [[M0 ]]. It has been shown that to represent the resulting relation as a single mapping, the setting must be extended with function symbols (and some restrictions on M and M0 must be made) [1, 9]. This is done by allowing in α(x̄), ψ(x̄, ȳ), and β(x̄, ȳ) not just variables, but also arbitrary terms over variables x̄, and a countably infinite set Fun of functions symbols. These symbols are implicitly quantified existentially; that is, a pair of trees satisfies the set of dependencies constituting a mapping if there is an interpretation of function symbols as functions Dn → D for appropriate values of n, that makes each dependency satisfied when terms are evaluated according to this interpretation. For instance, the dependency a[b(x), c(y)] ⇒ a[b(x), c(y), b0 (f (x))] invents a value to be stored in the b0 -labelled node together with the copied pair (x, y), but the invented value is the same for all pairs sharing the value of x. Despite the apparent simplicity of DTDs and tree patterns, the interplay be- tween them makes schema mappings for tree-structured data a very rich formal- ism. Classical data management tasks, well-understood in the relational model, once again become challenging computational problems, for which even decid- ability is hard to obtain [1, 5]. Here we show how recent results on non-mixing integrity constraints [6] help to push the decidability frontier further for three such problems: consistency of mappings (Section 2), membership in the compo- sition of mappings (Section 3), and query answering (Section 4). Each of these advances raises further open questions, formulated at the end of the sections. All the upper bounds we obtain also hold if we allow tree automata as schemas, and replace tree patterns with conjunctive queries (with node variables and data value variables) over the signature consisting of label tests, vertical and horizontal axes, and a predicate relating nodes with stored data values. 2 Consistency A classic static analysis problem for schema mappings is the following consistency problem first considered by Arenas and Libkin [2]: Problem: Cons Input: mapping M Question: Are there trees S, T such that (S, T ) ∈ [[M]]? It is essentially the satisfiability problem for a formalism defining relations over structures, rather than classes of structures. This problem is undecidable as soon as either = or 6= is allowed on both sides of the mappings [1], but there are no reasons—other than mathematical elegance—to consider only dependencies with symmetric use of data compar- isons. In fact, = on the source side of the mappings is often not essential. For instance, while syntactic closure under composition requires = on the target, it does not require = on the source [1]. Forbidding = on the source side of course means that we cannot have proper joins when querying the source instance, but it might be a price worth paying for decidable consistency. Similarly, 6= on the source may not be always necessary: for instance, key constraints on the source can be expressed using 6= exclusively on the target side [6]. Here we take a closer look at dependencies with asymmetric use of data comparisons. In this new con- text, we tighten the results of [1] as follows. We use superscripts = and 6= to indicate that a formula uses, respectively, only equalities, and only inequalities. Proposition 1. Cons is undecidable already for mappings using dependencies of exactly one of the following forms: ϕ(x̄) ∧ α= (x̄) ⇒ ψ(x̄) , ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α6= (x̄, ȳ) , and ExpTime-complete for mappings with function symbols and dependencies of the form: ϕ(x̄) ∧ α6= (x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) . Proof. The proof relies on the fact that consistency is undecidable already when dependencies are of the form ϕ(x̄) ∧ α= (x̄) ⇒ β = (x̄), as proved in [1]. We first reduce consistency of such mappings to consistency of mappings with dependencies of the first form given in the statement of the proposition. Let M be a mapping with dependencies of the form ϕ(x1 , x2 . . . , xn ) ∧ α= (x1 , x2 . . . , xn ) ⇒ β = (x1 , x2 , . . . , xn ) . To obtain the new mapping M0 , in each dependency we replace β = (x1 , x2 , . . . , xn ) with the pattern   b (x1 )[c1 ], (x2 )[c2 ], . . . , (xn )[cn ] , where b and c1 , c2 , . . . , cn are fresh labels specific to this dependency. Let us partition the set of variables x1 , x2 , . . . , xn into nonempty subsets Z1 , Z2 . . . , Zm for some m ≤ n, in such a way that two variables are in the same subset if and only if there is a sequence of equalities in β = (x1 , x2 , . . . , xn ) that connects them. Let b1 , b2 , . . . , bm be fresh labels and let Tb be a tree whose root has label b and each Zi is represented by a bi -labelled child of the root that has a child with label cj if and only if xj ∈ Zi . The target schema of M0 allows all trees obtained by combining under one root with label r arbitrary numbers of copies of trees Tb corresponding to all dependencies of M; note that it uses only finitely many labels. The source schema of M0 is inherited from M. By construction, for each source tree S there exists T such that (S, T ) ∈ [[M]] if and only if there exists T 0 such that (S, T 0 ) |= [[M0 ]]. This completes the proof of the first claim. For dependencies of the second form we also begin with the mapping M above. Each of its dependencies is of the form ϕ(x̄) ∧ xi1 = xj1 ∧ · · · ∧ xi` = xj` ⇒ β = (x̄) . This is equivalent to ϕ(x̄) ⇒ xi1 6= xj1 ∨ · · · ∨ xi` 6= xj` ∨ β = (x̄) , which can be further rewritten as ϕ(x̄) ⇒ y1 6= xj1 ∧ · · · ∧ y` 6= xj` ∧ y1 = xi1 ∨ · · · ∨ y` = xi` ∨ β = (x̄) ,  using fresh variables ȳ. Thus, we can assume that M has dependencies of the form ϕ(x̄) ⇒ α6= (x̄, ȳ) ∧ α0= (x̄, ȳ) ∨ α1= (x̄, ȳ) ∨ · · · ∨ α`= (x̄, ȳ) ,  where αi= are conjunctions of equalities and α6= is a conjunction of inequalities. In M00 such dependency is replaced with ϕ(x̄) ⇒ b (z1 )[c1 ], (z2 )[c2 ], . . . , (zm )[cm ] ∧ α6= (x̄, ȳ) ,   where b is a fresh letter specific to this dependency only, and z1 , z2 , . . . , zm is an enumeration of x̄ and ȳ. The target schema of M00 is defined like before, except that subtrees with label b in the root have ` + 1 variants, corresponding to α0= , α1= , . . . , α`= , and multiple copies of multiple variants are allowed. The source schema is inherited from M. It is easy to check that the reduction is correct. For the decidability claim, let M be a mapping using neither = over the source, not 6= over the target. Take (S, T ) ∈ [[M]]. Replace all data values with the same fixed data value ⊥. The obtained pair of trees (S 0 , T 0 ) is also in [[M]], and in the witnessing valuation we can assume that all functions are constantly equal to ⊥. Hence, if M is consistent, it can be witnessed with the set of data values restricted to a single value ⊥. Consequently, we can drop all dependencies containing an inequality on the source (their left hand side will never be satisfied if ⊥ is the only available data value), and all references to data values in the remaining dependencies (it is always the same value ⊥, so they make no differ- ence). This gives a mapping that does not use variables at all; for such mappings Cons amounts to exponentially many nonemptiness tests for tree automata of exponential size [1, 2]. The lower bound holds already for such mappings [2]. t u Surprisingly, Cons becomes decidable for mappings using only dependencies of the second form in Proposition 1, ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α6= (x̄, ȳ), as soon as we forbid introducing new variables on the target side (as in full tgds from the relational setting) [6]. Here we show that the argument in fact works for a much more general class of mappings. The approach relies on a strong decidability result proved in [6]. A non-mixing constraint is a dependency of the form ϕ(x̄) ⇒ η = (x̄) ∧ η 6= (x̄) , where ϕ(x̄) is a pattern possibly using constants from D, η = (x̄) is a positive boolean combination of equalities over x̄ and constants from D, and η 6= (x̄) is a positive boolean combination of inequalities over x̄ and constants from D, as well as equalities between a variable from x̄ and a constant from D. A tree T satisfies such a constraint if for each tuple ū, T |= ϕ(ū) implies that η = (ū) ∧ η 6= (ū) holds. Theorem 1 ([6]). It is decidable if there exists a tree conforming to a given schema, and satisfying a given set of non-mixing constraints as well as a given poly(`) set of patterns without variables. The algorithm runs in time 2n , where n is the total size of the input, and ` is the maximal size of involved patterns. The problem is in fact 2ExpTime-complete. We note that the original setting of [6] models schemas as tree automata and allows full conjunctive queries, but disallows equalities in η 6= (x̄). Because of that, both the upper bound and the lower bound require some routine adjustments. The second ingredient is a useful normal form for sets of dependencies, im- plicit in [7], resulting from the decomposition of schemas into so-called kinds [5]. Lemma 1. For each mapping M one can compute in doubly exponential time sets Σ1 , Σ2 , . . . , Σm of dependencies, each obtained from the set of dependencies of M by replacing all target-side patterns ψ(x̄, ȳ) with exponential-size positive boolean combinations η = (x̄, ȳ) of equalities over x̄, ȳ and constants from D, such that for each data tree S, (S, T ) ∈ [[M]] for some data tree T if and only if S |= Σi for some i ∈ {1, . . . , m}. Proposition 2. Cons is decidable ( 2ExpTime-complete) if the input mapping has only dependencies of the following two forms: ϕ(x̄) ∧ α6= (x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) , ϕ(x̄) ⇒ α6= (x̄) . Proof. By the assumption on the form of dependencies in the input mapping, using Lemma 1 we obtain sets Σi of dependencies of the forms ϕ(x̄) ∧ α6= (x̄) ⇒ η = (x̄, ȳ) ∧ α= (x̄, ȳ) , ϕ(x̄) ⇒ α6= (x̄) . Dependencies of the second form are already non-mixing constraints. Dependen- cies of the first form can be rewritten as ϕ(x̄) ⇒ ¬α6= (x̄) ∨ η = (x̄, ȳ) ∧ α= (x̄, ȳ) ,  and because ¬α6= (x̄) is equivalent to a positive Boolean combination of equalities, we can assume they are of the form ϕ(x̄) ⇒ η = (x̄, ȳ) . Finally, variables ȳ can be eliminated from η = (x̄, ȳ) by rewriting the positive boolean combination in the disjunctive normal form, completing each disjunct with equalities following by transitivity, and then dropping all equalities involv- ing ȳ. Because η = (x̄, ȳ) has exponential size, it uses only exponentially many constants. Consequently, there can be only exponentially many different dis- juncts that do not contain an equality between two different constants; disjuncts containing such equalities can be dropped. After these modifications each Σi is a set of non-mixing constraints and by Theorem 1 it can be tested if either of them is satisfiable in a tree conforming to the source schema. The total size of the obtained non-mixing constraints is exponential, because of Lemma 1, but the size of patterns has not increased. Hence, the problem is in 2ExpTime. The lower bound follows from the proof of Theorem 1. t u Note that dependencies of the form ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) ∧ α6= (x̄) can be replaced with ϕ(x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) and ϕ(x̄) ⇒ α6= (x̄); hence, they are also covered by Proposition 2. Problem 1. It is an interesting question if Proposition 2 can be extended to map- pings with function symbols. This would correspond to allowing appropriately limited use of function symbols in non-mixing constraints. As function symbols can simulate fresh variables on the target, this will require quite some care. 3 Composition membership Composition membership is the following decision problem: Problem: Comp Input: mappings M, M0 , trees S, T Question: (S, T ) ∈ [[M]] · [[M0 ]] ? While it is not a static analysis problem—data are clearly there—it does resemble one, because it asks about the existence of an instance I such that (S, I) ∈ [[M]] and (I, T ) ∈ [[M0 ]]. Consequently, despite the simplicity of the logical formalism involved, the problem sits astride the decidability frontier: it is undecidable if either = or 6= is allowed, but becomes decidable if both are forbidden [1]. The argument for the latter makes an additional assumption that function symbols are forbidden. Most often, this is just a simplifying assumption. Surprisingly, in this case it is necessary, as the following result shows. Proposition 3. Comp is undecidable for mappings with function symbols. In fact, even existence of a source tree for a given target tree with respect to a fixed mapping is undecidable. Proof. To show that the latter problem is undecidable, we reduce the following tiling problem: given a set of tiles X , an initial tile I ∈ X , a final tile F ∈ X , and relations H, V ⊆ X × X , decide if for some N there exists an N × N tiling; that is, a function f : {1, 2, . . . , N }2 → X such that f (1, 1) = I, f (N, N ) = F , (f (i, j), f (i, j + 1)) ∈ V , and (f (j, i), f (j + 1, i)) ∈ H for all i ≤ N and j < N . The mapping does not depend on the instance of tiling: the source schema is r → a, a → b, b → b + c, the target schema is r → IF V ∗ H ∗ , V → V1 V2 , H → H1 H2 , and the dependencies are     r a(x), //c(y) −→ r I(f (x, x)), F (f (y, y)) , r // (x), // (y)[ (y 0 )] −→ V V1 (f (x, y)), V2 (f (x, y 0 )) ,     r // (x)[ (x0 )], // (y) −→ H H1 (f (x, y)), H2 (f (x0 , y)) .     The target instance T is the standard tree encoding of a relational instance with relations I, F , H, V storing an instance of the tiling problem. Notice that the set X is not represented explicitly, but it is assumed to be the set of values occurring in the relations H, V . The source instance is intended to represent a finite linear order with the least element in the a node, and the largest element in the c node. Let us see that the reduction is correct. If there exists an N × N tiling, there is a tree satisfying the dependencies: the data values in this tree are just successive numbers from 1 to N , and the function symbol f is interpreted as the tiling function. Conversely, if there is a source tree satisfying the dependencies, there is also one where all data values in the leaves are different: – no new rule will be fired, because the dependencies do not have inequality on the source side; – no rule fired previously will be violated, because no data value is used directly on the target side, and the function f on new values can be defined just like on the values they replace. Let us call these data values d1 , d2 , . . . , dN , according to their order of appear- ance in the single branch of the source tree. The interpretation of the function symbol f , witnessing that the dependencies are satisfied, gives an N × N -tiling. It follows that Comp is undecidable as well, because a source tree exists for T with respect to M if and only if (T0 , T ) ∈ [[M0 ]] · [[M]], where T0 is some fixed tree and M0 is the mapping from the trivial schema allowing only T0 to the source schema of M, whose set of dependencies is empty. t u Thus, to regain decidability we need to restrict the use of =, 6=, and function symbols. But do we have to forbid them entirely, as in [1]? Below we show that the restriction can be much weaker than that. The resulting class of dependencies extends that of Proposition 2 in two ways: the second form allows equalities on the source side and unrestricted use of fresh variables on the target side. Proposition 4. Comp is decidable (2ExpTime-complete) if the second map- ping uses no function symbols and has dependencies of the following two forms: ϕ(x̄) ∧ α6= (x̄) ⇒ ψ(x̄, ȳ) ∧ α= (x̄, ȳ) , ϕ(x̄) ∧ α= (x̄) ⇒ ψ(x̄, ȳ) ∧ α6= (x̄, ȳ) . Proof. Let M, M0 , S, T an instance of Comp. Rather than using Lemma 1, as in Proposition 2, we note that we always evaluate dependencies of M0 over pairs of trees of the form (I, T ). Hence, in each dependency we can replace the pattern ψ(x̄, ȳ) with a formula γ = (x̄, ȳ) that is a disjunction of exponentially many conjunctions of equalities between variables and data values used in T , corresponding to all ways of matching ψ(x̄, ȳ) in T . Moreover, by the safety assumption we know that in each conjunction all variables ȳ occur in at least one equality. Moving all data comparisons to the target side, we get ϕ(x̄) ⇒ ¬α6= (x̄)∨ γ = (x̄, ȳ)∧α= (x̄, ȳ) , ϕ(x̄) ⇒ ¬α= (x̄)∨ γ = (x̄, ȳ)∧α6= (x̄, ȳ) .   Dependencies of the first form can be turned into non-mixing constraints as in the proof of Proposition 2. In dependencies of the second form we can also eliminate variables ȳ. Recall that γ = (x̄, ȳ) = γ1= (x̄, ȳ) ∨ · · · ∨ γm = (x̄, ȳ), and in each γi= (x̄, ȳ) each of the variables ȳ appears in an equality with a data value; we can assume that it is exactly one equality, because otherwise γi= (x̄, ȳ) is not satisfiable and can be dropped. Hence, we can replace γ = (x̄, ȳ) ∧ α6= (x̄, ȳ) with the disjunction of formulas γi= (x̄, ȳ) ∧ α6= (x̄, ȳ) for i = 1, . . . , m. In each γi= (x̄, ȳ) ∧ α6= (x̄, ȳ) we eliminate ȳ independently, replacing all their occurrences with the data values assigned to them by γi= (x̄, ȳ). Because ¬α= (x̄) is equivalent to a positive boolean combination of inequalities, we thus transform dependencies of the second form into non-mixing constraints. Let Σ 0 be the resulting set of constraints. Since M can use function symbols, we can assume that its dependencies do not introduce new variables on the target side. Let ∆ be the set of all terms occurring in M with variables substituted with data values used in S. Restrict D to data values used in S or T , and |∆| fresh data values. Iterate through all consistent valuations of terms in ∆. For each valuation consider formulas ψ(ū) ∧ β(ū) for all dependencies ϕ(x̄)∧α(x̄) ⇒ ψ(x̄)∧β(x̄) of M and all ū ∈ D such that S |= ϕ(ū) ∧ α(ū). It one of β(ū) does not hold, discard this valuation of ∆. Otherwise, let Ψ be the set of all such ψ(ū). Now it remains to check that there exists a tree I conforming to the target schema of M and the source schema of M0 , such that I |= Ψ and I |= Σ 0 . This is essentially an instance of the problem from Theorem 1, except that we have an intersection of two schemas. This is not problematic, because the setting in [6] abstracts schemas as automata, which are closed under intersection and the resulting automaton is quadratic. The total size of patterns from Ψ and constraints from Σ 0 is exponential, but the size of involved patterns has not increased. Hence, the problem is in 2ExpTime. The lower bound holds already for mappings without data comparisons [1]. t u Problem 2. Unrestricted use of either = or 6= on both sides of the second mapping immediately leads to undecidability of Comp [1], but it would be very interesting to see if one can allow dependencies of the form ϕ0 (x̄) ⇒ ψ 0 (x̄, ȳ) ∧ α= (x̄, ȳ) ∧ α6= (x̄, ȳ) , which is not excluded by the existing undecidability results. Problem 3. When the mappings are fixed, the algorithm from Proposition 4 runs in exponential time. One easily transfers the NP lower bound from the relational case [9], but no tighter bound is known. This leaves a substantial gap. 4 Certain answers The central problem of data exchange is answering queries over the target. Given that the solution to a given source instance is typically not unique, the adopted semantics is that of certain answers: we only keep those tuples that are returned over each solution to the given source instance. That is, query answering in the context of data exchange boils down to the following problem: Problem: Cert Input: mapping M, tree pattern query q(x̄), tree S, tuple ā Question: Does T |= q(ā) hold for each solution T for S wrt. M? In [1] it is shown that as soon as one forbids inequality in queries, the problem becomes decidable. This proof goes by reduction to a more fundamental problem of certain answers in incomplete information scenarios [3], which in turn can be seen as an instance of query containment for unions of conjunctive queries over trees in the presence of a schema [4]. In [3] the problem is solved by resorting to first order types of fixed rank, which gives data complexity in coNP, but apparently very high combined complexity, not even determined in the paper. Here we show how to use the results on non-mixing constraints to pinpoint the exact complexity, and also to extend decidability to a larger class of queries. Proposition 5. Cert is decidable ( 2ExpTime-complete) for unions of queries of the form ∃ȳ ϕ(x̄, ȳ) ∧ α= (x̄, ȳ) and ∃ȳ ϕ(x̄, ȳ) ∧ α6= (x̄, ȳ). Proof. Let M, S, q(x̄), ū ∈ D be an instance of Cert as in the statement. Construct a set Σ 0 of non-mixing constraints equivalent to the negation of q(ū): for disjuncts ∃ȳ ϕ(ū, ȳ) ∧ α= (ū, ȳ) and ∃ȳ ϕ(ū, ȳ) ∧ α6= (ū, ȳ), take constraints ϕ(ū, ȳ) ⇒ ¬α= (ū, ȳ) and ϕ(ū, ȳ) ⇒ ¬α6= (ū, ȳ) with negation pushed down to the atoms. Then proceed exactly like in Proposition 4 to reduce to the problem in Theorem 1 and get the 2ExpTime upper bound. The lower bound follows from the proof of Theorem 1. t u The upper bound transfers to the certain answers problem in the incomplete information scenario, as the latter can be easily seen as a special case of Cert; the bound is also tight, by the same argument as above. In fact, for such queries even validity with respect to a schema is 2ExpTime-hard. Problem 4. The lower bound in Proposition 5 requires both = and 6=. For queries using only =, the exact complexity remains an open question, particularly in- triguing for the incomplete information scenario. Problem 5. The upper bound above relies on an algorithm whose data complex- ity is exponential. We also know that there is, at least for queries using only =, a nondeterministic algorithm with polynomial data complexity, but undetermined combined complexity. Can we have an algorithm with both data complexity and combined complexity optimized? What does that even mean in a situation when data complexity is pinpointed in the nondeterministic model, and combined com- plexity in the deterministic model? Acknowledgments. We thank Wojtek Czerwiński and Pawel Parys for fruitful discussions. A large part of this work was done while the second author was visiting researcher at Université Paris-Est Marne-la-Vallée. References 1. Shun’ichi Amano, Claire David, Leonid Libkin, and Filip Murlak. XML schema mappings: Data exchange and metadata management. J. ACM, 61(2):12:1–12:48, 2014. 2. Marcelo Arenas and Leonid Libkin. XML data exchange: Consistency and query answering. J. ACM, 55(2), 2008. 3. Pablo Barceló, Leonid Libkin, Antonella Poggi, and Cristina Sirangelo. XML with incomplete information. J. ACM, 58(1):4, 2010. 4. Henrik Björklund, Wim Martens, and Thomas Schwentick. Optimizing conjunctive queries over trees using schema information. In Proc. MFCS, pages 132–143, 2008. 5. Mikolaj Bojanczyk, Leszek Aleksander Kolodziejczyk, and Filip Murlak. Solutions in XML data exchange. J. Comput. Syst. Sci., 79(6):785–815, 2013. 6. Wojciech Czerwiński, Claire David, Filip Murlak, and Pawel Parys. Reasoning about integrity constraints for tree-structured data. In Proc. ICDT, pages 20:1– 20:18, 2016. 7. Claire David, Piotr Hofman, Filip Murlak, and Michal Pilipczuk. Synthesizing transformations from XML schema mappings. In Proc. ICDT, pages 61–71, 2014. 8. Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. Data ex- change: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005. 9. Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and Wang Chiew Tan. Composing schema mappings: Second-order dependencies to the rescue. ACM Trans. Database Syst., 30(4):994–1055, 2005.