On Incomplete XML Documents with Integrity Constraints Pablo Barceló1 , Leonid Libkin2 , and Juan Reutter2 1 Department of Computer Science, University of Chile 2 School of Informatics, University of Edinburgh Abstract. We consider incomplete specifications of XML documents in the presence of schema information and integrity constraints. We show that integrity constraints such as keys and foreign keys affect consis- tency of such specifications. We prove that the consistency problem for incomplete specifications with keys and foreign keys can always be solved in NP. We then show a dichotomy result, classifying the complexity of the problem as NP-complete or PTIME, depending on the precise set of features used in incomplete descriptions. 1 Introduction While much is known about the transfer and extension of traditional relational tools to XML data, the study of incomplete information in XML has not yet received much attention. Various papers considered specific tasks related to the handling of incomplete information in XML. For example, [2] concentrated on incompleteness arising in a setting where the structure of a document is revealed by a sequence of queries, [10, 11] expressed incompleteness by means of descrip- tion logic theories, and [16] showed how to deal with incompleteness in query results. In relational theory, incompleteness of information has been studied inde- pendently of any particular application, with two seminal papers providing the foundation of the theory of databases with incomplete information. The paper [13] by Imielinski and Lipski introduced tables as a representation mechanism for incompleteness, and studied query evaluation over different types of tables. The paper [1] by Abiteboul, Kanellakis, and Grahne then answered most funda- mental questions related to the complexity of computational problems associated with incompleteness. A recent paper [5] attempted to re-do the basic results of [1, 13] in the XML context. It defined incomplete XML documents, and looked at two basic classes of problems related to them: Representation Given an incomplete description of a document, does it rep- resent some document (i.e., is it consistent)? And can it represent a specific complete document? Querying How does one answer queries posed over incomplete descriptions? Specifically, how does one compute answers to queries in a way that is consis- tent with every document that is represented by the incomplete description, and what is the complexity? While [5] answered these questions for many types of incomplete descriptions, even in the presence of XML schemas, it did not look at the issue of documents with integrity constraints. However, integrity constraints are ubiquitous in the XML context: many documents are generated from databases that typically specify keys and inclusion constraints, and such constraints have now found their way into the standards for describing XML documents. So we would like to see how the key tasks of handling incompleteness in XML behave when integrity constraints enter the picture. In the relational context, it is known that constraints often change the complexity of many tasks, from dependency implication to representation to querying [4, 18, 9]. In this paper we deal with the Representation task. As the membership prob- lem (is a complete document represented by an incomplete description?) is not affected by the presence of constraints, we concentrate on the following: Consistency Problem Given a schema S, an incomplete description of a doc- ument t, and a set ∆ of constraints, are they consistent? That is, is there a complete document T represented by t that conforms to the schema S and satisfies all the constraints in ∆? To see why constraints change the picture for XML with incomplete infor- mation, consider three incomplete descriptions of documents below. r r r a(x) a(x) a(x) a(x) a(x) a(1) b(x) (1) (2) (3) Putting a(x) next to a node means that it is labeled a, and the value of its attribute is not known; putting a(1) means that the value of the attribute is 1. Note that variables can be re-used, i.e., we have naı̈ve nulls. Without integrity constraints, all three descriptions are consistent: we can assign any value to x, and any ordering to children that agrees with the horizon- tal edge we have. Now look at the tree (1) and assume that we have a constraint saying that the attribute of a-nodes is a key. Since we did not specify any re- lationship between the children of the a-nodes in this tree, the description is still consistent, as it is satisfied by a tree with just one a-child of the root. But tree (2) is not consistent: the horizontal edge tells us that there are two distinct a-nodes with the same value of their attribute, which therefore cannot be a key. Now let us look at tree (3). Assume that we have the same key information about a. The description is consistent: one can set x to be any value other than 1. Likewise, if we say that the attribute of b is a key, the description remains consistent. However, if we add an inclusion constraint that attribute values of a-nodes are among the attribute values of b-nodes, the tree is becoming incon- sistent: the key constraint tells us that x 6= 1 and thus the inclusion constraint is violated. In order to restore consistency, a tree that “completes” this description must add a new b-node under the root with attribute value 1. The consistency problem for schemas and constraints (without incomplete- ness) was studied in [12, 3], with the main result stating that it is: – undecidable, if non-unary constraints are used (i.e., n-attribute keys and foreign keys, for n > 1); and – NP-complete for unary constraints. Hence, in the paper we consider only unary constraints. Our main questions are: (a) Do upper bounds on the complexity of the problem continue to hold in the presence of incomplete information? (b) What is the precise complexity of the consistency problem? Our main results answer these questions as follows: (a) For DTDs, unary constraints, and incomplete information, we retain the NP upper bound. (b) With DTDs, even very simple instances of the problem are NP-complete. Without DTDs, we prove a dichotomy theorem, classifying the complexity as either NP-complete or PTIME, depending on what features are allowed in incomplete descriptions. Organization Basic notations are given in Section 2. Incomplete descriptions of XML documents are presented in Section 3. The consistency problem is described in Section 4. We state the main result and prove it. Due to space limitations, some proofs are shown in the appendix. 2 Preliminaries XML documents and DTDs Assume that we have the following disjoint countably infinite sets: – Labels of possible names of element types (that is, node labels in trees); – Attr of attribute names; we precede them with an @ to distinguish them from element types; – I of node ids; and – D of attribute values (e.g., strings). We formally define trees as two-sorted relational structures over node ids and attribute values. For finite sets of labels and attributes, Σ ⊂ Labels and A ⊂ Attr, define the vocabulary τΣ,A = (E, NS, (A@a )@a∈A , (Pℓ )ℓ∈Σ ) where all relations are binary except for the Pℓ ’s, which are unary. A tree T is a 2-sorted structure of vocabulary τΣ,A , i.e. T = hV, D, τΣ,A i, where V ⊂ I is a finite set of node ids, D ⊂ D is a finite set of data values, and – E, NS are the child and the next-sibling relations, so that hV, E, NSi is an ordered unranked tree; we also use E ∗ and NS ∗ to denote their reflexive- transitive closures (respectively, descendant or self, and younger sibling or self). – each A@ai assigns values of attribute @ai to nodes, i.e. it is a subset of V × D such that at most one pair (s, c) is present for each s ∈ V ; – Pℓ are labeling predicates: s ∈ V belongs to Pℓ iff it is labeled ℓ; as usual, we assume that the Pℓ ’s are pairwise disjoint. We refer to a node that is labeled ℓ as an ℓ-node, and to the attribute @a of a node as its @a-attribute. A DTD over a set Σ ⊂ Labels of labels and A ⊂ Attr of attributes is a triple d = (r, ρ, α), where r ∈ Σ, and ρ is a mapping from Σ to regular languages over Σ − {r}, and α is a mapping from Σ to subsets of A. As usual, r is the root, and in a tree T that conforms to d (written as T |= d), for each node s labeled ℓ, the set of labels of its children, read left-to-right, forms a string in the language of ρ(ℓ), and the set of attributes of s is precisely α(ℓ). We assume, for complexity results, that regular languages are given by NFAs. XML Integrity Constraints We consider keys, inclusion constraints and for- eign keys as our integrity constraints. They are the most common constraints in relational databases, and are common in XML as well, as many documents are generated from databases. Moreover, these sets of constraints are similar to, but more general than XML ID/IDREF specifications, and can be used to model most of the key/keyref specifications of XML Schema used in practice [17, 15]. Let Σ ⊂ Labels and A ⊂ Attr, and let T be an XML tree. Then a constraint ϕ over Σ and A is one of the following: – Key ℓ.X → ℓ, where ℓ ∈ Σ and X is a set of attributes from A. The XML tree T satisfies ϕ, denoted by T |= ϕ iff for every ℓ-node s in T , X is contained in the set of attributes of s, and, in addition, T satisfies  ^   ∀x∀y Pℓ (x) ∧ Pℓ (y) ∧ ∃u A@a (x, u) ∧ A@a (y, u) → x = y. @a∈X – Inclusion Constraint ℓ1 [X] ⊆ ℓ2 [Y ], where ℓ1 , ℓ2 ∈ Σ and X = @a1 , . . . , @an and Y = @b1 , . . . , @bn are nonempty lists of attributes from A of the same length. We write T |= ϕ iff for every ℓ1 -node (resp. ℓ2 -node) s in T , X (resp. Y ) is contained in the set of attributes of s, and, in addition, T satisfies  ^ ^    ∀x ∀u1 · · · ∀un Pℓ1 (x) ∧ A@ai (x, ui ) → ∃yPℓ2 (y) ∧ A@bi (y, ui ) . 1≤i≤n 1≤i≤n – Foreign Key: A combination of an inclusion constraint and a key constraint, namely ℓ1 [X] ⊆FK ℓ2 [Y ] if ℓ1 [X] ⊆ ℓ2 [Y ] and ℓ2 .Y → ℓ2 . We write T |= ϕ if T satisfies both the inclusion and the key constraint. As usual, a key ℓ.X → ℓ indicates that two nodes labeled ℓ cannot have the same X-attribute values (i.e., X-attributes uniquely determine the node), an inclusion constraint ℓ1 [X] ⊆ ℓ2 [Y ] indicates that the list of X-attribute values of every ℓ1 node must match the list of Y -attribute values of an ℓ2 -node, and a foreign key ℓ1 [X] ⊆FK ℓ2 [Y ] indicates that X is a foreign key of ℓ1 -nodes referencing the key Y of ℓ2 -nodes. A constraint is called unary if all sets of attributes involved are singletons. That is, unary keys are ℓ.@a → ℓ and unary inclusion constraints are ℓ1 [@a1 ] ⊆ ℓ2 [@a2 ]. Consistency of constraints and DTDs The consistency problem for con- straints and DTDs is as follows: given a DTD d and a set ∆ of keys, inclusion constraints, and foreign keys, is there a tree T so that T |= d and T |= ∆? Of course by T |= ∆ we mean T |= ϕ for every ϕ in ∆. The following is known. Theorem 1 ([12]). 1. The consistency problem for DTDs and constraints is undecidable, even if all sets of attributes involved in constraints have cardinality at most 2. 2. The consistency problem for DTDs and unary constraints is NP-complete. In view of this result, in what follows we only consider unary constraints. 3 XML with Incomplete Information We follow the model of incompleteness in XML proposed in [5]. That model extends, in a natural way, the notion of tables [13] to XML documents. We shall not use every single feature of the model of [5], trying to keep the description reasonable, but the features we consider are sufficient for studying the interaction between incompleteness and constraints. Roughly speaking, incomplete XML trees can occur as a result of missing some of the following information: (a) attribute values (they can be replaced with variables) (b) node labels (they can be replaced by wildcards ); (c) precise vertical relationship between nodes (we can use descendant edges in addition to child edges); (d) precise horizontal relationship between nodes (using younger-sibling edges instead of next-sibling). All these types of incompleteness are represented by means of tree/forest descriptions. An ℓ-node with m attributes will be described as β = ℓ[@a1 = z1 , . . . , @am = zm ], where – ℓ ∈ Σ ∪ { } (label or wildcard); – @a1 , . . . , @am are attribute names, and each zi is a variable, or a constant from D. Incomplete documents are given by means of incomplete tree descriptions (t) and incomplete forest descriptions (f ): t := βhf ihhf ′ ii (1) f, f ′ := ε | t1 θ1 t2 θ2 . . . θk tk+1 | f, f ′ where each ti is a tree, and each θi is either → or →∗ . Informally, a tree βhf ihhf ′ ii has the node denoted by β as the root, a forest f of children, and a forest f ′ of descendants. A forest is either empty, or a sequence of trees with specified → and →∗ relationships between their roots, or a union of forests. For example, the tree (3) from the example in the introduction is given as follows: rhβa(x) → βa(1) , βb(x) i, where βa(x) = a[@a = x], βa(1) = a[@a = 1], and βb(x) = b[@b = x], assuming that the attributes of a- and b-nodes are called @a and @b, respectively. There are two ways to give the semantics: by satisfaction of incomplete de- scriptions in trees, and by homomorphisms between relational representations. We use the former here; both are used, and shown to be equivalent, in [5]. Let z̄ be the set of all variables (nulls) used in t. Given a valuation ν : z̄ → D, and a node s of T , we use the semantic notion (T, ν, s) |= t: intuitively, it means that a complete tree T matches t at node s, if nulls are interpreted according to ν. Then we define Rep(t) = {T | (T, ν, s) |= t for some node s and valuation ν}. We now define (T, ν, s) |= t, as well as (T, ν, S) |= f (which means that T matches f at a set S of roots of subtrees in T ). We assume that ν is the identity when applied to data values from D. – (T, ν, s) |= ℓ[@a1 = z1 , . . . , @am = zm ] iff s is labeled ℓ (if ℓ 6= ) and the value of each attribute @ai of s is ν(zi ) (i.e., (s, ν(zi )) ∈ A@ai ). – (T, ν, s) |= βhf ihhf ′ ii iff (T, ν, s) |= β and there is a set S of children of s such that (T, ν, S) |= f and a set S ′ of descendants of s such that (T, ν, S ′ ) |= f ′ . – (T, ν, ∅) |= ε; – (T, ν, S) |= t1 θ1 t2 θ2 . . . θk tk+1 iff there exists a sequence s1 , . . . , sk+1 of el- ements from S, in which every element from S appears at least once, and such that (T, ν, si ) |= ti for each i ≤ k + 1, and (si , si+1 ) is in NS whenever θi is →, and in NS ∗ whenever θi is →∗ , for each i ≤ k. – (T, ν, S) |= f1 , f2 iff S = S1 ∪ S2 such that (T, ν, Si ) |= fi , for i = 1, 2. The minimum we need to describe a tree structure is the child edges and the union of forests, hence we assume that those are always present. In other words, the minimal grammar we consider is t:=βhf i, f :=ε | t | f, f , with β using only labels from Σ. We refer to incomplete descriptions given by this grammar as basic incomplete trees. Additional features are: – next sibling →; – younger sibling →∗ ; – descendant hhf ii (which is also represented by ↓∗ ); and – wildcard in place of labels. Depending on which of these 4 features are used, we have 16 classes of trees. For example, (→, )-trees refers to the situation when we allow wildcard and only → as θi ’s, and (↓∗ , →, →∗ , )-trees refer to the full grammar (1). 4 Consistency problem As already described in the introduction, we consider the consistency problem for XML incomplete descriptions in the presence of integrity constraints (keys and inclusion dependencies). More formally, let ∆ be a set of unary XML integrity constraints. We consider the following problem: Problem: Consistency(∆) Input: an incomplete description t Question: is there a tree T ∈ Rep(t) so that T |= ∆? We also look at the version with DTDs d, namely: Problem: Consistency(∆, d) Input: an incomplete description t Question: is there a tree T ∈ Rep(t) so that T |= ∆ and T |= d? We classify the complexity of the problem depending on the structure of incomplete trees, ranging from basic incomplete trees (that do not use any of the ↓∗ , →, →∗ , features) to incomplete trees described by the full grammar (1). Since for each of the version of Consistency we have 4 parameters that can be set, we have a total of 32 cases to consider: 16 without DTDs, and 16 with DTDs. For a class of incomplete trees we say that the consistency problem (or the consistency problem with DTDs) is – in PTIME, if it can be solved in PTIME given an input tree from the class; – NP-complete, if it can be solved in NP given an input tree from the class, and, for some fixed set of unary constraints ∆ (and a DTD d for the case of consistency with DTDs), then problem Consistency(∆) (respectively, Consistency(∆, d)) is NP-complete. Our main result is as follows. Theorem 2 (Dichotomy). 1. The consistency problem (without DTDs) is in PTIME for basic incomplete trees, →-incomplete trees and →∗ -incomplete trees; for the remaining 13 classes of incomplete trees it is NP-complete. 2. The consistency problem with DTDs is NP-complete for every class, from basic incomplete trees to (↓∗ , →, →∗ , )-incomplete trees. The proof of the theorem is organized as follows. In section 4.1 we explore the general NP upper bound for the consistency problem, and explore the tractable fragment of (→)-incomplete trees. Afterwards, in section 4.2, we provide tight lower bounds to show that Consistency becomes intractable under the addition of DTDs, wildcards or transitive closures. It should be noticed that all the lower bounds presented in this paper hold if one considers schema definitions that are more expressive than DTDs, such as XSD or Relax NG [14]. Whether the upper bounds still hold is an open question, that will be part of our future work. 4.1 Upper bounds In [5], the authors show a general NP upper bound for the Consistency(d) problem (that is, considering only a fixed DTD and no integrity constraints). Interestingly, we show that the presence of unary integrity constraints in our problem does not alter the complexity of the consistency problem. Theorem 3. Consistency(∆, d) is in NP, for each fixed DTD d and set ∆ of unary XML integrity constraints. Proof sketch: One can prove this by combining two previously known results. The first is the one already mentioned that, for each fixed DTD d, the problem Consistency(d) is in NP [5]. The proof in [5] uses the following fact: If t is an incomplete description and the set Repd (t) = Rep(t) ∩ {T | T |= d} is nonempty (i.e. Consistency(d) is true for t), then Repd (t) contains a tree of polynomial size. The second result that we use is the following: Theorem 4 ([12]). There is a polynomial time algorithm that given a DT D d and a set of unary XML integrity constraints ∆, constructs an integer matrix A and an integer vector b such that there exists an XML tree T that conforms to d and satisfies ∆ if and only if Ax = b has an integer solution. Intuitively, the solution for Ax = b represents the number of nodes in the tree that satisfies d and ∆ that are labeled with each label ℓ in the alphabet. Moreover, it was shown in [12] that the solution of the system Ax = b provides an algorithm for constructing a tree that conforms to d and satisfies ∆. One can prove Theorem 3 by combining the two results mentioned above. Intuitively, an NP algorithm for solving Consistency(∆, d) should do the fol- lowing on input t: 1. Construct from ∆ and d a set of equations Ax = b as shown in Theorem 4. 2. Guess a polynomial size tree T that belongs to Repd (t). 3. Construct in polynomial time a set of linear equations ΓT that represent the shape of T , and augment the set of equations Ax = b with ΓT . Let E be the augmented set of equations. 4. Check whether there is an integer solution for E. Clearly, the whole process can be done in nondeterministic polynomial time. Intuitively, the solutions of the sets of equations E will represent, for every ℓ ∈ Σ, the number of ℓ labeled nodes that must be introduced when extending T into a tree T ′ that conforms to d and satisfies ∆. As usual, the technical details behind this proof are far more complicated that the intuition provided above. We comment more about those technical details in the appendix. 2 Tractable cases. The only tractable cases are obtained when we do not allow DTDs as inputs nor wildcards in the incomplete descriptions, and by severely limiting the features that may allow two formulas in an incomplete description to be witnessed by a same node in a repair: this is the case for (→)-incomplete trees and (→∗ )-incomplete trees. But before proving this result, we make a cru- cial observation about the interaction of inclusion constraints in the consistency problem when no DTD is considered. The following proposition shows that the inclusion constraints can be ignored when checking Consistency w.r.t. incom- plete trees without DTD’s: Proposition 1. For every incomplete description t, and every set ∆ of unary XML constraints, let ∆K be the set of key constraints of ∆ (notice that a foreign key is defined as a union of a key and an integrity constraint). Then Consistency(∆) is true for t if and only if Consistency(∆K ) is true for t. Proof. The direction from left to right is obvious (the same tree will suffice). For the other direction, let t be an incomplete description and ∆ be a set of unary constraints, such that Consistency(∆K ) is true for t, where ∆K is as previously defined. Select a tree T ∈ Rep(t) ∩ {T | T |= ∆K }, and consider a tree T ′ built as follows: for every inclusion constraint ϕ of the form ℓ1 [@a1 ] ⊆ ℓ2 [@a2 ] in ∆, and for every ℓ1 -node s of T such that there is no ℓ2 -node s′ in T with the value of its @a2 -attribute equal to the value of the @a1 -attribute of s, add to T ′ as a child of the root a node s′ that satisfies this property. It is easy to see that T ′ |= ∆K . Further, by the construction of T ′ , it is also the case that T ′ |= ∆. Since T ′ ∈ Rep(t), this finishes the proof of our claim. 2 We now prove the tractable upper bound Proposition 2. There exists a PTIME algorithm for solving Consistency(∆) for (→)-incomplete trees and (→∗ )-incomplete trees. Proof. Notice that, just as with complete XML documents, it is possible to define a relational representation of an incomplete description t. Roughly speak- ing, given an incomplete description t over an alphabet Σ of labels and A of attributes, the relational representation of t, denoted as reℓ(t), is a structure over the vocabulary τΣ,A that is defined as expected, in a way that resembles the shredding of a tree under the well known edge relational representation (see [5] for a precise definition). Thus, given a (→)-incomplete description t and a set of keys ∆ (due to Proposition 1 we do not consider inclusion dependencies), it is possible to define a chase procedure over reℓ(t) so that t is consistent under ∆ if and only if there exists an accepting chase sequence on reℓ(t). The chase sequence will intuitively collapse all formulas in t that are forced by ∆ to represent only one node in every tree T ∈ Rep(t). Thus, for example, if t contains two formulas α1 = ℓ[@a = a, @b = b] and α2 = ℓ[@a = a, @b = x] and ∆ contains the key ℓ.@a → ℓ, the chase will intuitively collapse α1 and α2 so that they now represent only one node, since every tree T that satisfies ∆ cannot have two ℓ-nodes with the same value for their @a-attribute. We omit the formal definition of this procedure and the proof of its correct- ness and soundness, since they can easily be obtained from the chase procedure for incomplete trees defined in [5]. The proof for the case of (→∗ )-incomplete descriptions is analogous, albeit in this case the procedure must take into account the fact that formulas of the form f1 →∗ f2 may be collapsed as well by the chase procedure. 2 4.2 Lower Bounds As the following theorem [5] shows, the consistency problem is already difficult when considering only DTDs (no integrity constraints). Proposition 3 ([5]). There exists a DTD d such that Consistency(d) is NP- complete for basic incomplete trees. Thus, under the presence of DTDs one cannot obtain tractability even for the most basic incomplete descriptions. On the other hand, it is easy to see that the consistency problem is tractable if we do not consider neither DTDs nor schema constraints. In fact, it can be proved that every (↓∗ , →, →∗ , )-incomplete tree is consistent [5]. However, as we shall study, adding integrity constraints to the consistency problem easily yields to intractability. To begin with, the presence of wildcards is surprisingly problematic. Proposition 4. There exists a set of unary keys ∆ such that Consistency(∆) is NP-hard for ( )-incomplete trees. The proof of this result can be found in the appendix. Notably, the incomplete trees constructed in that proof do not make use of the union operator for forests. It follows then that there exists a set of unary keys ∆ such that Consistency(∆) is NP-hard for ( )-incomplete trees in which the union operator is not used. Continuing with the lower bounds, the transitive closure operators prove also to be a problematic feature, even in the absence of the union operator for forests. Proposition 5. There exist sets of unary keys ∆1 and ∆2 such that the problems – Consistency(∆1 ) for (↓∗ )-incomplete trees, and – Consistency(∆2 ) for (→, →∗ )-incomplete trees are NP-hard, even if incomplete trees are not allowed to use the union operator. Proof sketch: We only have to prove hardness. Further, we only show the re- duction for (→, →∗ )-incomplete trees; the reduction for (↓∗ )-incomplete trees is similar. We use a reduction from the shortest common superstring problem. Given a set S = {s1 , . . . , sn } of strings over a fixed alphabet Σ and a positive integer K, the shortest common superstring problem is the problem of deciding whether there exists a string w ∈ Σ ∗ , with |w| ≤ K, such that each string s ∈ S is a substring of w, i.e. w = w0 sw1 for some w0 , w1 ∈ Σ ∗ . First, define Σ ′ to be the alphabet {st, mid, end, r}, and let A be the set of attributes {@id, @e}. We fix ∆ to be the following set of unary keys: {st.@id → st, end.@id → end}. From S we construct a (→, →∗ )-incomplete tree t as follows. The incomplete description t is defined as rhtK →∗ ts1 →∗ . . . →∗ tsn i. Here, tK refers to the tree sthst[@id = 1] → mid → mid . . . → mid → end[@id = 1]i, that contains exactly K nodes labeled mid (we assume that 1 is a data value different from each letter in Σ). Further, for each string si = a1 · · · am of S, the tree tsi is defined as tsi := sthst[@id = 1] →∗ mid[@e = a1 ] → · · · → mid[@e = am ] →∗ end[@id = 1]i. Intuitively, in order to restore the consistency of t with respect to ∆, one must collapse each tree of the form tsi into tK . Since for each node s of the tree there is at most one data value c such that (s, c) belongs to A@e , the fact that this collapse is possible implies that there exists a common superstring of the elements of S of length K. It is not hard to prove then that Rep(t) ∩ {T | T |= ∆} =6 ∅ if and only if there is a superstring of S of length at most K. Details can be found in the appendix. 2 5 Future work Our next goal is to understand the complexity of query answering in the setting of incomplete information and integrity constraints. It was shown in [5] that certain answers can be computed by naı̈ve evaluation for trees whose structure is fully specified, but whose attributes can be assigned null values. This result can be extended to trees without any information on the sibling order, as long as queries do not mention it. Outside of these classes, computing certain answers is intractable. Hence, we plan to understand how the complexity of query evalu- ation is affected for these classes of incomplete trees if keys and/or foreign keys are used. Acknowledgments The first author is supported FONDECYT grant 11080011, the second and the third author by EPSRC grant G049165 and FET-Open project FoX (Foundation of XML). References 1. S. Abiteboul, P. Kanellakis, G. Grahne. On the representation and querying of sets of possible worlds. TCS 78 (1991), 158–187. 2. S. Abiteboul, L. Segoufin, V. Vianu. Representing and querying XML with incom- plete information. In PODS’01, pages 150–161. 3. M. Arenas, W. Fan, L. Libkin. On the complexity of verifying consistency of XML specifications. SIAM J. Comput. 38 (2008), 841–880. 4. P. Atzeni, N. Morfuni. Functional dependencies and constraints on null values in database relations. Information and Control 70(1): 1–31 (1986). 5. P. Barceló, L. Libkin, A. Poggi, C. Sirangelo. XML With incomplete information: Models, Properties and Query Answering. In PODS09 237-246 (2009). 6. M. Benedikt, W. Fan, F. Geerts. XPath satisfiability in the presence of DTDs. J. ACM 55(2): (2008). 7. H. Björklund, W. Martens, T. Schwentick. Conjunctive query containment over trees. DBPL’07, pages 66–80. 8. H. Björklund, W. Martens, T. Schwentick. Optimizing conjunctive queries over trees using schema information. MFCS’08, pages 132–143. 9. A. Calı̀, D. Lembo, R. Rosati. On the decidability and complexity of query an- swering over inconsistent and incomplete databases. PODS’03, pages 260-271. 10. D. Calvanese, G. De Giacomo, M. Lenzerini. Semi-structured data with constraints and incomplete information. In Description Logics, 1998. 11. D. Calvanese, G. De Giacomo, M. Lenzerini. Representing and reasoning on XML documents: a description logic approach. J. Log. Comput. 9 (1999), 295–318. 12. W. Fan and L. Libkin. On XML integrity constraints in the presence of DTDs. J. ACM 49 (2002), 368–406. 13. T. Imielinski, W. Lipski. Incomplete information in relational databases. J. ACM 31 (1984), 761–791. 14. G. Jan Bex, W. Martens, F. Neven, T. Schwentick. Expressiveness of XSDs: from Practice to Theory, There and Back Again. WWW 2005, pages 712-721 (2005). 15. G. Jan Bex, F. Neven, J. Van den Bussche. DTD versus XML Schema: A Practical Study. WEBDB04, pages 79–84 (2004). 16. Y. Kanza, W. Nutt, Y. Sagiv. Querying incomplete information in semistructured data. JCSS 64 (2002), 655–693. 17. A. Laender, M. Moro, C. Nascimento, P. Martins. An X-Ray on Web-Available XML Schemas. SIGMOD Record 38(1) (2009), 37-42. 18. M. Levene, G. Loizou. Axiomatisation of functional dependencies in incomplete relations. Theoretical Computer Science 206 (1998), 283–300. 19. C.A. Tovey. A simplified satisfiability problem. Discrete Appl. Math. 8 (1984), pp. 8589.