<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On Incomplete XML Documents with Integrity Constraints</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pablo Barcelo´</string-name>
          <email>A@a</email>
          <email>A@bi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonid Libkin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Reutter</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Chile</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Informatics, University of Edinburgh</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We consider incomplete specifications of XML documents in the presence of schema information and integrity constraints. We show that integrity constraints such as keys and foreign keys affect consistency of such specifications. We prove that the consistency problem for incomplete specifications with keys and foreign keys can always be solved in NP. We then show a dichotomy result, classifying the complexity of the problem as NP-complete or PTIME, depending on the precise set of features used in incomplete descriptions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        While much is known about the transfer and extension of traditional relational
tools to XML data, the study of incomplete information in XML has not yet
received much attention. Various papers considered specific tasks related to the
handling of incomplete information in XML. For example, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] concentrated on
incompleteness arising in a setting where the structure of a document is revealed
by a sequence of queries, [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] expressed incompleteness by means of
description logic theories, and [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] showed how to deal with incompleteness in query
results.
      </p>
      <p>
        In relational theory, incompleteness of information has been studied
independently of any particular application, with two seminal papers providing the
foundation of the theory of databases with incomplete information. The paper
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] by Imielinski and Lipski introduced tables as a representation mechanism
for incompleteness, and studied query evaluation over different types of tables.
The paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] by Abiteboul, Kanellakis, and Grahne then answered most
fundamental questions related to the complexity of computational problems associated
with incompleteness.
      </p>
      <p>
        A recent paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] attempted to re-do the basic results of [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ] in the XML
context. It defined incomplete XML documents, and looked at two basic classes
of problems related to them:
Representation Given an incomplete description of a document, does it
represent some document (i.e., is it consistent)? And can it represent a specific
complete document?
Querying How does one answer queries posed over incomplete descriptions?
Specifically, how does one compute answers to queries in a way that is
consistent with every document that is represented by the incomplete description,
and what is the complexity?
      </p>
      <p>
        While [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] answered these questions for many types of incomplete descriptions,
even in the presence of XML schemas, it did not look at the issue of documents
with integrity constraints. However, integrity constraints are ubiquitous in the
XML context: many documents are generated from databases that typically
specify keys and inclusion constraints, and such constraints have now found
their way into the standards for describing XML documents.
      </p>
      <p>
        So we would like to see how the key tasks of handling incompleteness in XML
behave when integrity constraints enter the picture. In the relational context,
it is known that constraints often change the complexity of many tasks, from
dependency implication to representation to querying [
        <xref ref-type="bibr" rid="ref18 ref4 ref9">4, 18, 9</xref>
        ].
      </p>
      <p>In this paper we deal with the Representation task. As the membership
problem (is a complete document represented by an incomplete description?) is not
affected by the presence of constraints, we concentrate on the following:
Consistency Problem Given a schema S, an incomplete description of a
document t, and a set Δ of constraints, are they consistent? That is, is there a
complete document T represented by t that conforms to the schema S and
satisfies all the constraints in Δ?</p>
      <p>To see why constraints change the picture for XML with incomplete
information, consider three incomplete descriptions of documents below.
r
r
r</p>
      <p>Putting a(x) next to a node means that it is labeled a, and the value of its
attribute is not known; putting a(1) means that the value of the attribute is 1.
Note that variables can be re-used, i.e., we have na¨ıve nulls.</p>
      <p>Without integrity constraints, all three descriptions are consistent: we can
assign any value to x, and any ordering to children that agrees with the
horizontal edge we have. Now look at the tree (1) and assume that we have a constraint
saying that the attribute of a-nodes is a key. Since we did not specify any
relationship between the children of the a-nodes in this tree, the description is
still consistent, as it is satisfied by a tree with just one a-child of the root. But
tree (2) is not consistent: the horizontal edge tells us that there are two distinct
a-nodes with the same value of their attribute, which therefore cannot be a key.
a(x)
a(x)
a(x)
a(x)
a(x)
b(x)
(1)
(2)</p>
      <p>a(1)
(3)</p>
      <p>Now let us look at tree (3). Assume that we have the same key information
about a. The description is consistent: one can set x to be any value other than
1. Likewise, if we say that the attribute of b is a key, the description remains
consistent. However, if we add an inclusion constraint that attribute values of
a-nodes are among the attribute values of b-nodes, the tree is becoming
inconsistent: the key constraint tells us that x 6= 1 and thus the inclusion constraint is
violated. In order to restore consistency, a tree that “completes” this description
must add a new b-node under the root with attribute value 1.</p>
      <p>
        The consistency problem for schemas and constraints (without
incompleteness) was studied in [
        <xref ref-type="bibr" rid="ref12 ref3">12, 3</xref>
        ], with the main result stating that it is:
– undecidable, if non-unary constraints are used (i.e., n-attribute keys and
foreign keys, for n &gt; 1); and
– NP-complete for unary constraints.
      </p>
      <p>Hence, in the paper we consider only unary constraints. Our main questions are:
(a) Do upper bounds on the complexity of the problem continue to hold in the
presence of incomplete information?
(b) What is the precise complexity of the consistency problem?</p>
      <sec id="sec-1-1">
        <title>Our main results answer these questions as follows:</title>
        <p>(a) For DTDs, unary constraints, and incomplete information, we retain the NP
upper bound.
(b) With DTDs, even very simple instances of the problem are NP-complete.</p>
        <p>Without DTDs, we prove a dichotomy theorem, classifying the complexity
as either NP-complete or PTIME, depending on what features are allowed
in incomplete descriptions.</p>
        <p>Organization Basic notations are given in Section 2. Incomplete descriptions of
XML documents are presented in Section 3. The consistency problem is described
in Section 4. We state the main result and prove it. Due to space limitations,
some proofs are shown in the appendix.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>XML documents and DTDs Assume that we have the following disjoint
countably infinite sets:
– Labels of possible names of element types (that is, node labels in trees);
– Attr of attribute names; we precede them with an @ to distinguish them
from element types;
– I of node ids; and
– D of attribute values (e.g., strings).</p>
      <p>We formally define trees as two-sorted relational structures over node ids and
attribute values.</p>
      <p>For finite sets of labels and attributes, Σ ⊂ Labels and A ⊂ Attr , define the
vocabulary
where all relations are binary except for the Pℓ’s, which are unary. A tree T is
a 2-sorted structure of vocabulary τΣ,A, i.e. T = hV, D, τΣ,Ai, where V ⊂ I is a
finite set of node ids, D ⊂ D is a finite set of data values, and
– E, NS are the child and the next-sibling relations, so that hV, E, NSi is an
ordered unranked tree; we also use E∗ and NS∗ to denote their
reflexivetransitive closures (respectively, descendant or self, and younger sibling or
self).
– each A@ai assigns values of attribute @ai to nodes, i.e. it is a subset of V × D
such that at most one pair (s, c) is present for each s ∈ V ;
– Pℓ are labeling predicates: s ∈ V belongs to Pℓ iff it is labeled ℓ; as usual,
we assume that the Pℓ’s are pairwise disjoint.</p>
      <p>We refer to a node that is labeled ℓ as an ℓ-node, and to the attribute @a of
a node as its @a-attribute.</p>
      <p>A DTD over a set Σ ⊂ Labels of labels and A ⊂ Attr of attributes is a triple
d = (r, ρ, α), where r ∈ Σ, and ρ is a mapping from Σ to regular languages over
Σ − {r}, and α is a mapping from Σ to subsets of A. As usual, r is the root, and
in a tree T that conforms to d (written as T |= d), for each node s labeled ℓ, the
set of labels of its children, read left-to-right, forms a string in the language of
ρ(ℓ), and the set of attributes of s is precisely α(ℓ). We assume, for complexity
results, that regular languages are given by NFAs.</p>
      <p>
        XML Integrity Constraints We consider keys, inclusion constraints and
foreign keys as our integrity constraints. They are the most common constraints in
relational databases, and are common in XML as well, as many documents are
generated from databases. Moreover, these sets of constraints are similar to, but
more general than XML ID/IDREF specifications, and can be used to model
most of the key/keyref specifications of XML Schema used in practice [
        <xref ref-type="bibr" rid="ref15 ref17">17, 15</xref>
        ].
      </p>
      <p>Let Σ ⊂ Labels and A ⊂ Attr , and let T be an XML tree. Then a constraint
ϕ over Σ and A is one of the following:
– Key ℓ.X → ℓ, where ℓ ∈ Σ and X is a set of attributes from A. The XML tree
T satisfies ϕ, denoted by T |= ϕ iff for every ℓ-node s in T , X is contained
in the set of attributes of s, and, in addition, T satisfies</p>
      <p>^
∀x∀y Pℓ(x) ∧ Pℓ(y) ∧
→ x = y.
– Inclusion Constraint ℓ1[X ] ⊆ ℓ2[Y ], where ℓ1, ℓ2 ∈ Σ and X = @a1, . . . , @an
and Y = @b1, . . . , @bn are nonempty lists of attributes from A of the same
length. We write T |= ϕ iff for every ℓ1-node (resp. ℓ2-node) s in T , X (resp.</p>
      <p>Y ) is contained in the set of attributes of s, and, in addition, T satisfies
∀x ∀u1 · · · ∀un</p>
      <p>Pℓ1(x) ∧ ^
1≤i≤n</p>
      <p>A@ai(x, ui) → ∃yPℓ2(y) ∧ ^
1≤i≤n
– Foreign Key: A combination of an inclusion constraint and a key constraint,
namely ℓ1[X] ⊆FK ℓ2[Y ] if ℓ1[X] ⊆ ℓ2[Y ] and ℓ2.Y → ℓ2. We write T |= ϕ if
T satisfies both the inclusion and the key constraint.</p>
      <p>As usual, a key ℓ.X → ℓ indicates that two nodes labeled ℓ cannot have the
same X-attribute values (i.e., X-attributes uniquely determine the node), an
inclusion constraint ℓ1[X] ⊆ ℓ2[Y ] indicates that the list of X-attribute values
of every ℓ1 node must match the list of Y -attribute values of an ℓ2-node, and
a foreign key ℓ1[X] ⊆FK ℓ2[Y ] indicates that X is a foreign key of ℓ1-nodes
referencing the key Y of ℓ2-nodes.</p>
      <p>A constraint is called unary if all sets of attributes involved are singletons.
That is, unary keys are ℓ.@a → ℓ and unary inclusion constraints are ℓ1[@a1] ⊆
ℓ2[@a2].</p>
      <sec id="sec-2-1">
        <title>Consistency of constraints and DTDs The consistency problem for con</title>
        <p>straints and DTDs is as follows: given a DTD d and a set Δ of keys, inclusion
constraints, and foreign keys, is there a tree T so that T |= d and T |= Δ? Of
course by T |= Δ we mean T |= ϕ for every ϕ in Δ.</p>
        <p>The following is known.</p>
        <p>
          Theorem 1 ([
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]).
1. The consistency problem for DTDs and constraints is undecidable, even if
all sets of attributes involved in constraints have cardinality at most 2.
2. The consistency problem for DTDs and unary constraints is NP-complete.
        </p>
        <p>In view of this result, in what follows we only consider unary constraints.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>XML with Incomplete Information</title>
      <p>
        We follow the model of incompleteness in XML proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. That model
extends, in a natural way, the notion of tables [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to XML documents. We shall
not use every single feature of the model of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], trying to keep the description
reasonable, but the features we consider are sufficient for studying the interaction
between incompleteness and constraints.
      </p>
      <p>Roughly speaking, incomplete XML trees can occur as a result of missing
some of the following information:
(a) attribute values (they can be replaced with variables)
(b) node labels (they can be replaced by wildcards );
(c) precise vertical relationship between nodes (we can use descendant edges in
addition to child edges);
(d) precise horizontal relationship between nodes (using younger-sibling edges
instead of next-sibling).</p>
      <p>All these types of incompleteness are represented by means of tree/forest
descriptions. An ℓ-node with m attributes will be described as β = ℓ[@a1 =
z1, . . . , @am = zm], where
– ℓ ∈ Σ ∪ { } (label or wildcard);
– @a1, . . . , @am are attribute names, and each zi is a variable, or a constant
from D.</p>
      <p>Incomplete documents are given by means of incomplete tree descriptions (t)
and incomplete forest descriptions (f ):</p>
      <p>t := βhf ihhf ′ii
f, f ′ := ε | t1θ1t2θ2 . . . θktk+1 | f, f ′
(1)
where each ti is a tree, and each θi is either → or →∗. Informally, a tree βhf ihhf ′ii
has the node denoted by β as the root, a forest f of children, and a forest f ′
of descendants. A forest is either empty, or a sequence of trees with specified →
and →∗ relationships between their roots, or a union of forests.</p>
      <p>For example, the tree (3) from the example in the introduction is given as
follows:</p>
      <p>rhβa(x) → βa(1), βb(x)i,
where βa(x) = a[@a = x], βa(1) = a[@a = 1], and βb(x) = b[@b = x], assuming
that the attributes of a- and b-nodes are called @a and @b, respectively.</p>
      <p>
        There are two ways to give the semantics: by satisfaction of incomplete
descriptions in trees, and by homomorphisms between relational representations.
We use the former here; both are used, and shown to be equivalent, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Let z¯ be the set of all variables (nulls) used in t. Given a valuation ν : z¯ → D,
and a node s of T , we use the semantic notion (T , ν, s) |= t: intuitively, it means
that a complete tree T matches t at node s, if nulls are interpreted according to
ν. Then we define</p>
      <p>Rep(t) = {T | (T , ν, s) |= t for some node s and valuation ν}.</p>
      <p>We now define (T , ν, s) |= t, as well as (T , ν, S) |= f (which means that T
matches f at a set S of roots of subtrees in T ). We assume that ν is the identity
when applied to data values from D.</p>
      <p>– (T , ν, s) |= ℓ[@a1 = z1, . . . , @am = zm] iff s is labeled ℓ (if ℓ 6= ) and the
value of each attribute @ai of s is ν(zi) (i.e., (s, ν(zi)) ∈ A@ai ).
– (T , ν, s) |= βhf ihhf ′ii iff (T , ν, s) |= β and there is a set S of children of s such
that (T , ν, S) |= f and a set S′ of descendants of s such that (T , ν, S′) |= f ′.
– (T , ν, ∅) |= ε;
– (T , ν, S) |= t1θ1t2θ2 . . . θktk+1 iff there exists a sequence s1, . . . , sk+1 of
elements from S, in which every element from S appears at least once, and
such that (T , ν, si) |= ti for each i ≤ k + 1, and (si, si+1) is in NS whenever
θi is →, and in NS∗ whenever θi is →∗, for each i ≤ k.
– (T , ν, S) |= f1, f2 iff S = S1 ∪ S2 such that (T , ν, Si) |= fi, for i = 1, 2.</p>
      <p>The minimum we need to describe a tree structure is the child edges and the
union of forests, hence we assume that those are always present. In other words,
the minimal grammar we consider is t:=βhf i, f :=ε | t | f, f , with β using only
labels from Σ. We refer to incomplete descriptions given by this grammar as
basic incomplete trees.</p>
      <p>Additional features are:
– next sibling →;
– younger sibling →∗;
– descendant hhf ii (which is also represented by ↓∗); and
– wildcard in place of labels.</p>
      <p>Depending on which of these 4 features are used, we have 16 classes of trees. For
example, (→, )-trees refers to the situation when we allow wildcard and only →
as θi’s, and (↓∗, →, →∗, )-trees refer to the full grammar (1).
4</p>
    </sec>
    <sec id="sec-4">
      <title>Consistency problem</title>
      <p>As already described in the introduction, we consider the consistency problem for
XML incomplete descriptions in the presence of integrity constraints (keys and
inclusion dependencies). More formally, let Δ be a set of unary XML integrity
constraints. We consider the following problem:</p>
      <p>Problem: Consistency(Δ)
Input: an incomplete description t</p>
      <p>Question: is there a tree T ∈ Rep(t) so that T |= Δ?</p>
      <sec id="sec-4-1">
        <title>We also look at the version with DTDs d, namely:</title>
        <p>Problem: Consistency(Δ, d)
Input: an incomplete description t</p>
        <p>Question: is there a tree T ∈ Rep(t) so that T |= Δ and T |= d?
We classify the complexity of the problem depending on the structure of
incomplete trees, ranging from basic incomplete trees (that do not use any of
the ↓∗, →, →∗, features) to incomplete trees described by the full grammar (1).
Since for each of the version of Consistency we have 4 parameters that can
be set, we have a total of 32 cases to consider: 16 without DTDs, and 16 with
DTDs.</p>
        <p>For a class of incomplete trees we say that the consistency problem (or the
consistency problem with DTDs) is
– in PTIME, if it can be solved in PTIME given an input tree from the class;
– NP-complete, if it can be solved in NP given an input tree from the class,
and, for some fixed set of unary constraints Δ (and a DTD d for the case
of consistency with DTDs), then problem Consistency(Δ) (respectively,
Consistency(Δ, d)) is NP-complete.</p>
        <p>Our main result is as follows.</p>
        <sec id="sec-4-1-1">
          <title>Theorem 2 (Dichotomy).</title>
          <p>1. The consistency problem (without DTDs) is in PTIME for basic incomplete
trees, →-incomplete trees and →∗-incomplete trees; for the remaining 13
classes of incomplete trees it is NP-complete.
2. The consistency problem with DTDs is NP-complete for every class, from
basic incomplete trees to (↓∗, →, →∗, )-incomplete trees.</p>
          <p>
            The proof of the theorem is organized as follows. In section 4.1 we explore the
general NP upper bound for the consistency problem, and explore the tractable
fragment of (→)-incomplete trees. Afterwards, in section 4.2, we provide tight
lower bounds to show that Consistency becomes intractable under the addition
of DTDs, wildcards or transitive closures. It should be noticed that all the lower
bounds presented in this paper hold if one considers schema definitions that are
more expressive than DTDs, such as XSD or Relax NG [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. Whether the upper
bounds still hold is an open question, that will be part of our future work.
4.1
          </p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Upper bounds</title>
          <p>
            In [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], the authors show a general NP upper bound for the Consistency(d)
problem (that is, considering only a fixed DTD and no integrity constraints).
Interestingly, we show that the presence of unary integrity constraints in our
problem does not alter the complexity of the consistency problem.
Theorem 3. Consistency(Δ, d) is in NP, for each fixed DTD d and set Δ of
unary XML integrity constraints.
          </p>
          <p>
            Proof sketch: One can prove this by combining two previously known results.
The first is the one already mentioned that, for each fixed DTD d, the problem
Consistency(d) is in NP [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. The proof in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] uses the following fact: If t is an
incomplete description and the set Repd(t) = Rep(t) ∩ {T | T |= d} is nonempty
(i.e. Consistency(d) is true for t), then Repd(t) contains a tree of polynomial
size.
          </p>
          <p>
            The second result that we use is the following:
Theorem 4 ([
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]). There is a polynomial time algorithm that given a DT D d
and a set of unary XML integrity constraints Δ, constructs an integer matrix A
and an integer vector b such that there exists an XML tree T that conforms to
d and satisfies Δ if and only if Ax = b has an integer solution.
          </p>
          <p>
            Intuitively, the solution for Ax = b represents the number of nodes in the
tree that satisfies d and Δ that are labeled with each label ℓ in the alphabet.
Moreover, it was shown in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] that the solution of the system Ax = b provides
an algorithm for constructing a tree that conforms to d and satisfies Δ.
          </p>
          <p>One can prove Theorem 3 by combining the two results mentioned above.
Intuitively, an NP algorithm for solving Consistency(Δ, d) should do the
following on input t:</p>
          <p>Clearly, the whole process can be done in nondeterministic polynomial time.
Intuitively, the solutions of the sets of equations E will represent, for every
ℓ ∈ Σ, the number of ℓ labeled nodes that must be introduced when extending
T into a tree T ′ that conforms to d and satisfies Δ. As usual, the technical
details behind this proof are far more complicated that the intuition provided
above. We comment more about those technical details in the appendix. 2
Tractable cases. The only tractable cases are obtained when we do not allow
DTDs as inputs nor wildcards in the incomplete descriptions, and by severely
limiting the features that may allow two formulas in an incomplete description
to be witnessed by a same node in a repair: this is the case for (→)-incomplete
trees and (→∗)-incomplete trees. But before proving this result, we make a
crucial observation about the interaction of inclusion constraints in the consistency
problem when no DTD is considered. The following proposition shows that the
inclusion constraints can be ignored when checking Consistency w.r.t.
incomplete trees without DTD’s:
Proposition 1. For every incomplete description t, and every set Δ of unary
XML constraints, let ΔK be the set of key constraints of Δ (notice that a
foreign key is defined as a union of a key and an integrity constraint). Then
Consistency(Δ) is true for t if and only if Consistency(ΔK ) is true for t.
Proof. The direction from left to right is obvious (the same tree will suffice).
For the other direction, let t be an incomplete description and Δ be a set of
unary constraints, such that Consistency(ΔK ) is true for t, where ΔK is as
previously defined. Select a tree T ∈ Rep(t) ∩ {T | T |= ΔK }, and consider a tree
T ′ built as follows: for every inclusion constraint ϕ of the form ℓ1[@a1] ⊆ ℓ2[@a2]
in Δ, and for every ℓ1-node s of T such that there is no ℓ2-node s′ in T with the
value of its @a2-attribute equal to the value of the @a1-attribute of s, add to T ′
as a child of the root a node s′ that satisfies this property. It is easy to see that
T ′ |= ΔK . Further, by the construction of T ′, it is also the case that T ′ |= Δ.
Since T ′ ∈ Rep(t), this finishes the proof of our claim. 2</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>We now prove the tractable upper bound</title>
        <p>Proposition 2. There exists a PTIME algorithm for solving Consistency(Δ)
for (→)-incomplete trees and (→∗)-incomplete trees.</p>
        <p>
          Proof. Notice that, just as with complete XML documents, it is possible to
define a relational representation of an incomplete description t. Roughly
speaking, given an incomplete description t over an alphabet Σ of labels and A of
attributes, the relational representation of t, denoted as reℓ(t), is a structure
over the vocabulary τΣ,A that is defined as expected, in a way that resembles
the shredding of a tree under the well known edge relational representation (see
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for a precise definition).
        </p>
        <p>Thus, given a (→)-incomplete description t and a set of keys Δ (due to
Proposition 1 we do not consider inclusion dependencies), it is possible to define
a chase procedure over reℓ(t) so that t is consistent under Δ if and only if there
exists an accepting chase sequence on reℓ(t).</p>
        <p>The chase sequence will intuitively collapse all formulas in t that are forced
by Δ to represent only one node in every tree T ∈ Rep(t). Thus, for example,
if t contains two formulas α1 = ℓ[@a = a, @b = b] and α2 = ℓ[@a = a, @b = x]
and Δ contains the key ℓ.@a → ℓ, the chase will intuitively collapse α1 and α2
so that they now represent only one node, since every tree T that satisfies Δ
cannot have two ℓ-nodes with the same value for their @a-attribute.</p>
        <p>
          We omit the formal definition of this procedure and the proof of its
correctness and soundness, since they can easily be obtained from the chase procedure
for incomplete trees defined in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>The proof for the case of (→∗)-incomplete descriptions is analogous, albeit
in this case the procedure must take into account the fact that formulas of the
form f1 →∗ f2 may be collapsed as well by the chase procedure. 2
4.2</p>
        <sec id="sec-4-2-1">
          <title>Lower Bounds</title>
          <p>
            As the following theorem [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] shows, the consistency problem is already difficult
when considering only DTDs (no integrity constraints).
          </p>
          <p>
            Proposition 3 ([
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]). There exists a DTD d such that Consistency(d) is
NPcomplete for basic incomplete trees.
          </p>
          <p>
            Thus, under the presence of DTDs one cannot obtain tractability even for the
most basic incomplete descriptions. On the other hand, it is easy to see that the
consistency problem is tractable if we do not consider neither DTDs nor schema
constraints. In fact, it can be proved that every (↓∗, →, →∗, )-incomplete tree
is consistent [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. However, as we shall study, adding integrity constraints to the
consistency problem easily yields to intractability. To begin with, the presence
of wildcards is surprisingly problematic.
          </p>
          <p>Proposition 4. There exists a set of unary keys Δ such that Consistency(Δ)
is NP-hard for ( )-incomplete trees.</p>
          <p>The proof of this result can be found in the appendix. Notably, the incomplete
trees constructed in that proof do not make use of the union operator for forests.
It follows then that there exists a set of unary keys Δ such that Consistency(Δ)
is NP-hard for ( )-incomplete trees in which the union operator is not used.</p>
          <p>Continuing with the lower bounds, the transitive closure operators prove also
to be a problematic feature, even in the absence of the union operator for forests.
Proposition 5. There exist sets of unary keys Δ1and Δ2 such that the problems
– Consistency(Δ1) for (↓∗)-incomplete trees, and
– Consistency(Δ2) for (→, →∗)-incomplete trees
are NP-hard, even if incomplete trees are not allowed to use the union operator.
Proof sketch: We only have to prove hardness. Further, we only show the
reduction for (→, →∗)-incomplete trees; the reduction for (↓∗)-incomplete trees
is similar. We use a reduction from the shortest common superstring problem.
Given a set S = {s1, . . . , sn} of strings over a fixed alphabet Σ and a positive
integer K, the shortest common superstring problem is the problem of deciding
whether there exists a string w ∈ Σ∗, with |w| ≤ K, such that each string s ∈ S
is a substring of w, i.e. w = w0sw1 for some w0, w1 ∈ Σ∗.</p>
          <p>First, define Σ′ to be the alphabet {st, mid, end, r}, and let A be the set of
attributes {@id, @e}. We fix Δ to be the following set of unary keys: {st.@id →
st, end.@id → end}.</p>
          <p>From S we construct a (→, →∗)-incomplete tree t as follows. The incomplete
description t is defined as rhtK →∗ ts1 →∗ . . . →∗ tsn i. Here, tK refers to the
tree sthst[@id = 1] → mid → mid . . . → mid → end[@id = 1]i, that contains
exactly K nodes labeled mid (we assume that 1 is a data value different from
each letter in Σ). Further, for each string si = a1 · · · am of S, the tree tsi is
defined as
Intuitively, in order to restore the consistency of t with respect to Δ, one
must collapse each tree of the form tsi into tK . Since for each node s of
the tree there is at most one data value c such that (s, c) belongs to A@e,
the fact that this collapse is possible implies that there exists a common
superstring of the elements of S of length K. It is not hard to prove then that
Rep(t) ∩ {T | T |= Δ} 6= ∅ if and only if there is a superstring of S of length at
most K. Details can be found in the appendix. 2
5</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Future work</title>
      <p>
        Our next goal is to understand the complexity of query answering in the setting
of incomplete information and integrity constraints. It was shown in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that
certain answers can be computed by na¨ıve evaluation for trees whose structure
is fully specified, but whose attributes can be assigned null values. This result
can be extended to trees without any information on the sibling order, as long as
queries do not mention it. Outside of these classes, computing certain answers
is intractable. Hence, we plan to understand how the complexity of query
evaluation is affected for these classes of incomplete trees if keys and/or foreign keys
are used.
      </p>
      <p>Acknowledgments The first author is supported FONDECYT grant 11080011, the
second and the third author by EPSRC grant G049165 and FET-Open project FoX
(Foundation of XML).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kanellakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Grahne</surname>
          </string-name>
          .
          <article-title>On the representation and querying of sets of possible worlds</article-title>
          .
          <source>TCS</source>
          <volume>78</volume>
          (
          <year>1991</year>
          ),
          <fpage>158</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Segoufin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vianu</surname>
          </string-name>
          .
          <article-title>Representing and querying XML with incomplete information</article-title>
          .
          <source>In PODS'01</source>
          , pages
          <fpage>150</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>On the complexity of verifying consistency of XML specifications</article-title>
          .
          <source>SIAM J. Comput</source>
          .
          <volume>38</volume>
          (
          <year>2008</year>
          ),
          <fpage>841</fpage>
          -
          <lpage>880</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Atzeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Morfuni</surname>
          </string-name>
          .
          <article-title>Functional dependencies and constraints on null values in database relations</article-title>
          .
          <source>Information and Control</source>
          <volume>70</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. P. Barcel´o,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sirangelo</surname>
          </string-name>
          .
          <article-title>XML With incomplete information: Models, Properties and Query Answering</article-title>
          . In PODS09 237-
          <fpage>246</fpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M.</given-names>
            <surname>Benedikt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          .
          <article-title>XPath satisfiability in the presence of DTDs</article-title>
          .
          <source>J. ACM</source>
          <volume>55</volume>
          (
          <issue>2</issue>
          )
          <string-name>
            <surname>:</surname>
          </string-name>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. H. Bj¨orklund, W. Martens,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schwentick</surname>
          </string-name>
          .
          <article-title>Conjunctive query containment over trees</article-title>
          .
          <source>DBPL'07</source>
          , pages
          <fpage>66</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. H. Bj¨orklund, W. Martens,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schwentick</surname>
          </string-name>
          .
          <article-title>Optimizing conjunctive queries over trees using schema information</article-title>
          .
          <source>MFCS'08</source>
          , pages
          <fpage>132</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. A. Cal`ı, D. Lembo,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          .
          <article-title>On the decidability and complexity of query answering over inconsistent and incomplete databases</article-title>
          .
          <source>PODS'03</source>
          , pages
          <fpage>260</fpage>
          -
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          .
          <article-title>Semi-structured data with constraints and incomplete information</article-title>
          .
          <source>In Description Logics</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          .
          <article-title>Representing and reasoning on XML documents: a description logic approach</article-title>
          .
          <source>J. Log. Comput</source>
          .
          <volume>9</volume>
          (
          <issue>1999</issue>
          ),
          <fpage>295</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>On XML integrity constraints in the presence of DTDs</article-title>
          .
          <source>J. ACM</source>
          <volume>49</volume>
          (
          <year>2002</year>
          ),
          <fpage>368</fpage>
          -
          <lpage>406</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>T.</given-names>
            <surname>Imielinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lipski</surname>
          </string-name>
          .
          <article-title>Incomplete information in relational databases</article-title>
          .
          <source>J. ACM</source>
          <volume>31</volume>
          (
          <year>1984</year>
          ),
          <fpage>761</fpage>
          -
          <lpage>791</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>G.</given-names>
            <surname>Jan Bex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schwentick</surname>
          </string-name>
          .
          <article-title>Expressiveness of XSDs: from Practice to Theory, There and Back Again</article-title>
          .
          <source>WWW</source>
          <year>2005</year>
          , pages
          <fpage>712</fpage>
          -
          <lpage>721</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>G.</given-names>
            <surname>Jan Bex</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Van den Bussche.</surname>
          </string-name>
          <article-title>DTD versus XML Schema: A Practical Study</article-title>
          .
          <source>WEBDB04</source>
          , pages
          <fpage>79</fpage>
          -
          <lpage>84</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kanza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Nutt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sagiv</surname>
          </string-name>
          .
          <article-title>Querying incomplete information in semistructured data</article-title>
          .
          <source>JCSS</source>
          <volume>64</volume>
          (
          <year>2002</year>
          ),
          <fpage>655</fpage>
          -
          <lpage>693</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>A.</given-names>
            <surname>Laender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nascimento</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martins</surname>
          </string-name>
          .
          <article-title>An X-Ray on Web-Available XML Schemas</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>38</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          ),
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>M. Levene</surname>
            ,
            <given-names>G. Loizou.</given-names>
          </string-name>
          <article-title>Axiomatisation of functional dependencies in incomplete relations</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>206</volume>
          (
          <year>1998</year>
          ),
          <fpage>283</fpage>
          -
          <lpage>300</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>A</article-title>
          .
          <string-name>
            <surname>Tovey</surname>
          </string-name>
          .
          <article-title>A simplified satisfiability problem</article-title>
          .
          <source>Discrete Appl. Math. 8</source>
          (
          <issue>1984</issue>
          ), pp.
          <fpage>8589</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>