=Paper= {{Paper |id=None |storemode=property |title=A Direct Translation from XPath to Nondeterministic Automata |pdfUrl=https://ceur-ws.org/Vol-749/paper7.pdf |volume=Vol-749 |dblpUrl=https://dblp.org/rec/conf/amw/FrancisDL11 }} ==A Direct Translation from XPath to Nondeterministic Automata== https://ceur-ws.org/Vol-749/paper7.pdf
          A Direct Translation from XPath to
             Nondeterministic Automata

              Nadime Francis1 , Claire David2 , and Leonid Libkin3
                                     1
                                      ENS-Cachan
                          2
                              Univ Paris-Est Marne-la-Vallée
                                 3
                                   Univ of Edinburgh



      Abstract. Since navigational aspects of XPath correspond to first-order
      definability, it has been proposed to use the analogy with the very suc-
      cessful technique of translating LTL into automata, and produce efficient
      translations of XPath queries into automata on unranked trees. These
      translations can then be used for a variety of reasoning tasks such as
      XPath consistency, or optimization, under XML schema constraints. In
      the verification scenarios, translations into both nondeterministic and al-
      ternating automata are used. But while a direct translation from XPath
      into alternating automata is known, only an indirect translation into non-
      deterministic automata - going via intermediate logics - exists. A direct
      translation is desirable as most XML specifications have particularly nice
      translations into nondeterministic automata and it is natural to use such
      automata to reason about XPath and schemas. The goal of the paper
      is to produce such a direct translation of XPath into nondeterministic
      automata.


1   Introduction
XML reasoning tasks have been addressed in many recent papers. Typical prob-
lems include consistency of type declarations and constraints [1, 2, 16], or of
schema specifications and navigational properties [3, 11] or containment of XPath
expressions [4, 5, 10, 18, 26]. Static analysis tasks are often addressed using logic
and automata-based techniques [6]. Thus, due to the well-known correspondences
between XML formalisms and logics and automata on trees, it is natural to apply
logic and automata techniques to reasoning about XML.
     Indeed, several attempts have been made to find generic logic/automata tools
applicable in a variety of XML reasoning tasks. The idea of such approaches is
roughly as follows. Suppose the task at hand is the well-known XPath contain-
ment problem under DTDs: given a DTD d and two XPath expressions e1 and
e2 , is it true that d |= e1 ⊆ e2 ? In other words, is it true that for each XML tree
T that conforms to d, the expression e1 selects a subset of nodes selected by e2 ?
Suppose we can somehow capture by an automaton Ae the set of nodes selected
by the expression e1 ∧ ¬e2 , i.e., counterexamples to containment. Schemas such
as DTDs, Relax NG, and others, are easily translated into tree automata (in
fact most designs of schemas for XML are based on tree automata). So assume
that we have an automaton Ad that accepts trees that conform to d. Then to
see whether d |= e1 ⊆ e2 one simply checks whether

                                 L(Ad × Ae ) 6= ∅

since a tree in L(Ad × Ae ) is a counterexample to d |= e1 ⊆ e2 . Throughout the
paper, L(A) is the language accepted by automaton A.
     This is essentially the same as the main idea of automata-based verification
[6], whose goal is to check whether a program P satisfies a specification ϕ. One
views an abstraction of P as an automaton AP , constructs an automaton A¬ϕ
capturing counterexamples to the specification (i.e., program executions that do
not satisfy ϕ), and checks for nonemptiness of L(AP × A¬ϕ ). If the language is
nonempty, it contains counterexamples to the specification, i.e., bugs in P .
     In the verification context, one most commonly deals with automata over ω-
words, as they properly capture traces of program behavior and specifications.
In the XML context, we deal with queries on finite trees, and finite (unranked)
tree automata. The idea of adapting verification techniques to static analysis of
XML is therefore very natural and it has been explored in a number of papers
[5, 7, 10, 11, 13, 15, 18].
     In the classical verification scenario, the specification logic is most often LTL
(linear temporal logic), which happens to have the same expressive power as
first-order logic (FO). In our context, we deal with XPath, whose navigational
capabilities are expressible in FO and which, with an addition of a certain condi-
tional construct, captures FO [19]. If we want to adapt existing LTL-to-automata
translations, there are two main approaches to it: translating formulae into non-
deterministic, or alternating automata [30, 29]. The difference between these ap-
proaches is where the complexity lies – in the translation itself, or in checking
nonemptiness of languages defined by automata:
             PP
                 PAutomata
                   PP              Nondeterministic Alternating
             Task       PP
                         P
             Translation             exponential    polynomial
             Nonemptiness checking   polynomial     exponential

    Both techniques have their advantages; the choice is really about the task
to which algorithmic and optimization techniques will be applied. In the XML
context, there are at least two reasons to seriously consider translating into
nondeterministic automata. First, most schema formalisms come from nonde-
terministic tree automata, so these translations appear to be better suited for
reasoning tasks involving schemas. Second, one may hope to use a rich arsenal
of optimization tools developed for the original Vardi-Wolper translation [30]
of LTL into nondeterministic Büchi automata [8, 30, 12] and adapt it for XML
reasoning tasks.
    Both approaches have been explored in the XML context. In [5], a direct
translation from XPath to a certain type of alternating automata on unranked
trees is given. In [15], translations into nondeterministic automata are explored,
but the logic used for translation is not XPath, but rather an LTL-like logic
on trees, first introduced in [25] and used later in [19]. While the translation
is single-exponential, this is not satisfactory since the translation from XPath
into that logic is single-exponential itself! Curiously enough, this does not make
the whole translation from XPath into automata double-exponential, but show-
ing this requires a careful analysis of the structure of subformulas of formulas
which are translations of XPath queries. However, this intermediate step, with
its exponential blowup, seems to preclude any possibility of providing efficient
implementations of XPath-to-nondeterministic automata translations, as very
little of the original XPath query can still be seen and used in its translation.
     Thus, we would like to have a direct translation from XPath queries into
nondeterministic tree automata. This is the goal of this paper. We provide such
a translation; in fact, we do a bit more. Since XPath node queries select nodes
from documents, we provide a translation into query automata [20, 22] which not
only accept or reject trees, but also select nodes from them. For example, for
the task of checking containment of queries, this will give us not just a yes/no
answer, but a concise description of all counterexamples to containment.
     The translation will be single-exponential: in case of nondeterministic au-
tomata, one cannot do better than this, as LTL fragments requiring this com-
plexity are easily coded in XPath. As the exact language, we use Conditional
XPath of [19], which is FO-complete; it extends the standard XPath with an
analog of the temporal ‘until’ operator.
     The translation follows the spirit of the Vardi-Wolper translation, and thus
one can expect that optimization techniques developed for the latter (e.g., those
in [8, 30, 12]) can be adapted for our construction.
Organization In Section 2 we describe XML documents (unranked trees), au-
tomata for them, and query automata. In Section 3 we give the syntax and se-
mantics of (conditional) XPath. The translation from XPath to query automata
is given in Section 4. Its correctness is shown in Section 5.


2   Preliminaries
XML documents as unranked trees XML documents are normally ab-
stracted as labeled unranked trees [21, 14, 27]. Nodes in unranked trees are ele-
ments of N∗ , i.e. strings of natural numbers. We write s · s′ for the concatenation
of strings, and ε for the empty string. The basic binary relations on N∗ are:
 – the child relation: s ≺ch s′ if s′ = s · i, for some i ∈ N, and
 – the next-sibling relation: s′ ≺ns s′′ if s′ = s · i and s′′ = s · (i + 1) for some
   s ∈ N∗ and i ∈ N.
The descendant relation ≺∗ch and the younger sibling relation ≺∗ns are the
reflexive-transitive closures of ≺ch and ≺ns .
     An unranked tree domain D is a finite prefix-closed subset of N∗ such that
s · i ∈ D implies s · j ∈ D for all j < i. If Σ is a finite alphabet, a Σ-labeled
unranked tree is a pair T = (D, λ), where D is a tree domain and λ is a labeling
function λ : D → Σ.
Unranked tree automata A nondeterministic unranked tree automaton
(cf. [21, 27]) over Σ-labeled trees is a triple A = (Q, F, δ) where Q is a finite
                                                                                        ∗
set of states, F ⊆ Q is the set of final states, and δ is a mapping Q × Σ → 2Q
such that each δ(q, a) is a regular language over Q. We assume that each δ(q, a)
is given as an NFA (nondeterministic finite automaton). A run of A on a tree
T = (D, λ) is a function ρA : D → Q such that if s ∈ D is a node with n children,
and λ(s) = a, then the string ρA (s · 0) · · · ρA (s · (n − 1)) is in δ(ρA (s), a). Thus,
if s is a leaf labeled a, then ρA (s) = q implies that ε ∈ δ(q, a). A run is accepting
if ρA (ε) ∈ F , and a tree is accepted by A if an accepting run exists. Sets of trees
accepted by automata A are called regular and denoted by L(A).
     Unranked tree automata are relevant in the context of schemas for XML
document. There are multiple notions of schemas. What is common for such
notions is that their structural aspects are subsumed by unranked tree automata,
see [17] for several examples. More, translations from various schema formalisms
into automata are usually very effective [17], and thus automata are naturally
viewed as an abstraction of schemas in the XML literature. So when we speak of
XML schemas, we shall assume that they are given by unranked tree automata.

Query automata It is well known that automata capture the expressiveness
of MSO (monadic second order logic) sentences over finite and infinite strings
and trees [28]. The model of query automata [22] captures the expressiveness of
MSO formulae ϕ(x) with one free first-order variable – that is, MSO-definable
unary queries. We present here a nondeterministic version, as in [20, 9].
    A query automaton (QA) for Σ-labeled unranked trees is a tuple QA =
(Q, F, Qs , δ), where (Q, F, δ) is a nondeterministic unranked tree automaton,
and Qs ⊆ Q is the set of selecting states. The runs of QA on a tree T are defined
as the runs of (Q, F, δ) on T . Each run ρ of QA on a tree T = (D, λ) defines
the set Sρ (T ) = {s ∈ D | ρ(s) ∈ Qs } of nodes assigned to a selecting state. The
unary query defined by QA is then, under the existential semantics,
                        [
            QA∃ (T ) =    {Sρ (T ) | ρ is an accepting run of QA on T }.

Alternatively,
T                 one can define QA∀ (T ) under the universal semantics as
  {Sρ (T ) | ρ is an accepting run of QA on T }. Both semantics capture the class
of unary MSO queries [20].
    To eliminate the need to choose the semantics, and problems related to it (for
example, the most natural constructing for the complement of a QA under the
existential semantics produces a QA under the universal semantics), one can use
single-run query automata. Such QA satisfy the conditions that for every tree T ,
and every two accepting runs ρ1 and ρ2 on T , we have Sρ1 (T ) = Sρ2 (T ). For such
QAs, we can unambiguously define the set of selected nodes as QA(T ) = Sρ (T ),
where ρ is an arbitrarily chosen accepting run.
    This is not a restriction in terms of the expressiveness, as the following result
is known:
Proposition 1. (see [9, 23, 24, 15]) For every query automaton there exists an
equivalent single-run query automaton.
    It is known that nonemptiness problem for QAs (single-run, or under the
existential semantics) is solvable in polynomial time. Single-run QAs are easily
closed under intersection: the usual product construction works. Moreover, if
one takes a product A × QA of a tree automaton and a single-run QA (where
selecting states are pairs containing a selecting state of QA), the result is a
single-run QA, and the nonemptiness problem for it is solvable in polynomial
time too. This makes single-run QAs convenient for reasoning tasks, as outlined
in the introduction.


3    Conditional XPath queries

We present a first-order complete extension of XPath, called conditional XPath,
or CXPath [19]. To keep notations in translations manageable, we introduce very
minor modifications to the syntax: we use an existential quantifier E instead of
the usual XPath node test brackets [ ], and we use symbols for the main XPath
axes, rather then their verbose analogs.
   The node formulae α and path formulae β of CXPath are given by:

                      α, α′    :=   a   | ¬α     | α ∨ α′     |   Eβ
           ′                               ∗                 ∗
    β, β       :=   ?α |   step | step         | (step/?α)        | β/β ′   |   β ∨ β′
where step ∈ {≺ch , ≺−             −
                       ch , ≺ns , ≺ns }.
     Given a tree T = (D, λ), the semantics of a node formula is a set of nodes
[[α]]T ⊆ D, and the semantics of a path formula is a binary relation [[β]]T ⊆ D×D
given by the following rules. We use R∗ to denote the reflexive-transitive closure
of the relation R, and π1 (R) to denote its first projection. The semantics is
formally defined as follow:


[[a]]T = {s ∈ D | λ(s) = a}         [[?α]]T = {(s, s) | s ∈ [[α]]T }
[[¬α]]T = D − [[α]]T                [[step]]T = {(s, s′ ) | s, s′ ∈ D and (s, s′ ) ∈ step}
[[α ∨ α′ ]]T = [[α]]T ∪ [[α′ ]]T    [[β ∨ β ′ ]]T = [[β]]T ∪ [[β ′ ]]T
[[Eβ]]T = π1 ([[β]]T )              [[step∗ ]]T = [[step]]∗T
                                    [[β/β ′ ]]T = [[β]]T ◦ [[β ′ ]]T
                                    [[(step/?α)∗ ]]T = [[(step/?α)]]∗T

     CXPath node formulae α define unary queries: such a query selects the set
[[α]]T from a tree T . It is known that these capture precisely unary FO queries
on trees [19]. Hence, they can be translated into query automata, although the
standard translation from FO into automata necessarily involves non-elementary
blowup. The following was shown in [15]:

Theorem 1. Every CXPath node formula α can be translated, in exponential
time, into an equivalent single-run query automaton (of exponential size).
    The translation went in two steps:
 1. First, CXPath node formulae were translated, in single-exponential time,
    into formulae of a temporal logic TLtree . Those formulae could be of size
    exponential in the size of the original CXPath formula.
 2. TLtree formulae were translated, again in single-exponential time, into
    exponential-size QAs.
While this appears to give a double-exponential algorithm, a careful analysis of
the translations showed that it did not occur, i.e., the state-space of the resulting
QA was always single-exponential in the size of the original CXPath formula.
    Still, as we explained before, it is undesirable to have such a two-stage trans-
lation, as it practically eliminates any chance of adapting algorithmic techniques
that have been developed for the LTL-to-automata translation. So now we pro-
vide a direct translation for CXPath to automata.


4    CXPath-to-query automata translation
The direct translation from CXPath α queries into single-run QAs (Q, δ, F, Qs )
is done in the following steps:
 1. First, we define the set sub(α) of subformulae of α (analog of the Fischer-
    Ladner closure).
 2. Then we let the set Q be a subset of 2sub(α) satisfying certain properties.
 3. Then we define the set of final states.
 4. Selection states are those states q ⊆ sub(α) that contain α.
 5. Finally, we define the transition function: this is the most involved part of
    the translation.
Intuitively the automaton is such that there is exactly one accepting run for each
tree which labels each node by the sets of all subformula from sub(α) it satisfies.

Subformulae of a CXP formula As standard in translations from logic to
automata, we push all the negations down the formula in a way that they are all
in front of either an atom (a) or a path formula (Eβ), so we assume all formulae
are in such a form (we push negations inside subformulae as well).
We define, for a given node formula α, its subformulae sub(α) as follows:

    sub(a) = {a}                    sub(α ∧ α′ ) = {α ∧ α′ } ∪ sub(α) ∪ sub(α′ )
    sub(¬α) = {¬α} ∪ sub(α)         sub(α ∨ α′ ) = {α ∨ α′ } ∪ sub(α) ∪ sub(α′ )
    sub(step) = {step}              sub(β ∨ β ′ ) = {β ∨ β ′ } ∪ sub(β) ∪ sub(β ′ )
    sub(step∗ ) = {step∗ }          sub(β/β ′ ) = {β/β ′ } ∪ sub(β) ∪ sub(β ′ )
    sub(?α) = {?α} ∪ sub(α)         sub((step/?α)∗ ) = {(step/?α)∗ } ∪ sub(α)
    sub(Eβ) = {Eβ} ∪ sub(β)

   In the following, we refer to step, step∗ , and (step/?α)∗ as navigational
connectives.
States Now fix a node formula α0 . The set of states of the query automaton
QAα0 is Q ⊂ 2sub(α0 ) where each q ∈ Q satisfies the following conditions for
each formula γ ∈ sub(α0 ):
 – If γ = α1 ∨ α2 , then γ ∈ q iff either α1 ∈ q or α2 ∈ q.
 – If γ = α1 ∧ α2 , then γ ∈ q iff both α1 ∈ q and α2 ∈ q.
 – If γ = β1 ∨ β2 , then γ ∈ q iff either β1 ∈ q or β2 ∈ q.
 – If γ =?α, then γ ∈ q iff α ∈ q.
 – If γ = Eβ, then γ ∈ q iff β ∈ q.
 – If γ = β1 /β2 , where β1 does not contain any navigational connectives, then
   γ ∈ q iff β1 ∈ q and β2 ∈ q.
 – If γ is a node formula of the form a or Eβ, then γ ∈ q iff ¬γ ∈/ q.
 – If γ = β2 /β1 where β2 is step∗ or (step/?α)∗ , then β1 ∈ q implies β2 /β1 ∈ q.
 – If γ = step∗ or γ = (step/?α)∗ then γ ∈ q.
 – Each state q contains exactly one formula of the form a.

Final states The set of final states F ⊂ Q is the set of states q such that for
                         −
any step ∈ {≺ns , ≺−
                   ns , ≺ch }:
 – q does not contain any formula of the form step or step/β.
 – If q contains a formula of the form β ′ /β, where β ′ is of the form step∗ or
   (step/?α)∗ then q also contains β.

Selecting states The selecting states Qs are simply the states containing α0 .
In general, we denote by Qu the set of states containing a certain path or node
formula u. Thus Qs = Qα0 .

Transition function The transition function δ is defined as follows, for each
q ∈ Q and a ∈ Σ :

                      δ(q, a) = Lns ∩ Lup (q) ∩ Ldown (q, a)
where the languages Ldown (q, a), Lup (q), and Lns are to be defined shortly.
Intuitively the language Lns checks horizontal constraints. It is independent
from q and a and describes valid sequences of siblings w.r.t. the formula α0 .
The language Lup (q) is used to check upward conditions. It ensures that every
child of a node assigned to q is assigned to a state such that its upward formula
are compatible with q. Finally, the language Ldown (q, a) is used to check local
and downward constraints. For each node labeled by a and assigned to a state
q it ensures that the labelling of its children is compatible with the downward
formula of q. It also checks that a is compatible with q.

The language Lns of horizontal conditions Lns is the set of words from
q0 q1 · · · qn−1 qn for n ∈ N which respects the following constraints.
 – Conditions for the leftmost state q0
    • q0 does not contain any formula of the form ≺−   ns /β
    • From each formula β ′ /β ∈ sub(α0 ) where β ′ is either ≺−∗      −      ∗
                                                               ns or (≺ns /?α) ,
                  ′
      β ∈ q0 iff β /β ∈ q0 .
 – Conditions for the rightmost state qn
    • qn does not contain any formula of the form ≺ns /β
    • From each formula β ′ /β ∈ sub(α0 ) where β ′ is either ≺∗ns or (≺ns /?α)∗ ,
      β ∈ qn iff β ′ /β ∈ qn .
 – Conditions for two consecutive states qi , qi+1 .
    • For any path formula β in qi
        ∗ If β =≺ns /β ′ then β ′ ∈ qi+1 .
        ∗ If β =≺∗ns /β ′ then either β ′ ∈ qi or ≺∗ns /β ′ ∈ qi+1 .
        ∗ If β = (≺ns /?α)∗ /β ′ then either β ′ ∈ qi or (≺ns /?α)∗ /β ′ ∈ qi+1
          and α ∈ qi+1 .
    • For any path formula β in qi+1
        ∗ If β =≺−        ′        ′
                    ns /β then β ∈ qi .
        ∗ If β =≺ns /β then either β ′ ∈ qi+1 or ≺−∗
                    −∗     ′                                    ′
                                                         ns /β ∈ qi .
        ∗ If β = (≺ns /?α) /β then either β ∈ qi+1 or (≺−
                       −        ∗  ′                ′                     ∗   ′
                                                                   ns /?α) /β ∈ qi
          and α ∈ qi .
    • The set sub(α0 ) − qi does not contains any formula of the form ≺ns or
      ≺∗ns or (≺ns /?α)∗ .
    • For any formula β in sub(α0 ) − qi
        ∗ If β =≺ns /β ′ then β ′ ∈  / qi+1 .
        ∗ If β =≺∗ns /β ′ then β ′ ∈ / qi and ≺∗ns /β ′ ∈
                                                        / qi+1 .
        ∗ If β = (≺ns /?α)∗ /β ′ then β ′ ∈   / qi and either (≺ns /?α)∗ /β ′ ∈
                                                                              / qi+1
          or α ∈/ qi+1 .
    • The set sub(α0 ) − qi+1 does not contains any formula of the form ≺−         ns
      or ≺−∗          −
           ns or (≺ns /?α) .
                               ∗

    • For any formula β in sub(α0 ) − qi+1
        ∗ If β =≺−        ′
                    ns /β then β ∈
                                   ′
                                     / qi .
        ∗ If β =≺ns /β then β ′ ∈
                    −∗     ′
                                     / qi+1 and ≺−∗       ′
                                                    ns /β ∈ / qi .
                       −       ∗  ′         ′
        ∗ If β = (≺ns /?α) /β then β ∈        / qi+1 and either (≺−       ∗
                                                                   ns /?α) /β ∈
                                                                               ′
                                                                                 / qi
          or α ∈/ qi .
It is easy to build a DFA that checks all the conditions and thus recognises Lns .

The language Lup (q) of upward conditions We define Lup (q) = Qup (q)∗ ,
where Qup (q) is the set of states that are legal children of q wrt formulae of the
form β or β/β ′ where β is ≺−       −∗       −
                              ch , ≺ch or (≺ch /?α) .
                                                     ∗
                   ′
Formally, a state q belongs to Qup (q) if it satisfies the following conditions:
 – For each path formula β in q ′
    • If β =≺−      ′         ′
              ch /β then β ∈ q.
    • If β =≺−∗
              ch  /β ′
                       then either   β ′ ∈ q ′ or ≺−∗      ′
                                                   ch /β ∈ q.
    • If β = (≺ch /?α) /β then either β ∈ q or (≺−
                −         ∗ ′                  ′    ′               ∗   ′
                                                             ch /?α) /β ∈ q and α ∈ q.
 – The set sub(α0 ) − q does not contains any formula of the form ≺−
                        ′                                                             −∗
                                                                               ch or ≺ch
        −      ∗
   or (≺ch /?α) .
 – For each path formula β in sub(α0 ) − q ′
    • If β =≺−      ′
              ch /β then β ∈
                              ′
                                / q.
              −∗
    • If β =≺ch /β then β ′ ∈
                     ′
                                / q ′ and ≺−∗  ch /β ∈
                                                      ′
                                                        / q.
                −         ∗ ′
    • If β = (≺ch /?α) /β then β ∈    ′
                                         / q and either (≺−
                                            ′                         ∗   ′
                                                               ch /?α) /β ∈ / q or α ∈
                                                                                     / q.
The language Ldown (q, a) of local and downward conditions Let
subdown (α0 ) be is the set of formulae from sub(α0 ) of the form either β or
β/β ′ where β is ≺ch , ≺∗ch or (≺ch /?α)
                                    T .
                                        ∗

We define Ldown (q, a) = L(a, q, a)∩ u∈subdown (α0 ) L(u, q, a) where each L(u, q, a)
is defined as follows:
 – Propositional letters
    • L(a, q, a) = Q∗ if a ∈ q and ∅ otherwise.
 – Path conditions when u ∈ q
    • If u =≺ch then L(u, q, a) = Q+ .
    • If u =≺∗ch or u = (≺ch /?α)∗ then L(u, q, a) = Q∗ .
    • If u =≺ch /β ′ then L(u, q, a) = Q∗ Qβ ′ Q∗ .
    • If u =≺∗ch /β ′ then either β ′ ∈ q or L(u, q, a) = Q∗ Q≺∗ch /β ′ Q∗ .
    • If u = (≺ch /?α)∗ /β ′ then either β ′ ∈ q
      or L(u, q, a) = Q∗ (Q(≺ch /?α)∗ /β ′ ∩ Qα )Q∗ .
 – Path conditions when u ∈  /q
    • If u =≺ch then L(u, q, a) = ǫ.
    • If u =≺∗ch or u = (≺ch /?α)∗ then L(u, q, a) = ∅.
    • If u =≺ch /β ′ then L(u, q, a) = Q∗ Qβ ′ Q∗ .
    • If u =≺∗ch /β ′ then either β ′ ∈ q and L(u, q, a) = ∅, or β ′ ∈       / q and
      L(u, q, a) = Q∗ Q≺∗ch /β ′ Q∗ .
    • If u = (≺ch /?α)∗ /β ′ then either β ′ ∈ q and L(u, q, a) = ∅, or β ′ ∈ / q and
      L(u, q, a) = Q∗ (Q(≺ch /?α)∗ /β ′ ∩ Qα )Q∗ .

5   Correctness of the construction
In this section, we prove that the correctness of our construction.
Theorem 2. Let α0 be a CXPath node formula.
1. The automaton QAα0 is a single-run query automaton such that
   [[α0 ]]T = QAα0 (T ) for every tree T .

2. The automaton QAα0 is of size exponential in the size of α0 and can be
   constructed in single-exponential time
    The second point of the theorem follows directly from the construction given
in the previous section. The first point of Theorem 2 follows from two proposi-
tions, presented below.

Consistency and Maximality We show that an accepting run of the au-
tomaton labels each node of the input tree by a state which represents exactly
all subformulae of the original formula which are true at the node. As an imme-
diate corollary, the automaton has at most one accepting run per tree, and that
those accepting runs select the same nodes as the CXPath formula.
    First, we show that each accepting run of the automaton is both consistent
and maximal in the following sense. Recall that our trees are of the form T =
(D, λ), where D is the domain, and λ is the labelling function. Throughout the
section, we fix a node CXPath formula α0 .
Definition 1 (Consistency). Let ρ be a labelling function D → 2sub(α0 ) , for
a given tree T (D, λ) and a given node formula α0 . We say that ρ is consistent
if, for each node s ∈ D, the following is true:
 – for each node formula α ∈ ρ(s), we have s ∈ [[α]]T ;
 – for each path formula β ∈ ρ(s), there is at least one pair (s, s′ ) in [[β]]T .

Definition 2 (Maximality). Let ρ be a labelling function D → 2sub(α0 ) , for a
given tree T (D, λ) and a given node formula α0 . We say that ρ is maximal if,
for each node s ∈ D, the following is true:
 – for each node formula α ∈ sub(α0 ), if s ∈ [[α]]T , then α ∈ ρ(s);
 – for each path formula β ∈ sub(α0 ), if [[β]]T contains at least one pair of the
   form (s, s′ ), then β ∈ ρ(s).

Proposition 2 (Consistency and maximality). If ρ is an accepting run of
QAα0 on a given tree T and for a given node formula α0 , then ρ is both consistent
and maximal.

    This proposition is proved by a mutual induction on the structure of node
and path formulae occurring in sub(α0 ). We present a few sample cases below;
full details are in the appendix.
    Let s be a node of the tree, a its label, α a node formula, and β a path
formula. We suppose, in the cases of consistency, that both α and β are in ρ(s),
and in the cases of maximality, that both α and β are true at s. In the following
we denote by basic constraints, the constraints given in the definition of the
states of the automaton in Section 4.

 – Consistency: some base cases.
    • If α = a′ then a′ = a so α is true at s.(Otherwise, a ∈/ ρ(s) as each states
      contains at most one propositional letter so L(a, ρ(s), a) is empty, and
      thus ρ cannot be an accepting run.)
    • If β =≺ch then L(≺ch , ρ(s), a) = Q+ , thus s has a least one child, so β
      is true at s.
    • If β =≺−  ch then the only case where β might not be true is at the root.
      But this cannot happen, as any state containing ≺−   ch cannot be final. So
      β is true at s.
 – Maximality: some base cases.
    • If α = a′ then a′ = a, because α is true at s. If α ∈          / ρ(s) then
      L(a, ρ(s), a) = ∅ and ρ cannot be accepting. Thus, α ∈ ρ(s).
    • If β =≺ch then s is not a leaf. If β ∈  / ρ(s) then L(β, ρ(s), a) = ǫ which
      contradicts the fact that s is not a leaf. Thus, β ∈ ρ(s).
    • If β =≺−   ch then s is not the root, which implies that s has a parent
      s′ , labelled by the state ρ(s′ ). Then, we have ρ(s) ∈ Lup (ρ(s′ )), and all
      states in Lup contain β. Thus, β ∈ ρ(s).
 – Consistency: some induction cases.
    • If α = ¬α′ then, because of basic constraints, α′ ∈/ ρ(s). By the induction
      hypothesis for maximality, α′ cannot be true at s. So α is true at s.
    • If α = α1 ∨ α2 then, because of basic constraints, either α1 or α2 is in
      ρ(s). Then, by the induction hypothesis for consistency, either α1 or α2
      is true at s. So α is true at s.
    • If α = Eβ then, because of basic constraints, β ∈ ρ(s). By the induction
      hypothesis for consistency, β is true at s, i.e. contains at least one element
      when evaluated starting from s. So α is true at s.
    • If β =?α then, because of basic constraints, ?α ∈ ρ(s). By the induction
      hypothesis for consistency, α is true at s. So the interpretation of β
      contains at least (s, s), hence β is satisfied at s.
 – Maximality: some induction cases.
    • If α = ¬α′ then, α′ cannot be true at s. By the induction hypothe-
      sis for consistency, this means that α′ ∈   / ρ(s). Then, because of basic
      constraints, α ∈ ρ(s).
    • If α = α1 ∨ α2 then either α1 or α2 is true at s. By the induction
      hypothesis for maximality, either α1 or α2 is in ρ(s). So, because of basic
      constraints, α ∈ ρ(s).
    • If α = Eβ then the interpretation of β contains (s, s′ ) for some s′ , which
      means that β is true at s. By the induction hypothesis, β ∈ ρ(s). So,
      because of basic constraints, α ∈ ρ(s).
    • If β =?α then the interpretation of β contains (s, s) which means that α
      is true at s. By the induction hypothesis, α ∈ ρ(s). So, because of basic
      constraints, β ∈ ρ(s).

Accepting Runs We have proved so far that the accepting runs of the automa-
ton were consistent and maximal labellings, which means that the automaton
accepts at most the same trees than the original CXPath formula. We now prove
that each consistent and maximal labelling of a tree is an accepting run of the
automaton.
Proposition 3 (Accepting Runs). Let ρ be a maximal and consistent la-
belling of the nodes of a given tree T with respect to a given formula α0 . Then ρ
is an accepting run of QAα0 .
   This shows that the QA we constructed has indeed at most one accepting run
per tree, and that during this run, a node gets labelled by a state which contains
α0 iff α0 is true at that node. Hence, the automaton selects all the nodes which
are also selected by α0 , and this concludes the proof of Theorem 2.
Acknowledgements Most of this work was done while all three authors were
at the University of Edinburgh, supported by EPSRC grant G049165 and FET-
Open Project FoX, grant agreement 233599.

References
 1. S. Abiteboul, B. Cautis, T. Milo. Reasoning about XML update constraints. In
    PODS’07, pages 195–204.
 2. M. Arenas, W. Fan, L. Libkin. Consistency of XML specifications. In Inconsistency
    Tolerance, Springer, 2005, pages 15–41.
 3. M. Benedikt, W. Fan, F. Geerts. XPath satisfiability in the presence of DTDs. In
    PODS’05, pages 25–36.
 4. M. Bojanczyk, C. David, A. Muscholl, Th. Schwentick, L. Segoufin. Two-variable
    logic on data trees and XML reasoning. In PODS’06, pages 10–19.
 5. D. Calvanese, G. De Giacomo, M. Lenzerini, M. Y. Vardi. Regular XPath: con-
    straints, query containment and view-based answering for XML documents. In
    Logic in Databases, 2008.
 6. E. Clarke, O. Grumberg, D. Peled. Model Checking, MIT Press, 1999.
 7. R. Clark. Querying streaming XML using visibly pushdown automata. Technical
    Report, University of Illinois, 2008.
 8. M. Daniele, F. Giunchiglia, M.Y. Vardi. Improved automata generation for linear
    temporal logic. In CAV’99, pages 249–260.
 9. M. Frick, M. Grohe, C. Koch. Query evaluation on compressed trees. In LICS’03,
    pages 188-197.
10. P. Genevés and N. Layaida. A system for the static analysis of XPath. ACM TOIS
    24, 2006, 475–502.
11. P. Genevés, N. Layaida, and A. Schmitt. Efficient static analysis of XML paths
    and types. In ACM SIGPLAN Conf. on Programming Language Design and Im-
    plementation, 2007, pages 342–351.
12. R. Gerth, D. Peled, M. Vardi, and P. Wolper. Simple on-the-fly automatic verifi-
    cation of linear temporal logic. In PSTV 1995, pages 3–18.
13. A. K. Gupta, D. Suciu Stream processing of XPath queries with predicates. In
    SIGMOD 2003, pages 419–430.
14. L. Libkin. Logics for unranked trees: an overview. In ICALP’05, pages 35-50.
15. L. Libkin, C. Sirangelo. Reasoning about XML with temporal logics and automata.
    Journal of Applied Logic 8 (2010), 210–232.
16. S. Maneth, T. Perst, H. Seidl. Exact XML type checking in polynomial time. In
    ICDT 2007, pages 254–268.
17. W. Martens, F. Neven, Th. Schwentick. Simple off the shelf abstractions for XML
    schema. SIGMOD Record 36(3): 15–22 (2007).
18. M. Marx. XPath with conditional axis relations. In EDBT 2004, pages 477–494.
19. M. Marx. Conditional XPath. ACM TODS 30 (2005), 929–959.
20. F. Neven. Design and Analysis of Query Languages for Structured Documents.
    PhD Thesis, U. Limburg, 1999.
21. F. Neven. Automata, logic, and XML. In CSL 2002, pages 2–26.
22. F. Neven, Th. Schwentick. Query automata over finite trees. TCS, 275 (2002),
    633–674.
23. F. Neven, J. Van den Bussche. Expressiveness of structured document query lan-
    guages based on attribute grammars. J. ACM 49(1): 56–100 (2002).
24. J. Niehren, L. Planque, J.-M. Talbot, S. Tison. N-ary queries by tree automata.
    In DBPL 2005, pages 217–231.
25. B.-H. Schlingloff. Expressive completeness of temporal logic of trees. Journal of
    Applied Non-Classical Logics 2 (1992), 157–180.
26. Th. Schwentick. XPath query containment. SIGMOD Record 33 (2004), 101–109.
27. Th. Schwentick. Automata for XML – a survey. JCSS 73 (2007), 289–315.
28. W. Thomas. Languages, automata, and logic. In Handbook of Formal Languages,
    volume III, pages 389–455.
29. M. Y. Vardi. An automata-theoretic approach to linear temporal logic. Banff
    Higher Order Workshop, 1996.
30. M. Y. Vardi and P. Wolper. Reasoning about infinite computations. Inf.& Comput.
    115 (1994), 1–37.