<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Direct Translation from XPath to Nondeterministic Automata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadime Francis</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire David</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonid Libkin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ Paris-Est Marne-la-Vall ́ee</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ of Edinburgh</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Since navigational aspects of XPath correspond to first-order definability, it has been proposed to use the analogy with the very successful technique of translating LTL into automata, and produce efficient translations of XPath queries into automata on unranked trees. These translations can then be used for a variety of reasoning tasks such as XPath consistency, or optimization, under XML schema constraints. In the verification scenarios, translations into both nondeterministic and alternating automata are used. But while a direct translation from XPath into alternating automata is known, only an indirect translation into nondeterministic automata - going via intermediate logics - exists. A direct translation is desirable as most XML specifications have particularly nice translations into nondeterministic automata and it is natural to use such automata to reason about XPath and schemas. The goal of the paper is to produce such a direct translation of XPath into nondeterministic automata.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        XML reasoning tasks have been addressed in many recent papers. Typical
problems include consistency of type declarations and constraints [
        <xref ref-type="bibr" rid="ref1 ref16 ref2">1, 2, 16</xref>
        ], or of
schema specifications and navigational properties [
        <xref ref-type="bibr" rid="ref11 ref3">3, 11</xref>
        ] or containment of XPath
expressions [
        <xref ref-type="bibr" rid="ref10 ref18 ref26 ref4 ref5">4, 5, 10, 18, 26</xref>
        ]. Static analysis tasks are often addressed using logic
and automata-based techniques [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Thus, due to the well-known correspondences
between XML formalisms and logics and automata on trees, it is natural to apply
logic and automata techniques to reasoning about XML.
      </p>
      <p>Indeed, several attempts have been made to find generic logic/automata tools
applicable in a variety of XML reasoning tasks. The idea of such approaches is
roughly as follows. Suppose the task at hand is the well-known XPath
containment problem under DTDs: given a DTD d and two XPath expressions e1 and
e2, is it true that d |= e1 ⊆ e2? In other words, is it true that for each XML tree
T that conforms to d, the expression e1 selects a subset of nodes selected by e2?
Suppose we can somehow capture by an automaton Ae the set of nodes selected
by the expression e1 ∧ ¬e2, i.e., counterexamples to containment. Schemas such
as DTDs, Relax NG, and others, are easily translated into tree automata (in
fact most designs of schemas for XML are based on tree automata). So assume
that we have an automaton Ad that accepts trees that conform to d. Then to
see whether d |= e1 ⊆ e2 one simply checks whether</p>
      <p>L(Ad × Ae) 6= ∅
since a tree in L(Ad × Ae) is a counterexample to d |= e1 ⊆ e2. Throughout the
paper, L(A) is the language accepted by automaton A.</p>
      <p>
        This is essentially the same as the main idea of automata-based verification
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], whose goal is to check whether a program P satisfies a specification ϕ. One
views an abstraction of P as an automaton AP , constructs an automaton A¬ϕ
capturing counterexamples to the specification (i.e., program executions that do
not satisfy ϕ), and checks for nonemptiness of L(AP × A¬ϕ). If the language is
nonempty, it contains counterexamples to the specification, i.e., bugs in P .
      </p>
      <p>
        In the verification context, one most commonly deals with automata over
ωwords, as they properly capture traces of program behavior and specifications.
In the XML context, we deal with queries on finite trees, and finite (unranked)
tree automata. The idea of adapting verification techniques to static analysis of
XML is therefore very natural and it has been explored in a number of papers
[
        <xref ref-type="bibr" rid="ref10 ref11 ref13 ref15 ref18 ref5 ref7">5, 7, 10, 11, 13, 15, 18</xref>
        ].
      </p>
      <p>
        In the classical verification scenario, the specification logic is most often LTL
(linear temporal logic), which happens to have the same expressive power as
first-order logic (FO). In our context, we deal with XPath, whose navigational
capabilities are expressible in FO and which, with an addition of a certain
conditional construct, captures FO [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. If we want to adapt existing LTL-to-automata
translations, there are two main approaches to it: translating formulae into
nondeterministic, or alternating automata [
        <xref ref-type="bibr" rid="ref29 ref30">30, 29</xref>
        ]. The difference between these
approaches is where the complexity lies – in the translation itself, or in checking
nonemptiness of languages defined by automata:
      </p>
      <p>PPPAPutPomPaPtPa
Task
Translation
Nonemptiness checking</p>
      <p>Nondeterministic Alternating
exponential
polynomial
polynomial
exponential</p>
      <p>
        Both techniques have their advantages; the choice is really about the task
to which algorithmic and optimization techniques will be applied. In the XML
context, there are at least two reasons to seriously consider translating into
nondeterministic automata. First, most schema formalisms come from
nondeterministic tree automata, so these translations appear to be better suited for
reasoning tasks involving schemas. Second, one may hope to use a rich arsenal
of optimization tools developed for the original Vardi-Wolper translation [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
of LTL into nondeterministic Bu¨chi automata [
        <xref ref-type="bibr" rid="ref12 ref30 ref8">8, 30, 12</xref>
        ] and adapt it for XML
reasoning tasks.
      </p>
      <p>
        Both approaches have been explored in the XML context. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a direct
translation from XPath to a certain type of alternating automata on unranked
trees is given. In [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], translations into nondeterministic automata are explored,
but the logic used for translation is not XPath, but rather an LTL-like logic
on trees, first introduced in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and used later in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. While the translation
is single-exponential, this is not satisfactory since the translation from XPath
into that logic is single-exponential itself! Curiously enough, this does not make
the whole translation from XPath into automata double-exponential, but
showing this requires a careful analysis of the structure of subformulas of formulas
which are translations of XPath queries. However, this intermediate step, with
its exponential blowup, seems to preclude any possibility of providing efficient
implementations of XPath-to-nondeterministic automata translations, as very
little of the original XPath query can still be seen and used in its translation.
      </p>
      <p>
        Thus, we would like to have a direct translation from XPath queries into
nondeterministic tree automata. This is the goal of this paper. We provide such
a translation; in fact, we do a bit more. Since XPath node queries select nodes
from documents, we provide a translation into query automata [
        <xref ref-type="bibr" rid="ref20 ref22">20, 22</xref>
        ] which not
only accept or reject trees, but also select nodes from them. For example, for
the task of checking containment of queries, this will give us not just a yes/no
answer, but a concise description of all counterexamples to containment.
      </p>
      <p>
        The translation will be single-exponential: in case of nondeterministic
automata, one cannot do better than this, as LTL fragments requiring this
complexity are easily coded in XPath. As the exact language, we use Conditional
XPath of [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which is FO-complete; it extends the standard XPath with an
analog of the temporal ‘until’ operator.
      </p>
      <p>
        The translation follows the spirit of the Vardi-Wolper translation, and thus
one can expect that optimization techniques developed for the latter (e.g., those
in [
        <xref ref-type="bibr" rid="ref12 ref30 ref8">8, 30, 12</xref>
        ]) can be adapted for our construction.
      </p>
      <p>Organization In Section 2 we describe XML documents (unranked trees),
automata for them, and query automata. In Section 3 we give the syntax and
semantics of (conditional) XPath. The translation from XPath to query automata
is given in Section 4. Its correctness is shown in Section 5.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        XML documents as unranked trees XML documents are normally
abstracted as labeled unranked trees [
        <xref ref-type="bibr" rid="ref14 ref21 ref27">21, 14, 27</xref>
        ]. Nodes in unranked trees are
elements of N∗, i.e. strings of natural numbers. We write s · s′ for the concatenation
of strings, and ε for the empty string. The basic binary relations on N∗ are:
– the child relation: s ≺ch s′ if s′ = s · i, for some i ∈ N, and
– the next-sibling relation: s′ ≺ns s′′ if s′ = s · i and s′′ = s · (i + 1) for some
s ∈ N∗ and i ∈ N.
      </p>
      <p>The descendant relation ≺ch and the younger sibling relation ≺∗ns are the
∗
reflexive-transitive closures of ≺ch and ≺ns.</p>
      <p>An unranked tree domain D is a finite prefix-closed subset of N∗ such that
s · i ∈ D implies s · j ∈ D for all j &lt; i. If Σ is a finite alphabet, a Σ-labeled
unranked tree is a pair T = (D, λ), where D is a tree domain and λ is a labeling
function λ : D → Σ.</p>
      <p>
        Unranked tree automata A nondeterministic unranked tree automaton
(cf. [
        <xref ref-type="bibr" rid="ref21 ref27">21, 27</xref>
        ]) over Σ-labeled trees is a triple A = (Q, F, δ) where Q is a finit∗e
set of states, F ⊆ Q is the set of final states, and δ is a mapping Q × Σ → 2Q
such that each δ(q, a) is a regular language over Q. We assume that each δ(q, a)
is given as an NFA (nondeterministic finite automaton). A run of A on a tree
T = (D, λ) is a function ρA : D → Q such that if s ∈ D is a node with n children,
and λ(s) = a, then the string ρA(s · 0) · · · ρA(s · (n − 1)) is in δ(ρA(s), a). Thus,
if s is a leaf labeled a, then ρA(s) = q implies that ε ∈ δ(q, a). A run is accepting
if ρA(ε) ∈ F , and a tree is accepted by A if an accepting run exists. Sets of trees
accepted by automata A are called regular and denoted by L(A).
      </p>
      <p>
        Unranked tree automata are relevant in the context of schemas for XML
document. There are multiple notions of schemas. What is common for such
notions is that their structural aspects are subsumed by unranked tree automata,
see [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] for several examples. More, translations from various schema formalisms
into automata are usually very effective [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and thus automata are naturally
viewed as an abstraction of schemas in the XML literature. So when we speak of
XML schemas, we shall assume that they are given by unranked tree automata.
Query automata It is well known that automata capture the expressiveness
of MSO (monadic second order logic) sentences over finite and infinite strings
and trees [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. The model of query automata [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] captures the expressiveness of
MSO formulae ϕ(x) with one free first-order variable – that is, MSO-definable
unary queries. We present here a nondeterministic version, as in [
        <xref ref-type="bibr" rid="ref20 ref9">20, 9</xref>
        ].
      </p>
      <p>A query automaton (QA) for Σ-labeled unranked trees is a tuple QA =
(Q, F, Qs, δ), where (Q, F, δ) is a nondeterministic unranked tree automaton,
and Qs ⊆ Q is the set of selecting states. The runs of QA on a tree T are defined
as the runs of (Q, F, δ) on T . Each run ρ of QA on a tree T = (D, λ) defines
the set Sρ(T ) = {s ∈ D | ρ(s) ∈ Qs} of nodes assigned to a selecting state. The
unary query defined by QA is then, under the existential semantics,</p>
      <p>QA∃(T ) = [{Sρ(T ) | ρ is an accepting run of QA on T }.</p>
      <p>
        Alternatively, one can define QA∀(T ) under the universal semantics as
T{Sρ(T ) | ρ is an accepting run of QA on T }. Both semantics capture the class
of unary MSO queries [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>To eliminate the need to choose the semantics, and problems related to it (for
example, the most natural constructing for the complement of a QA under the
existential semantics produces a QA under the universal semantics), one can use
single-run query automata. Such QA satisfy the conditions that for every tree T ,
and every two accepting runs ρ1 and ρ2 on T , we have Sρ1 (T ) = Sρ2 (T ). For such
QAs, we can unambiguously define the set of selected nodes as QA(T ) = Sρ(T ),
where ρ is an arbitrarily chosen accepting run.</p>
      <p>
        This is not a restriction in terms of the expressiveness, as the following result
is known:
Proposition 1. (see [
        <xref ref-type="bibr" rid="ref15 ref23 ref24 ref9">9, 23, 24, 15</xref>
        ]) For every query automaton there exists an
equivalent single-run query automaton.
      </p>
      <p>It is known that nonemptiness problem for QAs (single-run, or under the
existential semantics) is solvable in polynomial time. Single-run QAs are easily
closed under intersection: the usual product construction works. Moreover, if
one takes a product A × QA of a tree automaton and a single-run QA (where
selecting states are pairs containing a selecting state of QA), the result is a
single-run QA, and the nonemptiness problem for it is solvable in polynomial
time too. This makes single-run QAs convenient for reasoning tasks, as outlined
in the introduction.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Conditional XPath queries</title>
      <p>
        We present a first-order complete extension of XPath, called conditional XPath,
or CXPath [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. To keep notations in translations manageable, we introduce very
minor modifications to the syntax: we use an existential quantifier E instead of
the usual XPath node test brackets [ ], and we use symbols for the main XPath
axes, rather then their verbose analogs.
      </p>
      <p>The node formulae α and path formulae β of CXPath are given by:
α, α′
:=</p>
      <p>a | ¬α | α ∨ α′ |
β, β′
:=
?α | step | step∗
| (step/?α)∗</p>
      <p>Eβ
| β/β′
| β ∨ β′
−
where step ∈ {≺ch, ≺ch, ≺ns, ≺n−s}.</p>
      <p>Given a tree T = (D, λ), the semantics of a node formula is a set of nodes
[[α]]T ⊆ D, and the semantics of a path formula is a binary relation [[β]]T ⊆ D ×D
given by the following rules. We use R∗ to denote the reflexive-transitive closure
of the relation R, and π1(R) to denote its first projection. The semantics is
formally defined as follow:
[[a]]T = {s ∈ D | λ(s) = a}
[[¬α]]T = D − [[α]]T
[[α ∨ α ]]T = [[α]]T ∪ [[α′]]T</p>
      <p>′
[[Eβ]]T = π1([[β]]T )
[[?α]]T = {(s, s) | s ∈ [[α]]T }
[[step]]T = {(s, s′) | s, s′ ∈ D and (s, s′) ∈ step}
[[β ∨ β′]]T = [[β]]T ∪ [[β′]]T
[[step∗]]T = [[step]]∗T
[[β/β′]]T = [[β]]T ◦ [[β′]]T
[[(step/?α)∗]]T = [[(step/?α)]]∗T</p>
      <p>
        CXPath node formulae α define unary queries: such a query selects the set
[[α]]T from a tree T . It is known that these capture precisely unary FO queries
on trees [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Hence, they can be translated into query automata, although the
standard translation from FO into automata necessarily involves non-elementary
blowup. The following was shown in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
Theorem 1. Every CXPath node formula α can be translated, in exponential
time, into an equivalent single-run query automaton (of exponential size).
      </p>
      <p>The translation went in two steps:
1. First, CXPath node formulae were translated, in single-exponential time,
into formulae of a temporal logic TLtree. Those formulae could be of size
exponential in the size of the original CXPath formula.
2. TLtree formulae were translated, again in single-exponential time, into
exponential-size QAs.</p>
      <p>While this appears to give a double-exponential algorithm, a careful analysis of
the translations showed that it did not occur, i.e., the state-space of the resulting
QA was always single-exponential in the size of the original CXPath formula.</p>
      <p>Still, as we explained before, it is undesirable to have such a two-stage
translation, as it practically eliminates any chance of adapting algorithmic techniques
that have been developed for the LTL-to-automata translation. So now we
provide a direct translation for CXPath to automata.
4</p>
    </sec>
    <sec id="sec-4">
      <title>CXPath-to-query automata translation</title>
      <p>The direct translation from CXPath α queries into single-run QAs (Q, δ, F, Qs)
is done in the following steps:
1. First, we define the set sub(α) of subformulae of α (analog of the
Fischer</p>
      <p>Ladner closure).
2. Then we let the set Q be a subset of 2sub(α) satisfying certain properties.
3. Then we define the set of final states.
4. Selection states are those states q ⊆ sub(α) that contain α.
5. Finally, we define the transition function: this is the most involved part of
the translation.</p>
      <p>Intuitively the automaton is such that there is exactly one accepting run for each
tree which labels each node by the sets of all subformula from sub(α) it satisfies.
Subformulae of a CXP formula As standard in translations from logic to
automata, we push all the negations down the formula in a way that they are all
in front of either an atom (a) or a path formula (Eβ), so we assume all formulae
are in such a form (we push negations inside subformulae as well).
We define, for a given node formula α, its subformulae sub(α) as follows:
sub(a) = {a}
sub(¬α) = {¬α} ∪ sub(α)
sub(step) = {step}
sub(step∗) = {step∗}
sub(?α) = {?α} ∪ sub(α)
sub(Eβ) = {Eβ} ∪ sub(β)
sub(α ∧ α′) = {α ∧ α′} ∪ sub(α) ∪ sub(α′)
sub(α ∨ α′) = {α ∨ α′} ∪ sub(α) ∪ sub(α′)
sub(β ∨ β′) = {β ∨ β′} ∪ sub(β) ∪ sub(β′)
sub(β/β′) = {β/β′} ∪ sub(β) ∪ sub(β′)
sub((step/?α)∗) = {(step/?α)∗} ∪ sub(α)</p>
      <p>In the following, we refer to step, step∗, and (step/?α)∗ as navigational
connectives.</p>
      <p>States Now fix a node formula α0. The set of states of the query automaton
QAα0 is Q ⊂ 2sub(α0) where each q ∈ Q satisfies the following conditions for
each formula γ ∈ sub(α0):
– If γ = α1 ∨ α2, then γ ∈ q iff either α1 ∈ q or α2 ∈ q.
– If γ = α1 ∧ α2, then γ ∈ q iff both α1 ∈ q and α2 ∈ q.
– If γ = β1 ∨ β2, then γ ∈ q iff either β1 ∈ q or β2 ∈ q.
– If γ =?α, then γ ∈ q iff α ∈ q.
– If γ = Eβ, then γ ∈ q iff β ∈ q.
– If γ = β1/β2, where β1 does not contain any navigational connectives, then
γ ∈ q iff β1 ∈ q and β2 ∈ q.
– If γ is a node formula of the form a or Eβ, then γ ∈ q iff ¬γ ∈/ q.
– If γ = β2/β1 where β2 is step∗ or (step/?α)∗, then β1 ∈ q implies β2/β1 ∈ q.
– If γ = step∗ or γ = (step/?α)∗ then γ ∈ q.</p>
      <p>– Each state q contains exactly one formula of the form a.</p>
      <p>Final states The set of final states F ⊂ Q is the set of states q such that for
any step ∈ {≺ns, ≺n−s, ≺c−h}:
– q does not contain any formula of the form step or step/β.
– If q contains a formula of the form β′/β, where β′ is of the form step∗ or
(step/?α)∗ then q also contains β.</p>
      <p>Selecting states The selecting states Qs are simply the states containing α0.
In general, we denote by Qu the set of states containing a certain path or node
formula u. Thus Qs = Qα0.</p>
      <p>Transition function The transition function δ is defined as follows, for each
q ∈ Q and a ∈ Σ :</p>
      <p>δ(q, a) = Lns ∩ Lup(q) ∩ Ldown(q, a)
where the languages Ldown(q, a), Lup(q), and Lns are to be defined shortly.
Intuitively the language Lns checks horizontal constraints. It is independent
from q and a and describes valid sequences of siblings w.r.t. the formula α0.
The language Lup(q) is used to check upward conditions. It ensures that every
child of a node assigned to q is assigned to a state such that its upward formula
are compatible with q. Finally, the language Ldown(q, a) is used to check local
and downward constraints. For each node labeled by a and assigned to a state
q it ensures that the labelling of its children is compatible with the downward
formula of q. It also checks that a is compatible with q.</p>
      <p>The language Lns of horizontal conditions Lns is the set of words from
q0q1 · · · qn−1qn for n ∈ N which respects the following constraints.
– Conditions for the leftmost state q0
• q0 does not contain any formula of the form ≺n−s /β
• From each formula β′/β ∈ sub(α0) where β′ is either ≺n−s∗ or (≺n−s /?α)∗,
β ∈ q0 iff β′/β ∈ q0.
– Conditions for the rightmost state qn
• qn does not contain any formula of the form ≺ns /β
• From each formula β′/β ∈ sub(α0) where β′ is either ≺∗ns or (≺ns /?α)∗,
β ∈ qn iff β′/β ∈ qn.
– Conditions for two consecutive states qi, qi+1.</p>
      <p>• For any path formula β in qi
∗ If β =≺ns /β′ then β′ ∈ qi+1.
∗ If β =≺∗ns /β′ then either β</p>
      <p>′ ∈ qi or ′≺∈∗nsq/iβo′r∈( ≺qin+s1./?α)∗/β′ ∈ qi+1
∗ If β = (≺ns /?α)∗/β′ then either β</p>
      <p>and α ∈ qi+1.
• For any path formula β in qi+1
∗ If β =≺n−s /β′ then β′ ∈ q .
∗ If β =≺n−s∗ /β′ then eitheri β′ ∈ qi+1 or ≺n−s∗ /β′ ∈ qi.
∗ If β = (≺n−s /?α)∗/β′ then either β′ ∈ qi+1 or (≺n−s /?α)∗/β′ ∈ qi
and α ∈ qi.
• T≺hnse osert(≺sunbs(α/?0α))−∗.qi does not contains any formula of the form ≺ns or
∗
• For any formula β in sub(α0) − qi
∗ If β =≺ns /β′ then β′ ∈/ qi+1.
∗ If β =≺∗ns /β′ then β′ ∈/ qi and ≺∗ns /β′ ∈/ qi+1.
∗ If β = (≺ns /?α)∗/β′ then β′ ∈/ qi and either (≺ns /?α)∗/β′ ∈/ qi+1
or α ∈/ qi+1.
• The set sub(α0) − qi+1 does not contains any formula of the form ≺n−s
or ≺n−s∗ or (≺n−s /?α)∗.
• For any formula β in sub(α0) − qi+1
∗ If β =≺n−s /β′ then β′ ∈/ qi.
∗ If β =≺n−s∗ /β′ then β′ ∈/ qi+1 and ≺n−s∗ /β′ ∈/ qi.
∗ If β = (≺n−s /?α)∗/β′ then β′ ∈/ qi+1 and either (≺n−s /?α)∗/β′ ∈/ qi
or α ∈/ qi.</p>
      <p>It is easy to build a DFA that checks all the conditions and thus recognises Lns.
The language Lup(q) of upward conditions We define Lup(q) = Qup(q)∗,
where Qup(q) is the set of states that are legal children of q wrt formulae of the
form β or β/β′ where β is ≺ch, ≺c−h∗ or (≺c−h /?α)∗.</p>
      <p>−
Formally, a state q′ belongs to Qup(q) if it satisfies the following conditions:
– For each path formula β in q′</p>
      <p>−
• If β =≺c−h∗ /β′ then β′ ∈ q.
• If β =≺chc−h//β?′αt)h∗e/nβ′eitthheenr eβi′th∈erq′βo′r∈≺qc−′ho∗r/(β≺′ c−∈h q/.?α)∗/β′ ∈ q and α ∈ q.
• If β = (≺ −∗
– The set sub(α0) − q′ does not contains any formula of the form ≺c−h or ≺ch
or (≺c−h /?α)∗.
– For each path formula β in sub(α0) − q′
• If β =≺c−−h∗ /β′ then β′ ∈/ q.</p>
      <p>•• IIff ββ ==≺( ≺chc−h//β?′αt)h∗e/nβ′βt′h∈/enq′βa′n∈/dq≺′ac−hn∗d/eβi′th∈/erq.(≺c−h /?α)∗/β′ ∈/ q or α ∈/ q.
The language Ldown(q, a) of local and downward conditions Let
subdown(α0) be is the set of formulae from sub(α0) of the form either β or
β/β′ where β is ≺ch, ≺ch or (≺ch /?α)∗.</p>
      <p>∗
We define Ldown(q, a) = L(a, q, a)∩Tu∈subdown(α0) L(u, q, a) where each L(u, q, a)
is defined as follows:
– Propositional letters</p>
      <p>• L(a, q, a) = Q∗ if a ∈ q and ∅ otherwise.
– Path conditions when u ∈ q
• If u =≺ch then L(u, q, a) = Q+.
• If u =≺c∗h or u = (≺ch /?α)∗ then L(u, q, a) = Q∗.
• If u =≺ch /β′ then L(u, q, a) = Q∗Qβ′Q∗.
• If u =≺c∗h /β′ then either β′ ∈ q or L(u, q, a) = Q∗Q≺c∗h/β′Q∗.
• If u = (≺ch /?α)∗/β′ then either β′ ∈ q</p>
      <p>or L(u, q, a) = Q∗(Q(≺ch/?α)∗/β′ ∩ Qα)Q∗.
– Path conditions when u ∈/ q
• If u =≺ch then L(u, q, a) = ǫ.
• If u =≺ch or u = (≺ch /?α)∗ then L(u, q, a) = ∅.</p>
      <p>∗
• If u =≺ch /β′ then L(u, q, a) = Q∗Qβ′Q∗.
• If u =≺ch /β′ then either β′ ∈ q and L(u, q, a) = ∅, or β′ ∈/ q and
∗
• ILf(uu,=q,(a≺) c=h /Q?∗αQ)∗≺/c∗βh /′βt′hQe∗n. either β′ ∈ q and L(u, q, a) = ∅, or β′ ∈/ q and</p>
      <p>L(u, q, a) = Q∗(Q(≺ch/?α)∗/β′ ∩ Qα)Q∗.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Correctness of the construction</title>
      <p>In this section, we prove that the correctness of our construction.
Theorem 2. Let α0 be a CXPath node formula.
1. The automaton QAα0 is a single-run query automaton such that
[[α0]]T = QAα0 (T ) for every tree T .
2. The automaton QAα0 is of size exponential in the size of α0 and can be
constructed in single-exponential time</p>
      <p>The second point of the theorem follows directly from the construction given
in the previous section. The first point of Theorem 2 follows from two
propositions, presented below.</p>
      <p>Consistency and Maximality We show that an accepting run of the
automaton labels each node of the input tree by a state which represents exactly
all subformulae of the original formula which are true at the node. As an
immediate corollary, the automaton has at most one accepting run per tree, and that
those accepting runs select the same nodes as the CXPath formula.</p>
      <p>First, we show that each accepting run of the automaton is both consistent
and maximal in the following sense. Recall that our trees are of the form T =
(D, λ), where D is the domain, and λ is the labelling function. Throughout the
section, we fix a node CXPath formula α0.</p>
      <p>Definition 1 (Consistency). Let ρ be a labelling function D → 2sub(α0), for
a given tree T (D, λ) and a given node formula α0. We say that ρ is consistent
if, for each node s ∈ D, the following is true:
– for each node formula α ∈ ρ(s), we have s ∈ [[α]]T ;
– for each path formula β ∈ ρ(s), there is at least one pair (s, s′) in [[β]]T .
Definition 2 (Maximality). Let ρ be a labelling function D → 2sub(α0), for a
given tree T (D, λ) and a given node formula α0. We say that ρ is maximal if,
for each node s ∈ D, the following is true:
– for each node formula α ∈ sub(α0), if s ∈ [[α]]T , then α ∈ ρ(s);
– for each path formula β ∈ sub(α0), if [[β]]T contains at least one pair of the
form (s, s′), then β ∈ ρ(s).</p>
      <p>Proposition 2 (Consistency and maximality). If ρ is an accepting run of
QAα0 on a given tree T and for a given node formula α0, then ρ is both consistent
and maximal.</p>
      <p>This proposition is proved by a mutual induction on the structure of node
and path formulae occurring in sub(α0). We present a few sample cases below;
full details are in the appendix.</p>
      <p>Let s be a node of the tree, a its label, α a node formula, and β a path
formula. We suppose, in the cases of consistency, that both α and β are in ρ(s),
and in the cases of maximality, that both α and β are true at s. In the following
we denote by basic constraints, the constraints given in the definition of the
states of the automaton in Section 4.</p>
      <p>– Consistency: some base cases.</p>
      <p>• If α = a′ then a′ = a so α is true at s.(Otherwise, a ∈/ ρ(s) as each states
contains at most one propositional letter so L(a, ρ(s), a) is empty, and
thus ρ cannot be an accepting run.)
• If β =≺ch then L(≺ch, ρ(s), a) = Q+, thus s has a least one child, so β
is true at s.
• If β =≺c−h then the only case where β might not be true is at the root.</p>
      <p>But this cannot happen, as any state containing ≺c−h cannot be final. So
β is true at s.
– Maximality: some base cases.</p>
      <p>• If α = a′ then a′ = a, because α is true at s. If α ∈/ ρ(s) then</p>
      <p>L(a, ρ(s), a) = ∅ and ρ cannot be accepting. Thus, α ∈ ρ(s).
• If β =≺ch then s is not a leaf. If β ∈/ ρ(s) then L(β, ρ(s), a) = ǫ which
contradicts the fact that s is not a leaf. Thus, β ∈ ρ(s).</p>
      <p>−
• If β =≺ch then s is not the root, which implies that s has a parent
s′, labelled by the state ρ(s′). Then, we have ρ(s) ∈ Lup(ρ(s′)), and all
states in Lup contain β. Thus, β ∈ ρ(s).
– Consistency: some induction cases.</p>
      <p>• If α = ¬α′ then, because of basic constraints, α′ ∈/ ρ(s). By the induction
hypothesis for maximality, α′ cannot be true at s. So α is true at s.
• If α = α1 ∨ α2 then, because of basic constraints, either α1 or α2 is in
ρ(s). Then, by the induction hypothesis for consistency, either α1 or α2
is true at s. So α is true at s.
• If α = Eβ then, because of basic constraints, β ∈ ρ(s). By the induction
hypothesis for consistency, β is true at s, i.e. contains at least one element
when evaluated starting from s. So α is true at s.
• If β =?α then, because of basic constraints, ?α ∈ ρ(s). By the induction
hypothesis for consistency, α is true at s. So the interpretation of β
contains at least (s, s), hence β is satisfied at s.
– Maximality: some induction cases.</p>
      <p>• If α = ¬α′ then, α′ cannot be true at s. By the induction
hypothesis for consistency, this means that α′ ∈/ ρ(s). Then, because of basic
constraints, α ∈ ρ(s).
• If α = α1 ∨ α2 then either α1 or α2 is true at s. By the induction
hypothesis for maximality, either α1 or α2 is in ρ(s). So, because of basic
constraints, α ∈ ρ(s).
• If α = Eβ then the interpretation of β contains (s, s′) for some s′, which
means that β is true at s. By the induction hypothesis, β ∈ ρ(s). So,
because of basic constraints, α ∈ ρ(s).
• If β =?α then the interpretation of β contains (s, s) which means that α
is true at s. By the induction hypothesis, α ∈ ρ(s). So, because of basic
constraints, β ∈ ρ(s).</p>
      <p>Accepting Runs We have proved so far that the accepting runs of the
automaton were consistent and maximal labellings, which means that the automaton
accepts at most the same trees than the original CXPath formula. We now prove
that each consistent and maximal labelling of a tree is an accepting run of the
automaton.</p>
      <p>Proposition 3 (Accepting Runs). Let ρ be a maximal and consistent
labelling of the nodes of a given tree T with respect to a given formula α0. Then ρ
is an accepting run of QAα0 .</p>
      <p>This shows that the QA we constructed has indeed at most one accepting run
per tree, and that during this run, a node gets labelled by a state which contains
α0 iff α0 is true at that node. Hence, the automaton selects all the nodes which
are also selected by α0, and this concludes the proof of Theorem 2.
Acknowledgements Most of this work was done while all three authors were
at the University of Edinburgh, supported by EPSRC grant G049165 and
FETOpen Project FoX, grant agreement 233599.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cautis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Milo</surname>
          </string-name>
          .
          <article-title>Reasoning about XML update constraints</article-title>
          .
          <source>In PODS'07</source>
          , pages
          <fpage>195</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>Consistency of XML specifications</article-title>
          . In Inconsistency Tolerance, Springer,
          <year>2005</year>
          , pages
          <fpage>15</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Benedikt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          .
          <article-title>XPath satisfiability in the presence of DTDs</article-title>
          . In PODS'
          <volume>05</volume>
          , pages
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Bojanczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muscholl</surname>
          </string-name>
          , Th. Schwentick,
          <string-name>
            <given-names>L.</given-names>
            <surname>Segoufin</surname>
          </string-name>
          .
          <article-title>Two-variable logic on data trees and XML reasoning</article-title>
          .
          <source>In PODS'06</source>
          , pages
          <fpage>10</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi. Regular XPath</surname>
          </string-name>
          <article-title>: constraints, query containment and view-based answering for XML documents</article-title>
          . In Logic in Databases,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>E.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grumberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peled</surname>
          </string-name>
          . Model Checking, MIT Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>R.</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>Querying streaming XML using visibly pushdown automata</article-title>
          .
          <source>Technical Report</source>
          , University of Illinois,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Daniele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          .
          <article-title>Improved automata generation for linear temporal logic</article-title>
          .
          <source>In CAV'99</source>
          , pages
          <fpage>249</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Frick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grohe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          .
          <article-title>Query evaluation on compressed trees</article-title>
          .
          <source>In LICS'03</source>
          , pages
          <fpage>188</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. P.
          <article-title>Genev´es and</article-title>
          <string-name>
            <given-names>N.</given-names>
            <surname>Layaida</surname>
          </string-name>
          .
          <article-title>A system for the static analysis of XPath</article-title>
          .
          <source>ACM TOIS 24</source>
          ,
          <year>2006</year>
          ,
          <fpage>475</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. P.
          <article-title>Genev´es, N. Layaida, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          .
          <article-title>Efficient static analysis of XML paths and types</article-title>
          .
          <source>In ACM SIGPLAN Conf. on Programming Language Design and Implementation</source>
          ,
          <year>2007</year>
          , pages
          <fpage>342</fpage>
          -
          <lpage>351</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>R.</given-names>
            <surname>Gerth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peled</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Wolper</surname>
          </string-name>
          .
          <article-title>Simple on-the-fly automatic verification of linear temporal logic</article-title>
          .
          <source>In PSTV 1995</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>A. K. Gupta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Suciu Stream processing of XPath queries with predicates</article-title>
          .
          <source>In SIGMOD 2003</source>
          , pages
          <fpage>419</fpage>
          -
          <lpage>430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          .
          <article-title>Logics for unranked trees: an overview</article-title>
          .
          <source>In ICALP'05</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. L.
          <string-name>
            <surname>Libkin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Sirangelo</surname>
          </string-name>
          .
          <article-title>Reasoning about XML with temporal logics and automata</article-title>
          .
          <source>Journal of Applied Logic</source>
          <volume>8</volume>
          (
          <year>2010</year>
          ),
          <fpage>210</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. S. Maneth,
          <string-name>
            <given-names>T.</given-names>
            <surname>Perst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Seidl</surname>
          </string-name>
          .
          <article-title>Exact XML type checking in polynomial time</article-title>
          .
          <source>In ICDT 2007</source>
          , pages
          <fpage>254</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>W.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          , Th. Schwentick.
          <article-title>Simple off the shelf abstractions for XML schema</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Marx. XPath with conditional axis relations</article-title>
          .
          <source>In EDBT 2004</source>
          , pages
          <fpage>477</fpage>
          -
          <lpage>494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>M.</given-names>
            <surname>Marx</surname>
          </string-name>
          .
          <article-title>Conditional XPath</article-title>
          .
          <source>ACM TODS 30</source>
          (
          <year>2005</year>
          ),
          <fpage>929</fpage>
          -
          <lpage>959</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          .
          <article-title>Design and Analysis of Query Languages for Structured Documents</article-title>
          .
          <source>PhD Thesis</source>
          , U. Limburg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          . Automata, logic, and XML.
          <source>In CSL 2002</source>
          , pages
          <fpage>2</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          , Th. Schwentick.
          <article-title>Query automata over finite trees</article-title>
          .
          <source>TCS</source>
          ,
          <volume>275</volume>
          (
          <year>2002</year>
          ),
          <fpage>633</fpage>
          -
          <lpage>674</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>F.</given-names>
            <surname>Neven</surname>
          </string-name>
          , J. Van den Bussche.
          <article-title>Expressiveness of structured document query languages based on attribute grammars</article-title>
          .
          <source>J. ACM</source>
          <volume>49</volume>
          (
          <issue>1</issue>
          ):
          <fpage>56</fpage>
          -
          <lpage>100</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>J. Niehren</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Planque</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-M. Talbot</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Tison</surname>
          </string-name>
          .
          <article-title>N-ary queries by tree automata</article-title>
          .
          <source>In DBPL 2005</source>
          , pages
          <fpage>217</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. B.
          <string-name>
            <surname>-H. Schlingloff</surname>
          </string-name>
          .
          <article-title>Expressive completeness of temporal logic of trees</article-title>
          .
          <source>Journal of Applied Non-Classical Logics</source>
          <volume>2</volume>
          (
          <year>1992</year>
          ),
          <fpage>157</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Th</surname>
          </string-name>
          . Schwentick.
          <article-title>XPath query containment</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>33</volume>
          (
          <year>2004</year>
          ),
          <fpage>101</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Th</surname>
          </string-name>
          . Schwentick.
          <article-title>Automata for XML - a survey</article-title>
          .
          <source>JCSS</source>
          <volume>73</volume>
          (
          <year>2007</year>
          ),
          <fpage>289</fpage>
          -
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28. W. Thomas.
          <article-title>Languages, automata, and logic</article-title>
          .
          <source>In Handbook of Formal Languages</source>
          , volume III, pages
          <fpage>389</fpage>
          -
          <lpage>455</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          .
          <article-title>An automata-theoretic approach to linear temporal logic</article-title>
          .
          <source>Banff Higher Order Workshop</source>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>M. Y. Vardi</surname>
            and
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Wolper</surname>
          </string-name>
          .
          <article-title>Reasoning about infinite computations</article-title>
          .
          <source>Inf.&amp; Comput</source>
          .
          <volume>115</volume>
          (
          <year>1994</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>