=Paper=
{{Paper
|id=Vol-229/paper-4
|storemode=property
|title=Coupling Fragments of XPath with XML Indexing and Query Decomposition
|pdfUrl=https://ceur-ws.org/Vol-229/paper4.pdf
|volume=Vol-229
|dblpUrl=https://dblp.org/rec/conf/icdt/FletcherGWGP07
}}
==Coupling Fragments of XPath with XML Indexing and Query Decomposition==
<pdf width="1500px">https://ceur-ws.org/Vol-229/paper4.pdf</pdf>
<pre>
 Coupling Fragments of XPath with XML Indexing and
               Query Decomposition

   George H.L. Fletcher1 , Dirk Van Gucht1 , Yuqing Wu1 , Marc Gyssens2 , and Jan
                                    Paredaens3
                              1
                               Indiana University, Bloomington
                   {gefletch,vgucht,yuqwu}@cs.indiana.edu
                  2
                    Hasselt University & Transnational University Limburg
                            marc.gyssens@uhasselt.be
                                  3
                                    University of Antwerp
                             jan.paredaens@ua.ac.be


       Abstract. Recent studies have proposed structural summary techniques for path-
       query evaluation on semi-structured data sources. One major line of this research
       has been the introduction of the DataGuide, 1-index, 2-index, and A(k) indices,
       and subsequent investigations and generalizations. Another recent study has con-
       sidered structural characterizations of fragments of XPath, the standard path nav-
       igation language for XML documents. In this paper we provide a methodology
       on XPath query processing that couples these two areas of research on structural
       indices and query languages. To illustrate this methodology, we apply it to couple
       an upward-only XPath fragment with the A(k) and P (k) structural indices. With
       an eye towards applying this result to XPath query processing, we (1) show how
       upward-only XPath expressions can be evaluated directly on the corresponding
       indices and (2) leverage these results to develop generic techniques for making
       effective use of A(k) and P (k) indices for more general, frequently occurring
       XPath expressions.


1 Introduction
XML (eXtensible Markup Language) [2] provides a standard data format that is flexible
enough to be customized for various domains. An increasing number of applications
use XML as their data model or as the format for exchanging data among heteroge-
neous data repositories for the purpose of collaboration, data integration, and informa-
tion sharing, and querying XML document databases.
    XQuery is currently the most popular XML query language [4]. The fundamental
building block of XQuery is XPath [5]. XPath expressions are used to specify small
node-labeled trees which match portions of the hierarchical XML data. How to support
efficient access to XML data using XPath continues to be a critical research problem in
this domain.
    In XPath query evaluation, indices similar to those used in relational database sys-
tems – namely, value indices on tags and text values – are first used, together with
structural join algorithms [3, 1, 21]. This approach turns out to be simple and efficient.
 Index         Description        Index Nodes        Labeling        Query Evaluation
 DataGuide [7] Every      unique                     Can not use in- Target node set reach-
               label path of a                       coming path for able by an XPath ex-
               source     appear                     node labeling.  pression is the superset
               exactly once in                                       of the result.
               the index graph.
 Strong        Distinguishes      Nodes that share Labeled by the Can be used to directly
 DataGuide [7] nodes that are the same incom- unique incoming answer linear XPath
 1-index [14] reachable        by ing path up to the path up to the query of any length.
               different paths. root                 root
 2-index[14] Distinguishes        Pairs (vectors) of                 Can be used to di-
 T-index[14] pairs (vectors) of nodes that share                     rectly answer queries
               nodes that shares the same paths                      that match a given tem-
               the same set of                                       plate.
               paths.
 A(k)-index Distinguishes         Nodes that share                   Can be used to evalu-
 [12]          nodes with in- the same incom-                        ate linear XPath expres-
               coming path up ing path, up to                        sion directly when local
               to length k.       length k                           similarity is sufficient.
 D(k)-         Uses local index Nodes that shares                    Need verification when
 index[17]     to adapt to query the same incom-                     the XPath query expres-
 M (k)-        workload           ing path, accord-                  sion is longer than k.
 index [9]                        ing to local simi-
                                  larity.
          Table 1. Comparison Among Structural Indices for Semi-Structured Data.


However, the structural containment relationships native to XML data are not directly
captured by value indices.
    To directly capture the structural information of XML data, a family of XML in-
dices has been introduced. DataGuide [7] was the first to be proposed, followed by
the 1-index [14], which is based on a notion of bisimulation among nodes in an XML
document. These indices can be used to evaluate path expression accurately without
accessing the original data graph. Milo and Suciu [14] also introduced the 2-index and
T-index, based on similarity of pairs (vectors) of nodes. Unfortunately, these and other
early structural indices tend to be too large for practical use because they typically
maintain too fine-grained structural information about the document [10, 18].
    To remedy this, Kaushik et al. introduced the A(k)-index [12] which uses a notion of
bisimilarity on nodes relativized to paths of length k from nodes. This captures localized
structural information of a document, and can support path expressions of length up to
k. Focusing just on local similarity, the A(k)-index can be substantially smaller than
the 1-index and others.
    Several works have investigated maintenance and tuning of the A(k) indices. The
D(k)-index [17] and M (k)-index [9] extend the A(k)-index to adapt to query work-
load. Yi et al. [20] developed update techniques for the A(k)-index and 1-index. Finally,
the integrated use of structural and values indices has been explored [11], and there have
also been investigations on covering indices [10, 18] and index selection [16, 19]. The
characteristics of many of these structural indices are summarized in Table 1.
    The introduction of structural indices for XML data has lead to significant improve-
ments in the performance of XPath expression evaluation. Furthermore, great advances
have been made on the theoretical and practical underpinnings of expression process-
ing [13]. However, to date there lacks a general framework for investigating some of the
most fundamental questions about using indices in expression evaluation. Such ques-
tions include:

 1. For which classes of XPath expressions are structural indices ideally suited?
 2. Then, for these classes, how are its expressions optimally evaluated with the index?
 3. Can the answers to these questions be bootstrapped to provide general techniques
    for evaluation of arbitrary XPath expressions?

     To answer question (1), we follow a general, theoretical approach to providing struc-
tural characterizations of fragments of XPath recently proposed by Gyssens et al. [8]. In
that paper, the well-known concept of semantic indistinguishability of objects and rela-
tionships in a query language (in our case, nodes and paths in the tree-representations
of the XML documents, and XPath) is related to the purely structural, topological, in-
distinguishability of nodes and paths in these tree-representations. For example, in that
paper it is shown that for a downward fragment of XPath, semantic indistinguishability
of nodes is equivalent to downward-bisimilarity indistinguishability of these nodes.
     Recaputilating, recent studies have (1) introduced structural indexes for efficient
XPath query processing and (2) related semantic-indistinguishability with structural
indistinguishabilty of nodes and paths for certain fragments of XPath. In this paper we
introduce a new methodology that couples these two areas in a principled way, and that
is aimed for the benefit of efficiently evaluating classes of XPath expressions.
     To illustrate this methodology, we apply it to couple the important concept of A(k)
and P (k) structural indices with a certain upward XPath fragment. With an eye towards
applying this result to XPath query processing, we (1) show how upward-only XPath
expressions can be evaluated directly on the corresponding A(k) and P (k) indices and
(2) leverage these results to develop XPath query-decomposition techniques for making
effective use of A(k) and P (k) indices for more general, but, in practice, frequently
occurring XPath expressions.
     In particular, we give a precise characterization of the family of A(k)-indices in
terms of a downward fragment of XPath (Section 3). With this result in hand, we an-
swer question (2) by showing how expressions in these fragments can be evaluated
directly in terms of the A(k) and P (k) indices (Section 3). Finally, to address question
(3), we leverage our results on (1) and (2) to develop techniques for making effec-
tive use of A(k) and P (k) indices for important practical classes of XPath (Section
4). These results identify a new methodology for the study of the interaction between
structural indices and the evaluation of XPath expressions. Furthermore, the nature of
this methodology is such that it sheds light on techniques with immediate practical ap-
plication. As such, our methodology is a good example of how the theory and practice
for XPath query processing can be nicely linked.
                                                                                     Projects
                                                                                n0

                  Department                                                                                                            Department
                                 n1                                                                                                n2

           Name                            Project                 Project                            Project                             Name     n7       Web n8
      n3                              n4                      n5                                                n6

"Design"                                                                                                                                      "Marketing" "http://"
           Project                    Name       Lead      Name          Lead        Web          Name      Lead           Project             Project
                     n9          n10          n11       n12           n13                n14    n15      n16         n17                 n18

                                "D100"      "Sato"      "D200"     "Dubois"          "http://" "A100" "Ivanova"
               Name         Lead                                                                           Name             Lead          Name       Lead
            n19           n20                                                                                   n21 n22                 n23       n24

           "D100a" "Smith"                                                                                "A100a" "Chen" "A100b" "Adamo"


     Fig. 1. Fragment of an XML document. For reference, non-leaf nodes are given unique IDs.

2 Preliminary Notions
2.1 The XPath-Algebra
We first introduce the XML data model and algebraization of the navigational core
of XPath, following the presentation of Gyssens et al. [8]. In this paper, documents
are finite unordered node-labeled trees. More formally, a document D is a 4-tuple
(V, Ed, r, λ), with V the finite set of nodes, Ed ⊆ V × V the set of edges, r ∈ V
the root, and λ : V → L a node-labeling function into a countably infinite set of labels
L. A visualization of a fragment of a document is given in Figure 1. Useful notions of
height and distance in a document are defined as follows.
Definition 1. Let D = (V, Ed, r, λ) be a document and n, m ∈ V . The distance from
n to m in D, distance(n, m), is the length of the path from n to m in D. The height of
D, height(D), is the length of the longest path in D starting from the root r. In other
words, height(D) = max{distance(r, n) | n ∈ V }.
      We next present an algebraization of the logical navigational core of XPath.
Definition 2. The XPath-algebra consists of the primitives ε, ∅, ↓, ↑, and ` together with
the operations on expressions E1 ¦ E2 , E1 [E2 ], E1 ∪ E2 , E1 ∩ E2 , and E1 − E2 .4
Given a document D = (V, Ed, r, λ), the semantics of an XPath-algebra expression
E on D, denoted E(D), is always a binary relation over V . The semantics for each
operation is given in Table 2.
    The XPath-algebra semantics reflects a “global” perspective of expressions being
evaluated on an entire document. There is also a “local” semantic perspective, in which
expressions are viewed as working on a particular node, as follows.

Definition 3. Let E be an XPath-algebra expression and let D = (V, Ed, r, λ) be a
document. For m ∈ V , E(D)(m) = {n ∈ V | (m, n) ∈ E(D)}.

    To illustrate the XPath-algebra, the XPath query for retrieving all departments with
websites on the document of Figure 1, /Projects/Department[./Web], can be formu-
lated in the XPath-algebra with the expression Projects ¦ ↓ ¦ Department[↓ ¦ Web]
 4
     Since we are concerned in this study with reasoning on a fixed document of which the height
     is known, it is not necessary to introduce ancestor or descendant axes.
                                                   ε(D) = {(m, m) | m ∈ V }
                                                  ∅(D) = ∅
                                                 ↓ (D) = Ed
                                                 ↑ (D) = Ed−1
                                                   `(D) = {(m, m) | m ∈ V and λ(m) = `}
                                     E1 ∪ E2 (D) = E1 (D) ∪ E2 (D)
                                     E1 ∩ E2 (D) = E1 (D) ∩ E2 (D)
                                     E1 − E2 (D) = E1 (D) − E2 (D)
                                     E1 ¦ E2 (D) = {(m, n) | ∃w : (m, w) ∈ E1 (D)
                                                                                               & (w, n) ∈ E2 (D)}
                                      E1 [E2 ](D) = {(m, n) ∈ E1 (D)
                                                                                    | ∃w : (n, w) ∈ E2 (D)}.

                                                       Table 2. XPath-Algebra Semantics


                                                                                                                           Department
                                                                                                                               n1

                                                                                                                  Name                     Project
                       Department                                     Department                                   n3                       n4, n5
                          n1                                             n1

                                                           Name                      Project                              Name             Project           Lead
                                                            n3                       n4, n5                              n10, n12            n9             n11, n13
        Name             Project         Lead
   n3, n10, n12, n19    n4, n5, n9     n11, n13, n20
                                                                    Name             Project         Lead                           Name             Lead
                                                                  n10, n12, n19        n9         n11, n13, n20                      n19              n20


                       A(0)                                                       A(1)                                              A(2)
  Fig. 2. Full A(k) indices for the “Design” Department subtree in the document of Figure 1.

evaluated “locally” at the root node n0 . As further illustrations of XPath-algebra ex-
pressions, consider:
E1 . “Retrieve all department names in projects.”
                         Projects ¦ ↓ ¦ Department ¦ ↓ ¦ Name
E2 . “Retrieve the leaders of projects that have websites.”
                      Department ¦ ↓ ¦ Project[↓ ¦ Web] ¦ ↓ ¦ Lead
E3 . “Retrieve the leaders of projects that do not have websites.”
                       Department ¦ ↓ ¦ Project ¦ ↓ ¦ Lead − E2
E4 . “Retrieve the names of all departments that have a website and a project with a
     website.”
            Projects ¦ ↓ ¦ Department[↓ ¦ Web][↓ ¦ Project ¦ ↓ ¦ Web] ¦ ↓ ¦ Name
E5 . “Retrieve all projects which are sub-projects of projects with a website.”
                              Project[↑ ¦ Project ¦ ↓ ¦ Web]
    Recall that in the “global” semantics of the XPath-algebra, expressions evaluate to
pairs of satisfying nodes. For example, E1 (D) = {(n0 , n3 ), (n0 , n7 )} and E5 (D) =
{(n17 , n17 ), (n18 , n18 )}.
2.2   A(k) Indexes
Kaushik et al. have proposed the family of A(k)-indexes for graph-structured data,
based on a notion of node bisimilarity [12]. An A(k)-index is built on a partitioning
of the nodes of a semi-structured document. The definition of the equivalence relation
≡A(k) , specialized to XML documents, which induces this partition is as follows.
Definition 4. Let D = (V, Ed, r, λ) be a document, n1 , n2 ∈ V , and k ∈ N. We say
n1 and n2 are A(k)-equivalent in D, denoted n1 ≡A(k) n2 , if
 1. λ(n1 ) = λ(n2 ); and
 2. if k ≥ 1 then
    (a) n1 has a parent in D if and only if n2 has a parent in D; and
    (b) if n1 has parent p1 and n2 has parent p2 , then p1 ≡A(k−1) p2 .
A(k) equivalence defines partition blocks on the nodes n ∈ V of a document D =
(V, Ed, r, λ),
                        [n]k = {n0 | n0 ∈ V & n ≡A(k) n0 }
and the full A(k)-index is built directly on the collection of these partition blocks,
D/A(k) = {[n]k | n ∈ V }. We refer to this partition as simply the A(k)-index.
    Following Kaushik et al. [12], a full A(k)-index can be visualized as a graph wherein
each node represents a partition block and an edge exists from a node A to a node B
if an edge existed in the original document from a data node in A to a data node in
B. This construction is illustrated in Figure 2 on the Design Department subtree of the
document of Figure 1.
    We observe the following basic property of A(k) equivalence, which follows by a
simple induction on k.
Proposition 1. Let D = (V, Ed, r, λ) be a document, let k ∈ N, and let n1 , n2 ∈ V . If
n1 ≡A(k+1) n2 , then n1 ≡A(k) n2 .
    At first sight, we have defined an infinite class of equivalence relations on D. How-
ever, there is no point to consider A(k) equivalence beyond k = height(D), as can
also be seen by an inductive argument.
Proposition 2. Let D = (V, Ed, r, λ) be a document, and let k ≥ height(D). Then
A(k) equivalence coincides with A(height(D)) equivalence.
    We now extend A(k) equivalence as follows to capture upward paths to the root in
a document.
Definition 5. Let D = (V, Ed, r, λ) be a document and n1 , n2 ∈ V . We say n1 and n2
are A(∞)-equivalent in D, denoted n1 ≡A(∞) n2 , if
 1. λ(n1 ) = λ(n2 ); and
 2. n1 has a parent in D if and only if n2 has a parent in D; and
 3. if n1 has parent p1 and n2 has parent p2 , then p1 ≡A(∞) p2 .
    Not surprisingly, we have the following.
Proposition 3. Let D = (V, Ed, r, λ) be a document. Then A(∞) equivalence coin-
cides with A(height(D)) equivalence.
   Note that A(∞) equivalence on nodes induces the 1-index proposed by Milo and
Suciu [14] and the strong DataGuide of Goldman and Widom [7].
2.3   P (k) Indexes
We next generalize A(k) equivalence (Definition 4) to pairs of nodes.
Definition 6. Let D = (V, Ed, r, λ) be a document, k ∈ N, let m1 , n1 , m2 , n2 ∈ V
such that m1 is an ancestor of n1 and m2 is an ancestor of n2 . Then, (n1 , m1 ) and
(n2 , m2 ) are P (k) equivalent, denoted (n1 , m1 ) ≡P (k) (n2 , m2 ), if
 1. distance(n1 , m1 ) = distance(n2 , m2 ); and
 2. n1 ≡A(k) n2 .
Note that P (k) equivalence on pairs of nodes induces a “localized” version of the 2-
index proposed by Milo and Suciu [14], just as A(k) equivalence on nodes induces a
“localized” version of the 1-index [12].
    Consider the subtree D0 in the document of Figure 1 rooted at n4 . As an illustration
of the P (k)-index (i.e., the partition set D/P (k) induced by P (k) equivalence on a
document D) the P (0)-index of D0 is:
                    D0 /P (0) = {[(n10 , n10 ), (n19 , n19 )],
                                   [(n10 , n4 ), (n19 , n9 )], [(n19 , n4 )],
                                   [(n4 , n4 ), (n9 , n9 )], [(n9 , n4 )],
                                   [(n11 , n11 ), (n20 , n20 )],
                                   [(n11 , n4 ), (n20 , n9 )], [(n20 , n4 )]}
    Properties 1 and 2 of A(k) equivalence can easily be bootstrapped to properties of
P (k) equivalence.
Proposition 4. Let D = (V, Ed, r, λ) be a document, let k ∈ N and let m1 , n1 ,
m2 , n2 ∈ V such that m1 is an ancestor of n1 and m2 is an ancestor of n2 . If
(n1 , m1 ) ≡P (k+1) (n2 , m2 ) then (n1 , m1 ) ≡P (k) (n2 , m2 )
Proposition 5. Let D = (V, Ed, r, λ) be a document, and let k ≥ height(D). Then
P (k) equivalence coincides with P (height(D)) equivalence.
    As with A(k) equivalence, we extend P (k) as follows to capture upward paths to
the root in a document.
Definition 7. Let D = (V, Ed, r, λ) be a document and let m1 , n1 , m2 , n2 ∈ V such
that m1 is an ancestor of n1 and m2 is an ancestor of n2 . Then, (n1 , m1 ) and (n2 , m2 )
are P (∞) equivalent, denoted (n1 , m1 ) ≡P(∞) (n2 , m2 ), if
 1. distance(n1 , m1 ) = distance(n2 , m2 ); and
 2. n1 ≡A(∞) n2 .
    Proposition 3 for A(∞) equivalence can be bootstrapped to P (∞) equivalence in a
straightforward manner.
Proposition 6. Let D = (V, Ed, r, λ) be a document. Then P (∞) equivalence coin-
cides with P (height(D)) equivalence.
    Note that P (∞) equivalence induces the 2-index proposed by Milo and Suciu [14].
    As a closing remark, we observe from the definitions of A(k) and P (k) indices that,
given an A(k) index on a document D, it is straightforward to derive the corresponding
P (k) index for D, and vice versa.
3 Algebraic Characterizations of the A(k) and P (k) Indices
In this section, we give characterizations of the A(k) and P (k) family of indices based
on notions of indistinguishability of nodes and pairs of nodes in fragments of the XPath-
algebra. The notion of language indistinguishability on nodes is made precise in the
following definition, following Gyssens et al. [8].

Definition 8. Let D = (V, Ed, r, λ) be a document, m1 , m2 ∈ V , and F a fragment
of the XPath-algebra. We say m1 and m2 are indistinguishable by F , denoted m1 ≡F
m2 , if for any expression E in F , it is the case that E(D)(m1 ) = ∅ if and only if
E(D)(m2 ) = ∅.

Language indistinguishability on pairs of nodes is defined as follows.
Definition 9. Let D = (V, Ed, r, λ) be a document, n1 , m1 , n2 , m2 ∈ V , and F a
fragment of the XPath-algebra. We say (n1 , m1 ) and (n2 , m2 ) are indistinguishable by
F , denoted (n1 , m1 ) ≡F (n2 , m2 ), if for any expression E in F , it is the case that
(n1 , m1 ) ∈ E(D) if and only if (n2 , m2 ) ∈ E(D).
   Next, we introduce a family of algebras which we use to precisely characterize the
A(k) and P (k) indices.

Definition 10. We recursively define the upward-k XPath algebras, U (k) for each k ∈
N, as follows.
 1. U (0) = the set of XPath-algebra expressions without occurrences of the “↓” and
    “↑” operators.
 2. For k ≥ 1,
    (a) if E ∈ U (k − 1), then E ∈ U (k);
    (b) ↑ ∈ U (k);
    (c) if E1 ∈ U (k) and E2 ∈ U (k), then
           – E1 ? E2 ∈ U (k), for ? = ∪, ∩, −;
    (d) if E1 ∈ U (k1 ) and E2 ∈ U (k2 ), and k1 + k2 ≤ k, then
           – E1 ¦ E2 ∈ U (k) and E1 [E2 ] ∈ U (k); and
 3. No other expressions are in U (k).

As a very simple example of U (k) expressions, note that ↑ ¦ ↑ ¦ ↑ is in U (3) but not
in U (2).
    The following useful observation about the U (k) algebras can be shown by a simple
inductive argument.

Proposition 7. Let D = (V, Ed, r, λ) be a document, k ∈ N, m, n ∈ V , and E ∈
U (k). If (n, m) ∈ E(D), then m is an ancestor of n such that distance(n, m) ≤ k.

    The U (k) algebras are extended as follows to an algebra with unrestricted use of
the “↑” primitive.

Definition 11. The upward XPath algebra U (∞) is the set of all XPath-algebra ex-
pressions without occurrences of the ↓ primitive.
                              S
   In other words, U (∞) = k∈N U (k).
   Compositions in which the number of nested “↑” primitives is strictly greater than
height(D) must return the empty set, and are therefore not very meaningful. The fol-
lowing result should therefore not come as a surprise.

Proposition 8. The U (∞) and U (height(D)) algebras are equivalent in expressive
power.

   In particular, indistinguishability by U (height(D)) and indistinguishability by
U (∞) coincide.


3.1 Structural Characterizations of U (k) and U (∞) Indistinguishability

In this section, we provide precise structural characterizations of indistinguishability
in the upward algebras. In particular, for each k > 0, we show that on a fixed doc-
ument, indistinguishability of nodes by the U (k) algebra corresponds precisely with
A(k) equivalence of nodes, and indistinguishability of node pairs by the U (k) algebra
corresponds precisely with P (k) equivalence. We also extend these results to the full
upward algebra U (∞). We follow the methodology of Gyssens et al. [8] to establish
these correspondences.
    The following facts are crucial in the sequel.

Lemma 1. Let D = (V, Ed, r, λ) be a document, k ∈ N, and n1 , m1 , n2 ∈ V such
that m1 is an ancestor of n1 and distance(n1 , m1 ) ≤ k. If n1 ≡A(k) n2 , then there
exists m2 ∈ V such that m2 is an ancestor of n2 and (n1 , m1 ) ≡P (distance(n1 ,m1 ))
(n2 , m2 ).5 Furthermore, m1 ≡A(k−distance(n1 ,m1 )) m2 .

Proof. By induction on k. For the base case, k = 0, clearly m1 = n1 and λ(n1 ) =
λ(n2 ). The statement holds for m2 = n2 .
    For k ≥ 1, we can assume that the statement holds for k−1. If n1 ≡A(k) n2 , then ei-
ther (1) both n1 and n2 have no parents, or (2) they both have parents p1 and p2 , respec-
tively, such that p1 ≡A(k−1) p2 (by definition of A(k) equivalence). In case (1), clearly
m1 = n1 and the statement holds for m2 = n2 . In case (2), distance(p1 , m1 ) ≤ k −1,
and by the definition of A(k) equivalence, p1 ≡A(k−1) p2 . By the induction hypothesis,
there exists an ancestor m2 of p2 such that (p1 , m1 ) ≡P (distance(p1 ,m2 )) (p2 , m2 ) and
m1 ≡A(k−1−distance(p1 ,m1 )) m2 . It readily follows that (n1 , m1 ) ≡P (distance(n1 ,m1 ))
(n2 , m2 ) and m1 ≡A(k−distance(n1 ,m1 )) m2 .

Proposition 9. Let D = (V, Ed, r, λ) be a document, k ∈ N, E ∈ U (k), and n1 , m1 ,
n2 , m2 ∈ V such that m1 is an ancestor of n1 and m2 is an ancestor of n2 , and
(n1 , m1 ) ≡P (k) (n2 , m2 ). If (n1 , m1 ) ∈ E(D), then (n2 , m2 ) ∈ E(D), and vice
versa.

Proof. First observe that it follows from E ∈ U (k) and (n1 , m1 ) ∈ E(D) that
distance(n1 , m1 ) ≤ k, by Proposition 7.
 5
     And even stronger, (n1 , m1 ) ≡P (k) (n2 , m2 ).
    The proof is by induction on k. The base case, k = 0, follows straightforwardly
from the definition of P (0) equivalence and a simple structural induction on expressions
in U (0). Now assume that k ≥ 1, and that the statement holds for 0, 1, 2, . . . , k − 1.
The proof goes by structural induction on expressions in U (k). Thus, let E ∈ U (k).

 – E ∈ U (k − 1). The statement holds by the induction hypothesis.
 – E =↑. If (n1 , m1 ) ∈↑ (D), then m1 is the parent of n1 . Since (n1 , m1 ) ≡P (k)
   (n2 , m2 ), it follows in particular that m2 is the parent of n2 . We conclude that
   (n2 , m2 ) ∈↑ (D).
 – E = E1 ∪ E2 , for E1 and E2 ∈ U (k). Suppose (n1 , m1 ) ∈ E(D). Then
   (n1 , m1 ) ∈ E1 (D) or (n1 , m1 ) ∈ E2 (D). Without loss of generality, assume
   (n1 , m1 ) ∈ E1 (D). Then by structural induction, (n2 , m2 ) ∈ E1 (D), and we con-
   clude (n2 , m2 ) ∈ E(D).
 – E = E1 ∩ E2 or E = E1 − E2 , for E1 and E2 ∈ U (k). Similar to the previous
   case.
 – E = E1 ¦ E2 , for E1 ∈ U (k1 ) and E2 ∈ U (k2 ), such that k1 + k2 ≤ k. Sup-
   pose (n1 , m1 ) ∈ E(D). Then there exists a node w1 ∈ V such that (n1 , w1 ) ∈
   E1 (D) and (w1 , m1 ) ∈ E2 (D). By Lemma 7, distance(n1 , w1 ) ≤ k1 and
   distance(w1 , m1 ) ≤ k2 . By Lemma 1, there exists a node w2 ∈ V such that
   (n1 , w1 ) ≡P (distance(n1 ,w1 )) (n2 , w2 ), and w1 ≡A(k−distance(n1 ,w1 )) w2 . Since,
   k2 ≤ k − distance(n1 , w1 ), by Lemma 1, a node m0 ∈ V exists with (w1 , m1 )
   ≡P (distance(w1 ,m1 )) (w2 , m0 ). Since (n1 , w1 ) ≡P (distance(n1 ,w1 )) (n2 , w2 ), and
   (w1 , m1 ) ≡P (distance(w1 ,m1 )) (w2 , m0 ), we know, by the definition of ≡P (k1 ) and
   ≡P (k2 ) , that distance(n2 , w2 ) = distance(n1 , w1 ) and distance(w2 , m0 ) =
   distance(w1 , m1 ). Consequently, distance(n2 , m0 ) = distance(n1 , m1 ), and
   since m0 is the unique ancestor at this distance, we conclude that m0 = m2 . Thus
   (w1 , m1 ) ≡P (k2 ) (w2 , m2 ). By the induction hypothesis, we can conclude that
   (n2 , w2 ) ∈ E1 (D) and (w2 , m) ∈ E2 (D) and thus (n2 , m2 ) ∈ E(D).
 – E = E1 [E2 ], for E1 ∈ U (k1 ) and E2 ∈ U (k2 ), such that k1 + k2 ≤ k. Similar to
   the previous case.
   An important corollary of Proposition 9 is the following observation.

Proposition 10. Let E ∈ U (k), k ∈ N, and let D = (V, Ed, r, λ) be a document.
Then E(D) is the union of some partition blocks of the P (k)-index.

   We are now prepared to establish the main results of this section.

Theorem 1. Let D = (V, Ed, r, λ) be a document, n1 , n2 ∈ V , and k ∈ N. Then
n1 ≡U (k) n2 if and only if n1 ≡A(k) n2 .

Proof. (If) Suppose n1 ≡A(k) n2 and let E ∈ U (k) such that E(D)(n1 ) 6= ∅. Hence,
there exists m1 ∈ V such that (n1 , m1 ) ∈ E(D). Then, by Lemma 1 and Proposition 9,
there exists m2 ∈ V such that (n2 , m2 ) ∈ E(D), and therefore E(D)(n2 ) 6= ∅. It
follows symmetrically that if E(D)(n2 ) 6= ∅, then E(D)(n1 ) 6= ∅. We conclude that
n1 ≡U (k) n2 .
    (Only if) For the converse, assume that n1 ≡U (k) n2 . We first establish two facts.
 1. λ(n1 ) = λ(n2 ). Otherwise, consider the expression λ(n1 ) ∈ U (k). Then
    λ(n1 )(D)(n1 ) 6= ∅ and λ(n1 )(D)(n2 ) = ∅, a contradiction.
 2. if k ≥ 1, then either n1 and n2 both have parents or are both the root. Otherwise,
    assume that, e.g., n1 has a parent and n2 is the root. Consider the expression ↑.
    Clearly, ↑ (D)(n1 ) 6= ∅, but ↑ (D)(n2 ) = ∅, a contradiction.
    We now prove the statement of the theorem by induction on k. For k = 0, this
    follows immediately from Fact 1.
    Therefore, consider the case k ≥ 1, and assume the statement holds for k − 1. If n1
    and n2 are both the root, then the statement holds trivially. In the remaining case,
    we know by Facts 1 and 2 that λ(n1 ) = λ(n2 ), n1 has a parent, say p1 , and n2
    has parent, say p2 . Assume that p1 is not indistinguishable from p2 by U (k − 1).
    Then, there exists an expression E in U (k − 1) such that E(D)(p1 ) 6= ∅ and
    E(D)(p2 ) = ∅, or vice-versa. Now consider the expression F =↑ ¦E which is in
    U (k). Clearly, F (D)(n1 ) 6= ∅ and F (D)(n2 ) = ∅, or vice-versa, a contradiction.
    Thus p1 ≡U (k−1) p2 , and therefore, by the induction hypothesis, p1 ≡A(k−1) p2 ,
    from which the statement of the theorem immediately follows.

    Following a similar argument, we have the following characterization of U (k) in-
distinguishability on node pairs.

Theorem 2. Let D = (V, Ed, r, λ) be a document, n1 , m1 , n2 , m2 ∈ V , and k ∈ N.
Then (n1 , m1 ) ≡U (k) (n2 , m2 ) if and only if (n1 , m1 ) ≡P (k) (n2 , m2 ).

   We remind the reader that indistinguishability by U (∞) and indistinguishability
by U (height(D)) coincide (Proposition 8). Furthermore, by Proposition 3, A(∞)
equivalence coincides with A(height(D)) equivalence. Therefore, the result below
immediately follows.

Theorem 3. Let D = (V, Ed, r, λ) be a document and n1 , n2 ∈ V . Then n1 ≡U (∞)
n2 if and only if n1 ≡A(∞) n2 .

    Similarly, by Proposition 6, P (∞) equivalence coincides with P (height(D)) equiv-
alence. Therefore, the result below immediately follows.

Theorem 4. Let D = (V, Ed, r, λ) be a document and n1 , m1 , n1 , n2 ∈ V . Then
(n1 , m1 ) ≡U (∞) (n2 , m2 ) if and only if (n1 , m1 ) ≡P(∞) (n2 , m2 ).


4 XPath Query Evaluation with A(k) and P (k) Indices
In this section, we consider strategies to evaluate XPath expressions, leveraging the re-
sults and characterizations of the A(k) and P (k) indices (Sections 3). First, we discuss
how A(k) and P (k) indices can be used in the evaluation of upward XPath expres-
sions, as expressed in the U (∞) and U (k) algebras. Then, we focus on the evaluation
of XPath algebra query expressions that can be expressed in certain downward XPath
algebras – namely, the downward algebra D(∞) and the downward-k-algebras D(k)
(for each k ∈ N). These algebras are defined just like the upward algebras except that
one is restricted to using ↓ operations instead of ↑ operations.
4.1 Evaluating Upward Expressions

Our analysis of the A(k) and P (k) indices shows that they are the ideal index struc-
tures for evaluation of U (k) expressions. Indeed, consider an expression E ∈ U (k), a
document D, and its P (k)-index. Proposition 10 states that the evaluation of E on D,
E(D), can be evaluated as the union of certain P (k) partition blocks. Now, since A(k)
and P (k) indices are inter-derivable, it is clear that an A(k)-index on D can also be
used effectively in the evaluation of E(D).
    Actually, any U (∞) expression can also make effective use of an existing A(k)
or P (k) index. For example, consider the expression E =↑ ¦ ↑ ¦ ↑ ¦ ↑ and suppose
that we have only the P (2)-index available. Since E ∈ U (4), in general E(D) will
not necessarily be a union of some of the partition blocks of P (2). However, if we
decompose E as E1 ¦ E2 , where E1 =↑ ¦ ↑ and E2 =↑ ¦ ↑, then E(D) = E1 (D) 1
E2 (D), and since E1 and E2 are in U (2), they can obviously be evaluated directly with
the existing index as discussed above.


4.2 Evaluating Downward Expressions

As shown in XPath use cases and various XML benchmarks, over ninety percent of
XPath queries used in real world applications use navigation just along the parent-child
axis, rather than mixing the parent-child and child-parent axes. These queries can be
naturally expressed in the D(∞) and D(k) algebras.
    A natural way to evaluate D(∞) expressions is to perform top-down navigation.
We demonstrate that such expressions can also be evaluated using path decomposition
and A(k) and P (k) indices.
    The main idea behind turning top-down evaluation into bottom-up evaluation is
to convert a downward expression, or certain of its sub-expressions, into an upward ex-
pression using a technique that we will refer to as “inverting expressions.” We first illus-
trate this technique on predicate-free downward expressions, and then consider general
downward expressions.

Downward Expressions without Predicates In general, predicate-free expressions
in the downward algebras can be “inverted” into predication-free expressions in the
corresponding upward algebras using the rewrite rules shown in Table 3.


                                         E → E −1
                                          ²→²
                                         ∅→∅
                                         ↓→↑
                                         λ̂ → λ̂
                                  E1 ∪ E2 → E1−1 ∪ E2−1
                                  E1 ∩ E2 → E1−1 ∩ E2−1
                                  E1 − E2 → E1−1 − E2−1
                                  E1 ¦ E2 → E2−1 ¦ E1−1 .
                        Table 3. Inversion Rewrite Rules for D(∞).
    As discussed above, an upward algebraic expression in U (k) can be evaluated by
P (k)-index lookup and set unions. Therefore, a predicate-free downward algebraic ex-
pression in D(k) can be evaluated by inversion, P (k)-index lookup, and set unions.

Downward Expressions with Predicates Now consider the evaluation of downward
algebra expressions wherein predicate operations occur. A simple example is the ex-
pression E2 =↓ [↓]. Applied to a document, E2 evaluates to the document’s parent-
child pairs for children that have at least one child itself. As in Section 4.2, to evaluate
E2 on a document D, we could consider the concept of inverting E2 into an expres-
sion E2−1 ∈ U (∞) such that E2 (D) = (E2−1 (D))−1 . Unfortunately, this approach
does not work here. In fact, we can construct a document D2 such for each expression
F ∈ U (∞), E2 (D2 ) 6= F (D)−1 .
     Clearly E2 is equivalent to the XPath-algebra expression ↓ ¦ ↓ ¦ ↑. Notice that
this expression is neither a downward nor an upward expression. However its sub-
expression G1 =↓ ¦ ↓ is in D(2) and its sub-expression G2 =↑ is in U (1). Applying
the inversion technique described in Section 4.2 to G1 , the evaluation E2 (D) on a docu-
ment D, can be accomplished by computing the relation (G−1       1 (D))
                                                                         −1
                                                                             ./ G2 (D). And,
                                                   −1
as indicated in this section, the evaluations of G1 (D) =↑ ¦ ↑ (D) and G2 (D) =↑ (D)
can be done efficiently using the P (2) and P (1) indices of D, respectively.
     Given that the selectivity of a longer path is no larger than that of short sub-paths of
the path, evaluating G1 reduces the search space to the minimum that can be obtained
on such a chain expression. Starting from any given node, upward navigation in an
XML data tree, unlike downward ones, has one and only one route to follow, which
is to reach its parent. Therefore, it is reasonable to claim that the result of G−1 1 (D) is
substantially smaller than that of G2 (D), and the ./ operation can be further optimized
by G−11 (D) followed by an upward navigation.

  We will now consider a slightly more complicated downward expression E3 ∈
D(3) which retrieves information about leaders of projects that have a web site:

               E3 = Department ¦ ↓ ¦ Project[↓ ¦ Web] ¦ ↓ ¦ Lead.

E3 can be represented as an expression pattern tree, as illustrated in Figure 3(a). The
shaded node can be interpreted as the “answer” of E3 .


                            Department
                                                                             G3

                                         G1           G2
                            Project
                                                            G4                    G5

                    Lead         Web
                      (a)                       (b)                    (c)

                                 Fig. 3. Chain pattern tree for E3 .


    First, assume there exists a P (2)-index (i.e., where k = 2, the height of the expres-
sion). To take full advantage of the P (2)-index is to find the longest sub-expressions
that can be inverted and evaluated directly on it. As shown in Figure 3(b), there are two
natural chains of length 2, G1 and G2 , in the pattern tree of E3 . There are also natural
chains of length 1, G3 , G4 , and G5 , as shown in Figure 3(c).
    Using G1 , G2 , and G4 , the expression E3 is equivalent to the expression H1 defined
as follows:
                          H1 = ((G1 ¦ ↑) ∩ (G2 ¦ ↑)) ¦ G4 ,

and therefore, for a document D, E3 (D) can be computed as follows:
                                                  
                           ((G−1
                              1 (D))
                                     −1
                                         ./ ↑ (D))
              E3 (D) =              ∩              ./ (G−1
                                                           4 (D))
                                                                   −1
                                                                      .
                              −1     −1
                           ((G2 (D))     ./ ↑ (D))
All sub-expressions in this transformed expression of E3 are in U (2), and hence can
be be directly evaluated using the P (2) partition blocks without any navigation.
    However, it may not always be the case that P (k)-partitions exist for k larger than
or equal to the height of the query expression. Again, for E3 , consider the case when
only a P (1)-index is available. In this case, the longest path expressions that can take
advantage of the index are of length 1. Such expressions are G3 , G4 and G5 . Using
these subexpressions, E3 is equivalent with the expression H2 defined as follows:
                  H2 = (((G3 ¦ G4 ) ¦ ↑) ∩ ((G3 ¦ G5 ) ¦ ↑)) ¦ G4 .
    Similar to the generalization from E2 to E3 , the decomposition + inversion tech-
niques can be used to transform any arbitrary expression in the downward algebra
D(∞) to a set of predicate-free sub-expressions in D(∞) that can be inverted to cor-
responding expressions in the upward algebra U (∞), which in turn can be evaluated
directly by P (k)-index lookups and set unions. The decomposition is partially deter-
mined by k, the degree of local similarity in the P (k)-indices which are available.
Furthermore, this algebraic transformation only provides guidelines for evaluating each
such expression. The optimal physical plan to evaluate an XPath query depends on other
factors, such as the cardinality of the intermediate results of each sub-expression, and
the cost model of the join operations.

5 Future Directions
It is clear that many aspects of the query evaluation guidelines presented in the previous
section warrant further investigation. We close by listing a few of the more interesting
directions for research. First, a more thorough study of query decomposition and inver-
sion algorithms is crucial. Second, it is interesting to develop approaches to (1) quickly
identify which partition blocks correspond to an arbitrary given input expression and
document, and (2) to use this identification to efficiently obtain the full query result. We
have already taken steps in this direction, by obtaining “labeling” expressions which
have the property that each of these evaluates to a unique partition block. Due to space
limitations, we were unable to incorporate these results in the paper. However, in [6]
(Section 4), we have identified these labeling expressions for U (()k) and U (∞). Fi-
nally, the results of this paper are applicable in workload driven scenarios [9, 15, 17],
such as determining which indices to materialize and which to derive, as a function of
the characteristics of an observed workload.
Acknowledgments. We thank Changqing Lin for discussions on the A(k) indices.


References
 [1] Shurug Al-Khalifa, H. V. Jagadish, Jignesh M. Patel, Yuqing Wu, Nick Koudas, and Divesh
     Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In
     ICDE, 2002.
 [2] T. Bray, J. Paoli, C.M. Sperberg-McQueen, and Francois Yergeau. Etensible Markup Lan-
     guage (XML) 1.0 (third edition) - W3C recommendation, 2004.
 [3] Nicolas Bruno, Nick Koudas, and Divesh Srivastava:. Holistic twig joins: optimal XML
     pattern matching. In SIGMOD, 2002.
 [4] D. Chamberlin et al. XQuery 1.0: An XML query language, W3C, 2003.
 [5] J. Clark and S. DeRose (eds.).               XML path language (XPath) version 1.0.
     http://www.w3.org/TR/XPATH.
 [6] George H. L. Fletcher, Dirk Van Gucht, Yuqing Wu, Marc Gyssens, and Jan Paredaens.
     An XPath Algebraic Characterization of A(k) and P(k) Indices with Applications to Query
     Processing. Indiana University Technical Report No. 639, October 2006.
 [7] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimiza-
     tion in semistructured databases. In VLDB, pages 436–445, 1997.
 [8] Marc Gyssens, Jan Paredaens, Dirk Van Gucht, and George H. L. Fletcher. Structural
     Characterizations of the Semantics of XPath as Navigation Tool on a Document. In ACM
     PODS, pages 318–327, 2006.
 [9] Hao He and Jun Yang. Multiresolution indexing of XML for frequent queries. In IEEE
     ICDE, 2004.
[10] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering
     indexes for branching path queries. In SIGMOD, 2002.
[11] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrish-
     nan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, 2004.
[12] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local
     similarity for efficient indexing of paths in graph structured data. In IEEE ICDE, 2002.
[13] Christoph Koch. Processing queries on tree-structured data efficiently. In ACM PODS,
     pages 213–224, 2006.
[14] Tova Milo and Dan Suciu. Index structures for path expressions. In ICDT, pages 277–295,
     1999.
[15] Jun-Ki Min, Chin-Wan Chung, and Kyuseok Shim. An Adaptive Path Index for XML Data
     using the Query Workload. Inf. Systems, 30(6):467–487, 2005.
[16] Mirella Moura Moro, Zografoula Vagena, and Vassilis J. Tsotras. Tree-pattern queries on
     a lightweight XML processor. In VLDB, 2005.
[17] Chen Qun, Andrew Lim, and Kian Win Ong. D(k)-index: An adaptive structural summary
     for graph-structured data. In SIGMOD, 2003.
[18] Prakash Ramanan. Covering indexes for XML queries: Bisimulation - simulation = nega-
     tion. In VLDB, 2003.
[19] Kanda Runapongsa, Jignesh M. Patel, Rajesh Bordawekar, and Sriram Padmanabhan.
     XIST: An XML index selection tool. In XSym, pages 219–234, 2004.
[20] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenence of XML structural
     indexes. In ACM SIGMOD, pages 491–502, 2004.
[21] Chun Zhang, Jeffrey F. Naughton, David J. DeWitt, Qiong Luo, and Guy M. Lohman. On
     supporting containment queries in relational database management systems. In SIGMOD,
     2001.

</pre>