-

Coupling Fragments of XPath with XML Indexing and Query Decomposition

George H.L. Fletcher

Dirk Van Gucht

Yuqing Wu

Marc Gyssens

marc.gyssens@uhasselt.be 0

Jan Paredaens

jan.paredaens@ua.ac.be 2 0 Hasselt University & Transnational University Limburg 1 Indiana University , Bloomington , USA 2 University of Antwerp

Recent studies have proposed structural summary techniques for pathquery evaluation on semi-structured data sources. One major line of this research has been the introduction of the DataGuide, 1-index, 2-index, and A(k) indices, and subsequent investigations and generalizations. Another recent study has considered structural characterizations of fragments of XPath, the standard path navigation language for XML documents. In this paper we provide a methodology on XPath query processing that couples these two areas of research on structural indices and query languages. To illustrate this methodology, we apply it to couple an upward-only XPath fragment with the A(k) and P (k) structural indices. With an eye towards applying this result to XPath query processing, we (1) show how upward-only XPath expressions can be evaluated directly on the corresponding indices and (2) leverage these results to develop generic techniques for making effective use of A(k) and P (k) indices for more general, frequently occurring XPath expressions.

XML (eXtensible Markup Language) [2] provides a standard data format that is flexible enough to be customized for various domains. An increasing number of applications use XML as their data model or as the format for exchanging data among heterogeneous data repositories for the purpose of collaboration, data integration, and information sharing, and querying XML document databases.

XQuery is currently the most popular XML query language [4]. The fundamental building block of XQuery is XPath [5]. XPath expressions are used to specify small node-labeled trees which match portions of the hierarchical XML data. How to support efficient access to XML data using XPath continues to be a critical research problem in this domain.

In XPath query evaluation, indices similar to those used in relational database systems – namely, value indices on tags and text values – are first used, together with structural join algorithms [3, 1, 21]. This approach turns out to be simple and efficient.

Index

Description

Labeling DataGuide [7] Every unique Can not use in- Target node set reachlabel path of a coming path for able by an XPath exsource appear node labeling. pression is the superset exactly once in of the result.

the index graph.

Strong Distinguishes Nodes that share Labeled by the Can be used to directly DataGuide [7] nodes that are the same incom- unique incoming answer linear XPath 1-index [14] reachable by ing path up to the path up to the query of any length.

different paths. root root 2-index[14] Distinguishes Pairs (vectors) of Can be used to diT-index[14] pairs (vectors) of nodes that share rectly answer queries nodes that shares the same paths that match a given temthe same set of plate.

paths.

A(k)-index Distinguishes Nodes that share Can be used to evalu[12] nodes with in- the same incom- ate linear XPath exprescoming path up ing path, up to sion directly when local to length k. length k similarity is sufficient.

D(k)- Uses local index Nodes that shares Need verification when index[17] to adapt to query the same incom- the XPath query expresM (k)- workload ing path, accord- sion is longer than k. index [9] ing to local simi

larity.

However, the structural containment relationships native to XML data are not directly captured by value indices.

To directly capture the structural information of XML data, a family of XML indices has been introduced. DataGuide [7] was the first to be proposed, followed by the 1-index [14], which is based on a notion of bisimulation among nodes in an XML document. These indices can be used to evaluate path expression accurately without accessing the original data graph. Milo and Suciu [14] also introduced the 2-index and T-index, based on similarity of pairs (vectors) of nodes. Unfortunately, these and other early structural indices tend to be too large for practical use because they typically maintain too fine-grained structural information about the document [10, 18].

To remedy this, Kaushik et al. introduced the A(k)-index [12] which uses a notion of bisimilarity on nodes relativized to paths of length k from nodes. This captures localized structural information of a document, and can support path expressions of length up to k. Focusing just on local similarity, the A(k)-index can be substantially smaller than the 1-index and others.

Several works have investigated maintenance and tuning of the A(k) indices. The D(k)-index [17] and M (k)-index [9] extend the A(k)-index to adapt to query workload. Yi et al. [20] developed update techniques for the A(k)-index and 1-index. Finally, the integrated use of structural and values indices has been explored [11], and there have also been investigations on covering indices [10, 18] and index selection [16, 19]. The characteristics of many of these structural indices are summarized in Table 1.

The introduction of structural indices for XML data has lead to significant improvements in the performance of XPath expression evaluation. Furthermore, great advances have been made on the theoretical and practical underpinnings of expression processing [13]. However, to date there lacks a general framework for investigating some of the most fundamental questions about using indices in expression evaluation. Such questions include: 1. For which classes of XPath expressions are structural indices ideally suited? 2. Then, for these classes, how are its expressions optimally evaluated with the index? 3. Can the answers to these questions be bootstrapped to provide general techniques for evaluation of arbitrary XPath expressions?

To answer question ( 1 ), we follow a general, theoretical approach to providing structural characterizations of fragments of XPath recently proposed by Gyssens et al. [8]. In that paper, the well-known concept of semantic indistinguishability of objects and relationships in a query language (in our case, nodes and paths in the tree-representations of the XML documents, and XPath) is related to the purely structural, topological, indistinguishability of nodes and paths in these tree-representations. For example, in that paper it is shown that for a downward fragment of XPath, semantic indistinguishability of nodes is equivalent to downward-bisimilarity indistinguishability of these nodes.

Recaputilating, recent studies have ( 1 ) introduced structural indexes for efficient XPath query processing and ( 2 ) related semantic-indistinguishability with structural indistinguishabilty of nodes and paths for certain fragments of XPath. In this paper we introduce a new methodology that couples these two areas in a principled way, and that is aimed for the benefit of efficiently evaluating classes of XPath expressions.

To illustrate this methodology, we apply it to couple the important concept of A(k) and P (k) structural indices with a certain upward XPath fragment. With an eye towards applying this result to XPath query processing, we ( 1 ) show how upward-only XPath expressions can be evaluated directly on the corresponding A(k) and P (k) indices and ( 2 ) leverage these results to develop XPath query-decomposition techniques for making effective use of A(k) and P (k) indices for more general, but, in practice, frequently occurring XPath expressions.

In particular, we give a precise characterization of the family of A(k)-indices in terms of a downward fragment of XPath (Section 3). With this result in hand, we answer question ( 2 ) by showing how expressions in these fragments can be evaluated directly in terms of the A(k) and P (k) indices (Section 3). Finally, to address question ( 3 ), we leverage our results on ( 1 ) and ( 2 ) to develop techniques for making effective use of A(k) and P (k) indices for important practical classes of XPath (Section 4). These results identify a new methodology for the study of the interaction between structural indices and the evaluation of XPath expressions. Furthermore, the nature of this methodology is such that it sheds light on techniques with immediate practical application. As such, our methodology is a good example of how the theory and practice for XPath query processing can be nicely linked.

n0 n4 Project n5 Project

Project n6 n2 2.1 The XPath-Algebra We first introduce the XML data model and algebraization of the navigational core of XPath, following the presentation of Gyssens et al. [8]. In this paper, documents are finite unordered node-labeled trees. More formally, a document D is a 4-tuple (V; Ed; r; ¸), with V the finite set of nodes, Ed µ V £ V the set of edges, r 2 V the root, and ¸ : V ! L a node-labeling function into a countably infinite set of labels L. A visualization of a fragment of a document is given in Figure 1. Useful notions of height and distance in a document are defined as follows.

Definition 1. Let D = (V; Ed; r; ¸) be a document and n; m 2 V . The distance from n to m in D, distance(n; m), is the length of the path from n to m in D. The height of D, height(D), is the length of the longest path in D starting from the root r. In other words, height(D) = maxfdistance(r; n) j n 2 V g.

We next present an algebraization of the logical navigational core of XPath. Definition 2. The XPath-algebra consists of the primitives "; ;; #; ", and ` together with the operations on expressions E1 ¦ E2; E1[E2]; E1 [ E2; E1 \ E2, and E1 ¡ E2.4 Given a document D = (V; Ed; r; ¸), the semantics of an XPath-algebra expression E on D, denoted E(D), is always a binary relation over V . The semantics for each operation is given in Table 2.

The XPath-algebra semantics reflects a “global” perspective of expressions being evaluated on an entire document. There is also a “local” semantic perspective, in which expressions are viewed as working on a particular node, as follows.

Definition 3. Let E be an XPath-algebra expression and let D = (V; Ed; r; ¸) be a document. For m 2 V , E(D)(m) = fn 2 V j (m; n) 2 E(D)g.

To illustrate the XPath-algebra, the XPath query for retrieving all departments with websites on the document of Figure 1, =Projects=Department[:=Web], can be formulated in the XPath-algebra with the expression Projects ¦ # ¦ Department[# ¦ Web] 4 Since we are concerned in this study with reasoning on a fixed document of which the height is known, it is not necessary to introduce ancestor or descendant axes.

"(D) = f(m; m) j m 2 V g ;(D) = ; # (D) = Ed " (D) = Ed¡1 `(D) = f(m; m) j m 2 V and ¸(m) = `g E1 [ E2(D) = E1(D) [ E2(D) E1 \ E2(D) = E1(D) \ E2(D) E1 ¡ E2(D) = E1(D) ¡ E2(D) E1 ¦ E2(D) = f(m; n) j 9w : (m; w) 2 E1(D)

& (w; n) 2 E2(D)g E1[E2](D) = f(m; n) 2 E1(D)

j 9w : (n; w) 2 E2(D)g: evaluated “locally” at the root node n0. As further illustrations of XPath-algebra expressions, consider: E1. “Retrieve all department names in projects.”

Projects ¦ # ¦ Department ¦ # ¦ Name

E2. “Retrieve the leaders of projects that have websites.”

Department ¦ # ¦ Project[# ¦ Web] ¦ # ¦ Lead

E3. “Retrieve the leaders of projects that do not have websites.”

Department ¦ # ¦ Project ¦ # ¦ Lead ¡ E2 E4. “Retrieve the names of all departments that have a website and a project with a website.”

Projects ¦ # ¦ Department[# ¦ Web][# ¦ Project ¦ # ¦ Web] ¦ # ¦ Name E5. “Retrieve all projects which are sub-projects of projects with a website.”

Project[" ¦ Project ¦ # ¦ Web]

Recall that in the “global” semantics of the XPath-algebra, expressions evaluate to pairs of satisfying nodes. For example, E1(D) = f(n0; n3); (n0; n7)g and E5(D) = f(n17; n17); (n18; n18)g. 2.2

A(k) Indexes Kaushik et al. have proposed the family of A(k)-indexes for graph-structured data, based on a notion of node bisimilarity [12]. An A(k)-index is built on a partitioning of the nodes of a semi-structured document. The definition of the equivalence relation ´A(k), specialized to XML documents, which induces this partition is as follows. Definition 4. Let D = (V; Ed; r; ¸) be a document, n1; n2 2 V , and k 2 N. We say n1 and n2 are A(k)-equivalent in D, denoted n1 ´A(k) n2, if 1. ¸(n1) = ¸(n2); and 2. if k ¸ 1 then (a) n1 has a parent in D if and only if n2 has a parent in D; and (b) if n1 has parent p1 and n2 has parent p2, then p1 ´A(k¡1) p2. n1 ´A(k+1) n2, then n1 ´A(k) n2.

A(k) equivalence defines partition blocks on the nodes n 2 V of a document D = (V; Ed; r; ¸),

[n]k = fn0 j n0 2 V & n ´A(k) n0g and the full A(k)-index is built directly on the collection of these partition blocks, D=A(k) = f[n]k j n 2 V g: We refer to this partition as simply the A(k)-index.

Following Kaushik et al. [12], a full A(k)-index can be visualized as a graph wherein each node represents a partition block and an edge exists from a node A to a node B if an edge existed in the original document from a data node in A to a data node in B. This construction is illustrated in Figure 2 on the Design Department subtree of the document of Figure 1.

We observe the following basic property of A(k) equivalence, which follows by a simple induction on k.

Proposition 1. Let D = (V; Ed; r; ¸) be a document, let k 2 N, and let n1, n2 2 V . If

At first sight, we have defined an infinite class of equivalence relations on D. However, there is no point to consider A(k) equivalence beyond k = height(D), as can also be seen by an inductive argument.

Proposition 2. Let D = (V; Ed; r; ¸) be a document, and let k ¸ height(D). Then A(k) equivalence coincides with A(height(D)) equivalence.

We now extend A(k) equivalence as follows to capture upward paths to the root in a document.

Definition 5. Let D = (V; Ed; r; ¸) be a document and n1; n2 2 V . We say n1 and n2 are A( 1 )-equivalent in D, denoted n1 ´A( 1 ) n2, if 1. ¸(n1) = ¸(n2); and 2. n1 has a parent in D if and only if n2 has a parent in D; and 3. if n1 has parent p1 and n2 has parent p2, then p1 ´A( 1 ) p2.

Not surprisingly, we have the following.

Proposition 3. Let D = (V; Ed; r; ¸) be a document. Then A( 1 ) equivalence coincides with A(height(D)) equivalence.

Note that A( 1 ) equivalence on nodes induces the 1-index proposed by Milo and Suciu [14] and the strong DataGuide of Goldman and Widom [7]. 2.3 P (k) Indexes We next generalize A(k) equivalence (Definition 4) to pairs of nodes.

Definition 6. Let D = (V; Ed; r; ¸) be a document, k 2 N, let m1, n1, m2, n2 2 V such that m1 is an ancestor of n1 and m2 is an ancestor of n2. Then, (n1; m1) and (n2; m2) are P (k) equivalent, denoted (n1; m1) ´P (k) (n2; m2), if 1. distance(n1; m1) = distance(n2; m2); and 2. n1 ´A(k) n2.

Note that P (k) equivalence on pairs of nodes induces a “localized” version of the 2index proposed by Milo and Suciu [14], just as A(k) equivalence on nodes induces a “localized” version of the 1-index [12].

Consider the subtree D0 in the document of Figure 1 rooted at n4. As an illustration of the P (k)-index (i.e., the partition set D=P (k) induced by P (k) equivalence on a document D) the P (0)-index of D0 is:

D0=P (0) = f[(n10; n10); (n19; n19)]; [(n10; n4); (n19; n9)]; [(n19; n4)]; [(n4; n4); (n9; n9)]; [(n9; n4)]; [(n11; n11); (n20; n20)]; [(n11; n4); (n20; n9)]; [(n20; n4)]g

Properties 1 and 2 of A(k) equivalence can easily be bootstrapped to properties of P (k) equivalence.

Proposition 4. Let D = (V; Ed; r; ¸) be a document, let k 2 N and let m1, n1, m2, n2 2 V such that m1 is an ancestor of n1 and m2 is an ancestor of n2. If (n1; m1) ´P (k+1) (n2; m2) then (n1; m1) ´P (k) (n2; m2) Proposition 5. Let D = (V; Ed; r; ¸) be a document, and let k ¸ height(D). Then P (k) equivalence coincides with P (height(D)) equivalence.

As with A(k) equivalence, we extend P (k) as follows to capture upward paths to the root in a document.

Definition 7. Let D = (V; Ed; r; ¸) be a document and let m1, n1, m2, n2 2 V such that m1 is an ancestor of n1 and m2 is an ancestor of n2. Then, (n1; m1) and (n2; m2) are P ( 1 ) equivalent, denoted (n1; m1) ´P( 1 ) (n2; m2), if 1. distance(n1; m1) = distance(n2; m2); and 2. n1 ´A( 1 ) n2.

Proposition 3 for A( 1 ) equivalence can be bootstrapped to P ( 1 ) equivalence in a straightforward manner.

Proposition 6. Let D = (V; Ed; r; ¸) be a document. Then P ( 1 ) equivalence coincides with P (height(D)) equivalence.

Note that P ( 1 ) equivalence induces the 2-index proposed by Milo and Suciu [14].

As a closing remark, we observe from the definitions of A(k) and P (k) indices that, given an A(k) index on a document D, it is straightforward to derive the corresponding P (k) index for D, and vice versa.

Algebraic Characterizations of the A(k) and P (k) Indices In this section, we give characterizations of the A(k) and P (k) family of indices based on notions of indistinguishability of nodes and pairs of nodes in fragments of the XPathalgebra. The notion of language indistinguishability on nodes is made precise in the following definition, following Gyssens et al. [8].

Definition 8. Let D = (V; Ed; r; ¸) be a document, m1; m2 2 V , and F a fragment of the XPath-algebra. We say m1 and m2 are indistinguishable by F , denoted m1 ´F m2, if for any expression E in F , it is the case that E(D)(m1) = ; if and only if E(D)(m2) = ;.

Language indistinguishability on pairs of nodes is defined as follows.

Definition 9. Let D = (V; Ed; r; ¸) be a document, n1; m1; n2; m2 2 V , and F a fragment of the XPath-algebra. We say (n1; m1) and (n2; m2) are indistinguishable by F , denoted (n1; m1) ´F (n2; m2), if for any expression E in F , it is the case that (n1; m1) 2 E(D) if and only if (n2; m2) 2 E(D).

Next, we introduce a family of algebras which we use to precisely characterize the A(k) and P (k) indices.

Definition 10. We recursively define the upward-k XPath algebras, U (k) for each k 2 N, as follows. 1. U (0) = the set of XPath-algebra expressions without occurrences of the “#” and “"” operators. 2. For k ¸ 1, (a) if E 2 U (k ¡ 1), then E 2 U (k); (b) " 2 U (k); (c) if E1 2 U (k) and E2 2 U (k), then

– E1 ? E2 2 U (k), for ? = [; \; ¡; (d) if E1 2 U (k1) and E2 2 U (k2), and k1 + k2 · k, then

– E1 ¦ E2 2 U (k) and E1[E2] 2 U (k); and 3. No other expressions are in U (k).

As a very simple example of U (k) expressions, note that " ¦ " ¦ " is in U ( 3 ) but not in U ( 2 ).

The following useful observation about the U (k) algebras can be shown by a simple inductive argument.

Proposition 7. Let D = (V; Ed; r; ¸) be a document, k 2 N, m; n 2 V , and E 2 U (k). If (n; m) 2 E(D), then m is an ancestor of n such that distance(n; m) · k.

The U (k) algebras are extended as follows to an algebra with unrestricted use of the “"” primitive.

Definition 11. The upward XPath algebra U ( 1 ) is the set of all XPath-algebra expressions without occurrences of the # primitive.

In other words, U ( 1 ) = Sk2N U (k).

Compositions in which the number of nested “"” primitives is strictly greater than height(D) must return the empty set, and are therefore not very meaningful. The following result should therefore not come as a surprise.

Proposition 8. The U ( 1 ) and U (height(D)) algebras are equivalent in expressive power.

In particular, indistinguishability by U (height(D)) and indistinguishability by U ( 1 ) coincide. 3.1

Structural Characterizations of U (k) and U ( 1 ) Indistinguishability In this section, we provide precise structural characterizations of indistinguishability in the upward algebras. In particular, for each k > 0, we show that on a fixed document, indistinguishability of nodes by the U (k) algebra corresponds precisely with A(k) equivalence of nodes, and indistinguishability of node pairs by the U (k) algebra corresponds precisely with P (k) equivalence. We also extend these results to the full upward algebra U ( 1 ). We follow the methodology of Gyssens et al. [8] to establish these correspondences.

The following facts are crucial in the sequel.

Lemma 1. Let D = (V; Ed; r; ¸) be a document, k 2 N, and n1, m1, n2 2 V such that m1 is an ancestor of n1 and distance(n1; m1) · k. If n1 ´A(k) n2, then there exists m2 2 V such that m2 is an ancestor of n2 and (n1; m1) ´P (distance(n1;m1)) (n2; m2).5 Furthermore, m1 ´A(k¡distance(n1;m1)) m2.

Proof. By induction on k. For the base case, k = 0, clearly m1 = n1 and ¸(n1) = ¸(n2). The statement holds for m2 = n2.

For k ¸ 1, we can assume that the statement holds for k¡1. If n1 ´A(k) n2, then either ( 1 ) both n1 and n2 have no parents, or ( 2 ) they both have parents p1 and p2, respectively, such that p1 ´A(k¡1) p2 (by definition of A(k) equivalence). In case ( 1 ), clearly m1 = n1 and the statement holds for m2 = n2. In case ( 2 ), distance(p1; m1) · k ¡1, and by the definition of A(k) equivalence, p1 ´A(k¡1) p2. By the induction hypothesis, there exists an ancestor m2 of p2 such that (p1; m1) ´P (distance(p1;m2)) (p2; m2) and m1 ´A(k¡1¡distance(p1;m1)) m2. It readily follows that (n1; m1) ´P (distance(n1;m1)) (n2; m2) and m1 ´A(k¡distance(n1;m1)) m2.

Proposition 9. Let D = (V; Ed; r; ¸) be a document, k 2 N, E 2 U (k), and n1; m1; n2; m2 2 V such that m1 is an ancestor of n1 and m2 is an ancestor of n2, and (n1; m1) ´P (k) (n2; m2). If (n1; m1) 2 E(D), then (n2; m2) 2 E(D), and vice versa.

Proof. First observe that it follows from E 2 U (k) and (n1; m1) 2 E(D) that distance(n1; m1) · k, by Proposition 7. 5 And even stronger, (n1; m1) ´P (k) (n2; m2).

The proof is by induction on k. The base case, k = 0, follows straightforwardly from the definition of P (0) equivalence and a simple structural induction on expressions in U (0). Now assume that k ¸ 1, and that the statement holds for 0; 1; 2; : : : ; k ¡ 1. The proof goes by structural induction on expressions in U (k). Thus, let E 2 U (k). – E 2 U (k ¡ 1). The statement holds by the induction hypothesis. – E =". If (n1; m1) 2" (D), then m1 is the parent of n1. Since (n1; m1) ´P (k) (n2; m2), it follows in particular that m2 is the parent of n2. We conclude that (n2; m2) 2" (D). – E = E1 [ E2, for E1 and E2 2 U (k). Suppose (n1; m1) 2 E(D). Then (n1; m1) 2 E1(D) or (n1; m1) 2 E2(D). Without loss of generality, assume (n1; m1) 2 E1(D). Then by structural induction, (n2; m2) 2 E1(D), and we conclude (n2; m2) 2 E(D). – E = E1 \ E2 or E = E1 ¡ E2, for E1 and E2 2 U (k). Similar to the previous case. – E = E1 ¦ E2, for E1 2 U (k1) and E2 2 U (k2), such that k1 + k2 · k. Suppose (n1; m1) 2 E(D). Then there exists a node w1 2 V such that (n1; w1) 2 E1(D) and (w1; m1) 2 E2(D). By Lemma 7, distance(n1; w1) · k1 and distance(w1; m1) · k2. By Lemma 1, there exists a node w2 2 V such that (n1; w1) ´P (distance(n1;w1)) (n2; w2), and w1 ´A(k¡distance(n1;w1)) w2. Since, k2 · k ¡ distance(n1; w1), by Lemma 1, a node m0 2 V exists with (w1; m1) ´P (distance(w1;m1)) (w2; m0). Since (n1; w1) ´P (distance(n1;w1)) (n2; w2), and (w1; m1) ´P (distance(w1;m1)) (w2; m0), we know, by the definition of ´P (k1) and ´P (k2), that distance(n2; w2) = distance(n1; w1) and distance(w2; m0) = distance(w1; m1). Consequently, distance(n2; m0) = distance(n1; m1), and since m0 is the unique ancestor at this distance, we conclude that m0 = m2. Thus (w1; m1) ´P (k2) (w2; m2). By the induction hypothesis, we can conclude that (n2; w2) 2 E1(D) and (w2; m) 2 E2(D) and thus (n2; m2) 2 E(D). – E = E1[E2], for E1 2 U (k1) and E2 2 U (k2), such that k1 + k2 · k. Similar to the previous case.

An important corollary of Proposition 9 is the following observation.

Proposition 10. Let E 2 U (k), k 2 N, and let D = (V; Ed; r; ¸) be a document. Then E(D) is the union of some partition blocks of the P (k)-index.

We are now prepared to establish the main results of this section.

Theorem 1. Let D = (V; Ed; r; ¸) be a document, n1; n2 2 V , and k 2 N. Then n1 ´U (k) n2 if and only if n1 ´A(k) n2.

Proof. (If) Suppose n1 ´A(k) n2 and let E 2 U (k) such that E(D)(n1) 6= ;. Hence, there exists m1 2 V such that (n1; m1) 2 E(D). Then, by Lemma 1 and Proposition 9, there exists m2 2 V such that (n2; m2) 2 E(D), and therefore E(D)(n2) 6= ;. It follows symmetrically that if E(D)(n2) 6= ;, then E(D)(n1) 6= ;. We conclude that n1 ´U (k) n2.

(Only if) For the converse, assume that n1 ´U (k) n2. We first establish two facts.

Following a similar argument, we have the following characterization of U (k) indistinguishability on node pairs.

Theorem 2. Let D = (V; Ed; r; ¸) be a document, n1; m1; n2; m2 2 V , and k 2 N. Then (n1; m1) ´U (k) (n2; m2) if and only if (n1; m1) ´P (k) (n2; m2).

We remind the reader that indistinguishability by U ( 1 ) and indistinguishability by U (height(D)) coincide (Proposition 8). Furthermore, by Proposition 3, A( 1 ) equivalence coincides with A(height(D)) equivalence. Therefore, the result below immediately follows.

Theorem 3. Let D = (V; Ed; r; ¸) be a document and n1; n2 2 V . Then n1 ´U ( 1 ) n2 if and only if n1 ´A( 1 ) n2.

Similarly, by Proposition 6, P ( 1 ) equivalence coincides with P (height(D)) equivalence. Therefore, the result below immediately follows.

Theorem 4. Let D = (V; Ed; r; ¸) be a document and n1, m1, n1, n2 2 V . Then (n1; m1) ´U ( 1 ) (n2; m2) if and only if (n1; m1) ´P( 1 ) (n2; m2). 4

XPath Query Evaluation with A(k) and P (k) Indices In this section, we consider strategies to evaluate XPath expressions, leveraging the results and characterizations of the A(k) and P (k) indices (Sections 3). First, we discuss how A(k) and P (k) indices can be used in the evaluation of upward XPath expressions, as expressed in the U ( 1 ) and U (k) algebras. Then, we focus on the evaluation of XPath algebra query expressions that can be expressed in certain downward XPath algebras – namely, the downward algebra D ( 1 ) and the downward-k-algebras D (k) (for each k 2 N). These algebras are defined just like the upward algebras except that one is restricted to using # operations instead of " operations. Our analysis of the A(k) and P (k) indices shows that they are the ideal index structures for evaluation of U (k) expressions. Indeed, consider an expression E 2 U (k), a document D, and its P (k)-index. Proposition 10 states that the evaluation of E on D, E(D), can be evaluated as the union of certain P (k) partition blocks. Now, since A(k) and P (k) indices are inter-derivable, it is clear that an A(k)-index on D can also be used effectively in the evaluation of E(D).

Actually, any U ( 1 ) expression can also make effective use of an existing A(k) or P (k) index. For example, consider the expression E =" ¦ " ¦ " ¦ " and suppose that we have only the P ( 2 )-index available. Since E 2 U ( 4 ), in general E(D) will not necessarily be a union of some of the partition blocks of P ( 2 ). However, if we decompose E as E1 ¦ E2, where E1 =" ¦ " and E2 =" ¦ ", then E(D) = E1(D) 1 E2(D), and since E1 and E2 are in U ( 2 ), they can obviously be evaluated directly with the existing index as discussed above. As shown in XPath use cases and various XML benchmarks, over ninety percent of XPath queries used in real world applications use navigation just along the parent-child axis, rather than mixing the parent-child and child-parent axes. These queries can be naturally expressed in the D ( 1 ) and D (k) algebras.

A natural way to evaluate D ( 1 ) expressions is to perform top-down navigation. We demonstrate that such expressions can also be evaluated using path decomposition and A(k) and P (k) indices.

The main idea behind turning top-down evaluation into bottom-up evaluation is to convert a downward expression, or certain of its sub-expressions, into an upward expression using a technique that we will refer to as “inverting expressions.” We first illustrate this technique on predicate-free downward expressions, and then consider general downward expressions.

Downward Expressions without Predicates In general, predicate-free expressions in the downward algebras can be “inverted” into predication-free expressions in the corresponding upward algebras using the rewrite rules shown in Table 3.

As discussed above, an upward algebraic expression in U (k) can be evaluated by P (k)-index lookup and set unions. Therefore, a predicate-free downward algebraic expression in D (k) can be evaluated by inversion, P (k)-index lookup, and set unions. Downward Expressions with Predicates Now consider the evaluation of downward algebra expressions wherein predicate operations occur. A simple example is the expression E2 =# [#]: Applied to a document, E2 evaluates to the document’s parentchild pairs for children that have at least one child itself. As in Section 4.2, to evaluate E2 on a document D, we could consider the concept of inverting E2 into an expression E2¡1 2 U ( 1 ) such that E2(D) = (E2¡1(D))¡1. Unfortunately, this approach does not work here. In fact, we can construct a document D2 such for each expression F 2 U ( 1 ), E2(D2) 6= F (D)¡1.

Clearly E2 is equivalent to the XPath-algebra expression # ¦ # ¦ ". Notice that this expression is neither a downward nor an upward expression. However its subexpression G1 =# ¦ # is in D ( 2 ) and its sub-expression G2 =" is in U ( 1 ). Applying the inversion technique described in Section 4.2 to G1, the evaluation E2(D) on a document D, can be accomplished by computing the relation (G1¡1(D))¡1 ./ G2(D). And, as indicated in this section, the evaluations of G1¡1(D) =" ¦ " (D) and G2(D) =" (D) can be done efficiently using the P ( 2 ) and P ( 1 ) indices of D, respectively.

Given that the selectivity of a longer path is no larger than that of short sub-paths of the path, evaluating G1 reduces the search space to the minimum that can be obtained on such a chain expression. Starting from any given node, upward navigation in an XML data tree, unlike downward ones, has one and only one route to follow, which is to reach its parent. Therefore, it is reasonable to claim that the result of G1¡1(D) is substantially smaller than that of G2(D), and the ./ operation can be further optimized by G1¡1(D) followed by an upward navigation.

We will now consider a slightly more complicated downward expression E3 2 D ( 3 ) which retrieves information about leaders of projects that have a web site:

E3 = Department ¦ # ¦ Project[# ¦ Web] ¦ # ¦ Lead: E3 can be represented as an expression pattern tree, as illustrated in Figure 3(a). The shaded node can be interpreted as the “answer” of E3.

Lead (a)

Department

First, assume there exists a P ( 2 )-index (i.e., where k = 2, the height of the expression). To take full advantage of the P ( 2 )-index is to find the longest sub-expressions that can be inverted and evaluated directly on it. As shown in Figure 3(b), there are two natural chains of length 2, G1 and G2, in the pattern tree of E3. There are also natural chains of length 1, G3, G4, and G5, as shown in Figure 3(c).

Using G1, G2, and G4, the expression E3 is equivalent to the expression H1 defined as follows:

H1 = ((G1 ¦ ") \ (G2 ¦ ")) ¦ G4; and therefore, for a document D, E3(D) can be computed as follows: All sub-expressions in this transformed expression of E3 are in U ( 2 ), and hence can be be directly evaluated using the P ( 2 ) partition blocks without any navigation.

However, it may not always be the case that P (k)-partitions exist for k larger than or equal to the height of the query expression. Again, for E3, consider the case when only a P ( 1 )-index is available. In this case, the longest path expressions that can take advantage of the index are of length 1. Such expressions are G3, G4 and G5. Using these subexpressions, E3 is equivalent with the expression H2 defined as follows:

H2 = (((G3 ¦ G4) ¦ ") \ ((G3 ¦ G5) ¦ ")) ¦ G4:

Similar to the generalization from E2 to E3, the decomposition + inversion techniques can be used to transform any arbitrary expression in the downward algebra D ( 1 ) to a set of predicate-free sub-expressions in D ( 1 ) that can be inverted to corresponding expressions in the upward algebra U ( 1 ), which in turn can be evaluated directly by P (k)-index lookups and set unions. The decomposition is partially determined by k, the degree of local similarity in the P (k)-indices which are available. Furthermore, this algebraic transformation only provides guidelines for evaluating each such expression. The optimal physical plan to evaluate an XPath query depends on other factors, such as the cardinality of the intermediate results of each sub-expression, and the cost model of the join operations. 5

Future Directions It is clear that many aspects of the query evaluation guidelines presented in the previous section warrant further investigation. We close by listing a few of the more interesting directions for research. First, a more thorough study of query decomposition and inversion algorithms is crucial. Second, it is interesting to develop approaches to ( 1 ) quickly identify which partition blocks correspond to an arbitrary given input expression and document, and ( 2 ) to use this identification to efficiently obtain the full query result. We have already taken steps in this direction, by obtaining “labeling” expressions which have the property that each of these evaluates to a unique partition block. Due to space limitations, we were unable to incorporate these results in the paper. However, in [6] (Section 4), we have identified these labeling expressions for U (()k) and U ( 1 ). Finally, the results of this paper are applicable in workload driven scenarios [9, 15, 17], such as determining which indices to materialize and which to derive, as a function of the characteristics of an observed workload.

[1]

Shurug

Al-Khalifa ,

H. V.

Jagadish , Jignesh M. Patel , Yuqing Wu, Nick Koudas, and Divesh Srivastava . Structural joins: A primitive for efficient XML query pattern matching . In ICDE , 2002 .

[2]

Bray ,

Paoli ,

C.M.

Sperberg-McQueen ,

and Francois

Yergeau . Etensible Markup Language (XML) 1.0 (third edition) - W3C recommendation , 2004 .

[3]

Nicolas

Bruno , Nick Koudas, and Divesh Srivastava:. Holistic twig joins: optimal XML pattern matching . In SIGMOD , 2002 .

[4]

Chamberlin et al. XQuery 1 .0: An

XML

query language , W3C , 2003 .

[5]

Clark and S. DeRose (eds .). XML path language (XPath) version 1.0 . http://www.w3.org/TR/XPATH.

[6] George

H. L.

Fletcher , Dirk Van Gucht, Yuqing Wu , Marc

Gyssens , and Jan

Paredaens . An XPath Algebraic Characterization of A(k) and P(k) Indices with Applications to Query Processing . Indiana University Technical Report No. 639 , October 2006 .

[7]

Roy

Goldman and

Jennifer

Widom . Dataguides: Enabling query formulation and optimization in semistructured databases . In VLDB , pages 436 - 445 , 1997 .

[8]

Marc

Gyssens , Jan Paredaens, Dirk Van Gucht, and George

H. L.

Fletcher . Structural Characterizations of the Semantics of XPath as Navigation Tool on a Document . In ACM PODS , pages 318 - 327 , 2006 .

[9]

Hao

He and

Jun

Yang . Multiresolution indexing of XML for frequent queries . In IEEE ICDE , 2004 .

[10] Raghav

Kaushik

, Philip Bohannon,

Jeffrey F.

Naughton ,

and Henry F.

Korth . Covering indexes for branching path queries . In SIGMOD , 2002 .

[11] Raghav

Kaushik

, Rajasekar Krishnamurthy,

Jeffrey F.

Naughton , and

Raghu

Ramakrishnan . On the integration of structure indexes and inverted lists . In ACM SIGMOD , 2004 .

[12] Raghav

Kaushik

, Pradeep Shenoy, Philip Bohannon, and

Ehud

Gudes . Exploiting local similarity for efficient indexing of paths in graph structured data . In IEEE ICDE , 2002 .

[13]

Christoph

Koch . Processing queries on tree-structured data efficiently . In ACM PODS , pages 213 - 224 , 2006 .

[14]

Tova

Milo and

Dan

Suciu . Index structures for path expressions . In ICDT , pages 277 - 295 , 1999 .

[15] Jun-Ki

Min

, Chin-Wan Chung , and Kyuseok Shim . An Adaptive Path Index for XML Data using the Query Workload . Inf. Systems , 30 ( 6 ): 467 - 487 , 2005 .

[16]

Mirella

Moura Moro , Zografoula Vagena, and

Vassilis J.

Tsotras . Tree-pattern queries on a lightweight XML processor . In VLDB , 2005 .

[17] Chen

Qun

Andrew

Lim , and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data . In SIGMOD , 2003 .

[18]

Prakash

Ramanan . Covering indexes for XML queries: Bisimulation - simulation = negation . In VLDB , 2003 .

[19] Kanda

Runapongsa

, Jignesh M. Patel , Rajesh Bordawekar, and Sriram Padmanabhan . XIST: An XML index selection tool . In XSym, pages 219 - 234 , 2004 .

[20] Ke

, Hao He, Ioana Stanoi , and Jun Yang . Incremental maintenence of XML structural indexes . In ACM SIGMOD , pages 491 - 502 , 2004 .

[21] Chun

Zhang

, Jeffrey F. Naughton, David J. DeWitt , Qiong Luo, and Guy

Lohman . On supporting containment queries in relational database management systems . In SIGMOD , 2001 .