Challenges for View-Based Query Answering over Probabilistic XML Bogdan Cautis1 and Evgeny Kharlamov2 1 Institut Télécom; Télécom ParisTech, France cautis@telecom-paristech.fr 2 KRDB Research Centre, Free University of Bozen-Bolzano, Italy kharlamov@inf.unibz.it Abstract. This is the first and preliminary study on answering queries using views in a probabilistic XML setting. We formalize the problem and study it under the two possible semantics for XML query results: with node identifiers and in their absence. Accordingly, we consider rewrite plans that can exploit a single view, by means of compensation, and plans that can use multiple views, by means of intersection. Since in probabilistic settings queries return answers with probabilities, the problem of rewriting goes beyond the classical one of retrieving answers from views. For both semantics of XML queries, we show that, even if the XML answers can be retrieved, the computation of their probabilities might not be possible. We give restrictions that make probabilistic rewriting feasible in polynomial time, and present some initial hardness results for this problem. 1 Introduction Uncertainty is ubiquitous in data and many applications must cope with this [1]: infor- mation extraction from the World Wide Web [2], automatic schema matching in data integration [3] are inherently imprecise. This uncertainty is sometimes represented as the probability that the data is correct, as with conditional random fields [4] in informa- tion extraction, or uncertain schema mappings in information integration [5]. In other cases, only confidence in the information is provided by the system, which can be seen after renormalization as an approximation of the probability. It is thus natural to manip- ulate such probabilistic information in a probabilistic database management system [6]. Recent work has proposed models for probabilistic data, both in the relational [7,8,9] and XML [10,11,12] settings. We focus here on the latter, which is particularly adapted for the Web. A number of studies on probabilistic XML have dealt with query answering for a variety of models and query languages [13,10,11,12,11]. At the same time, query optimization over probabilistic data has received little attention. In particular, the prob- lem of answering queries using views, a key approach for optimization, has received no attention so far in both the relational and the semistructured settings. Probabilistic query evaluation could greatly benefit from such techniques, as it is often the case that computing probabilistic results is harder than in the deterministic setting. Views over XML documents can be seen as fragments of data that may be available along with the nodes selected by the query. Over a p-document, the data fragments come 2 Bogdan Cautis and Evgeny Kharlamov dPER : [1] IT- personnel [2] person [3] person [4] name [5] bonus [6] name [7] bonus [8] Rick [22] pda [24] laptop [41] Mary [51] pda [23] 25 [25] 44 [26] 50 [56] 15 Fig. 1. Example document dPER together with their probability. In the general setting, we are given a document d and a set of view queries v1 , . . . , vn . Given a query q, the goal is to understand whether one can obtain q(d), the answers of q over d, by accessing view results v1 (d), . . . , vn (d) only. For XML data, the problem was studied under the two possible semantics for XML query results: with persistent node identifiers or in their absence. For the lat- ter case, only rewrite plans that rely on a single view, by means of compensation, are possible. For the former, plans using multiple views, by means of intersection and com- pensation, are exploitable. We consider both settings and alternatives for rewritings. We present a preliminary study of the problem of answering queries using views in the probabilistic XML setting. We formalize the problem in Section 3 and we give a preliminary study of it for the two possible settings: in the absence of node Ids, in Section 4, and in their presence, in Section 5. We show that, in the probabilistic setting, the problem of answering queries using views becomes more complex and it does not reduce to its deterministic version. The reason is that query results now involve not only data trees, but also their probabilities. Hence probabilities should also be retrieved from probabilistic view results, by means of a probabilistic function computing them. Even for the simpler setting (without Ids), the existence of the probabilistic func- tion is not guaranteed by the existence of a data-retrieving rewriting. We present exam- ples of views and queries for which such a function does not exist. Based on a notion of probabilistic independence between queries, we also isolate a class of queries and views for which this function, when it exists, can be found and computed efficiently. For rewritings with intersection, we provide a sufficient condition (also based on prob- abilistic independence) that guarantees that the probabilities of query answers can be computed as a product-like formula over the probabilities of the views appearing in the intersection. We also present an NP-hardness result for deciding whether a selection of probabilistically independent views for a rewriting is possible. 2 Preliminaries We briefly define here the data and query model. Details can be found in [12,14]. XML documents. We assume the existence of a countable set of labels L that subsumes both XML tags and values. We consider an XML document as an unranked, unordered rooted tree d modeled by a set of edges edges(d), a set of nodes nodes(d), a distin- guished root node root(d) and a labeling function lbl, assigning to each node a label from L. We assume that each node n ∈ nodes(d) has a unique identifier. Challenges for View-Based Query Answering over Probabilistic XML 3 �PER : [1] IT- personnel P [2] person [3] person [4] name [5] bonus [6] name [7] bonus [20] 0.6 [11] mux mux [41] Mary [51] pda det [21] 0.75 0.25 0.3 0.1 [13] John [22] pda [52] mux [8] Rick [14] pda [16] laptop [24] laptop 0.7 0.3 [15] 25 [17] 44 [23] 25 [25] 44 [26] 50 [54] 10 [56] 15 bPER Fig. 2. Example p-document P Example 1. Consider the document dPER in Figure 1 (where PER stands for person- nel) describing the personnel of an IT department and the bonuses distributed for dif- ferent projects. The document dPER indicates that Rick worked under two projects (laptop and pda) and got bonuses of 44 and 50 in the former project and 25 in the latter one. Identifiers are written inside square brackets and labels are next to them, e.g., the node n4 is labeled name, i.e., lbl(n4 ) = name. We define a finite probability space of XML documents, or px-space for short, as a pair (D, Pr) with D a P set of documents and Pr mapping every document d to a proba- bility Pr(d) such that {Pr(d) | d ∈ D} = 1. Probabilistic documents. p-Documents give a general syntax for compactly represent- ing px-spaces. Like a document, a p-document is a tree but with two kinds of nodes: ordinary nodes, which have labels and are the same as in documents, and distributional, which are used to define the probabilistic process for generating random documents. We consider three kinds of distributional nodes: ind (for independent), mux (for mutually exclusive), and det (for deterministic). Other kinds of distributional nodes are studied in [12], but mux, det alone are enough to represent all px-spaces as p-documents [12]. Definition 2. A p-document P b is an unranked, unordered tree with a set of edges b b edges(P), nodes nodes(P), the root node root(P), b and a labeling function lbl, as- signing to each node v a label from L ∪ {ind(Pr), mux(Pr), det}. If lbl(v) is mux(Prv ) or ind(Prv ), then PrvP assigns to each child v 0 of v a probability Prv (v 0 ), and if lbl(v) = mux(Prv ), then also v0 Prv (v 0 ) ≤ 1. We require leaves and the root to be L-labeled. Example 3. Figure 2 shows a p-document P bPER (where PER stands for personnel) that has mux and det distributional nodes, shown on gray background. Node n52 is a mux node with two children n54 and n56 , where Prn52 (n54 ) = 0.7 and Prn52 (n56 ) = 0.3. A p-document P b has as associated semantics a px-space JPK b defined by the follow- ing random process. Independently for each mux(Prv ) (resp. ind(Prv )) node, we select at most one (resp. some) of its children v 0 and delete all other children along with their descendants. We do not delete any of the children of det nodes. We then remove in turn each distributional node, connecting ordinary children of deleted distributional nodes with their lowest ordinary ancestors. The result of this process is a random document P. The probability of P, Pr(P), is the product of all (i) Prv (v 0 ) for each chosen child 4 Bogdan Cautis and Evgeny Kharlamov qRBON : IT- personnel 1 vBON : IT- personnel 2 vBON : IT- personnel person person person bonus Rick bonus bonus Rick Rick bonus laptop pda laptop pda 1 2 Fig. 3. Example TP query qRBON and two TP views: vRBON and vRBON v 0 of a mux P or ind node v, (ii) 1 − Prv (v 0 ) for each not chosen child v 0 of a ind node v, (iii) 1 − v0 Prv (v ) for all children of each mux node v for which no children were 0 chosen. Note that for any P ∈ JPK b there is a unique way to generate it. Example 4. Looking again at Figures 1 and 2, one can obtain the document dPER from bPER by choosing: the left child of the mux node n11 , the right child of the mux node P n20 , and the right one of child of the mux node n52 . The marginal probability of these choices (and the probability of dPER ), is 0.135 = 0.75 × 0.6 × 0.3. Tree-Pattern queries. The language of tree-pattern queries (TP) is roughly the subset of navigational XPath with child, descendant navigation, predicates, without wildcards. Definition 5. A tree-pattern q is a non-empty, unordered, unranked rooted tree, with a set of nodes nodes(q) labeled with symbols from L, a node called the output node out(q) (i.e., tree-patterns are unary queries), and two types of edges: child edges, labeled by / and descendant edges, labeled by //. The root of q is denoted root(q). Due to space limitations, we will often write tree-patterns q in XPath notation [15], and denote this notation with xpath(q). We use lbl(q) as short notation for lbl(out(q)). We use the following graphical representation for tree-patterns: the main branch is the vertical path starting from the root, the output node is the last node of this path, and predicates are subtrees starting with side branches (see Figure 3). Example 6. Consider the query qRBON in Figure 3 (left) (where RBON stands for Rick’s bonuses) asking whether Rick has received a bonus from the project laptop and a bonus from the project pda. In this representation, single lines denote child edges and 1 2 double lines descendant edges. The other two queries vRBON and vRBON in Figure 3 (center and right) ask whether Rick has received a bonus on either of these projects. The output node in all the three queries is labeled with Rick. The semantics of tree-patterns is given using embeddings. An embedding e of a TP query q into a document d is a function from nodes(q) to nodes(d) satisfying: (i) e(root(q)) = root(d); (ii) for any n ∈ nodes(q), lbl(e(n)) = lbl(n); (iii) for any /-edge (n1 , n2 ) in q, (e(n1 ), e(n2 )) is an edge in d; (iv) for any //-edge (n1 , n2 ) in q, there is a path from e(n1 ) to e(n2 ) in d. The result q(d) of applying a tree-pattern q to a document d is the set: q(d) := {e(out(q)) | e is an embedding of q into d} Example 7. Continuing with the three queries in Figure 3, they all return {n8 } over the document dPER , since Rick has got bonuses from both of the requested projects. Challenges for View-Based Query Answering over Probabilistic XML 5 Intersections of tree-patterns. We consider in this paper the extension TP∩ of TP with respect to intersection, which denotes intersections of tree-pattern queries. TP∩ = {q1 ∩ · · · ∩ qk | k ∈ N, qi ∈ TP, and lbl(root(qi )) = lbl(root(qj )), and lbl(out(qi )) = lbl(out(qj )) for i, j ∈ {1, . . . , k}}. Tk The result of a TP∩ query q1 ∩· · ·∩qk over a document d is the set of nodes i=1 qi (d). Query equivalence and containment. A pattern q1 is contained in a pattern q2 , denoted q1 v q2 , if q1 (d) ⊆ q2 (d) for every d. Also q1 is equivalent to q2 , q1 ≡ q2 , if q1 v q2 and q2 v q1 . We discuss how to check containment of TP∩ queries in Section 5. For TP queries, containment can be decided using containment mappings [16,17] which are similar to embeddings. Intuitively, a containment mapping from q1 to q2 is a function from nodes(q1 ) to nodes(q2 ) that respects the labels of nodes and maps any two nodes connected with /-edges to nodes connected with /-edges, while nodes connected with //-edges can be mapped to any connected nodes. Then for q1 and q2 in TP, q2 v q1 iff there is a containment mapping from q1 to q2 . Note that such a mapping can be computed in polynomial time. For example, observe that qRBON is contained in both 1 2 vRBON and vRBON , while none of the latter two is contained in each other. Querying p-documents. Up to now, we have seen queries as functions over XML docu- ments outputting sets of nodes. Over a p-document P, b a query q (TP or TP∩ ) naturally b and p is the prob- yields a set of node-probability pairs (n, p), where n is a node of P, ability that q can be embedded into a random document P of P b with some e such that e(out(q)) = n; this value will also be written as Pr(n ∈ q(P)). Formally: X b := {(n, p) | ∃ d ∈ JPK: q(P) b n ∈ q(d) and p = b n∈q(d) Pr(d)}. d∈JPK: It is known [11] that TP queries can be evaluated over p-documents P b in time polyno- b ∩ mial in |P|, that is, in data-complexity. The same holds for TP queries. Example 8. Evaluation of qRBON over P bPER returns the node n8 iff the left child of the node n11 and the right child of n20 are chosen. Hence, qRBON (PbPER ) = {(n8 , 0.75 × 0.6)}. Evaluation of vRBON and vRBON over P 1 2 bPER returns {(n8 , 0.75 × (0.1 + 0.6))} and {(n8 , 0.75 × (0.3 + 0.6)}, respectively. Further notations. We introduce now some additional terminology for TP queries, which will be used in Sections 4 and 5, and can be skipped until then. The main branch mb(q) of q is the path from root(q) to out(q), and the main branch nodes of q, mbn(q), are the nodes of mb(q). A prefix qp of q is any tree-pattern that can be build from q by setting root(q) as root(qp ), setting some node n ∈ mbn(q) as out(qp ), and removing from q all mbn(q) nodes below n along with their descendants. A suffix qs of q is any subtree of q rooted at a node of mbn(q). The rank of a main branch node is the distance from it to the root, i.e., the rank of root(q) is 1 and of out(q) is |mbn(q)|. The suffix of q rooted at the node of rank k is denoted q k . For any rank k, cut(q, k) denotes the prefix of q with k main branch nodes. 6 Bogdan Cautis and Evgeny Kharlamov 3 Problem Definition We assume an infinite set of view names V disjoint from the set of labels L. By a view v we denote a tree-pattern query (that defines the view) together with its name v ∈ V. Deterministic view-based rewriting. Let d be a document and v a view. A (determinis- tic) view extension of v over d, denoted dv , is a document obtained by connecting to a root node labeled doc(v) all the documents from the set {d0 | d0 ⊆ d and root(d0 ) ∈ v(d)}. Such a document can be queried by TP-queries of the form doc(v)/lbl(v)/ . . . . If V is a set of views defined over a document d, then DVd = {dv | v ∈ V }. Let q be a query in Q ∈ {TP, TP∩ } that may use doc(v)/lbl(v) for v ∈ V , then unfoldV (q) is the query in Q obtained by replacing in q each doc(v)/lbl(v) with the definition of v. 1 2 Example 9. Two views vRBON and vRBON are in Figure 3. Their extensions over dPER are, respectively, documents (dPER )vRBON 1 and (dPER )vRBON 2 each with two nodes: the 1 root labeled, respectively, with doc(vRBON ) and doc(vRBON 2 ), and with one child n8 labeled Rick in both cases. Let v be vRBON 2 without the node labeled Rick and with the output node person. Then (dPER )v has the root labeled doc(v) to which two subdocu- ments of dPER are connected: a subdocument rooted at n2 and one at n3 . Let d be a document, q a TP-query, V a set of TP-views, and Q ∈ {TP, TP∩ }. In the deterministic setting the problem of query answering using views is to find an alternative query plan qr in Q, called a rewriting, that can be used to answer q. Formally, a Q-rewriting qr of q using V is a query qr ∈ Q such that for every document d it holds that qr (DVd ) = q(d). Clearly, this implies that unfoldV (qr ) ≡ q. The two alternatives, TP-rewritings and TP∩ -rewritings, are respectively motivated by the two possible interpretations of XML query results. In an XML document, nodes have unique Ids used by internal operators (selections, unions, joins, etc.) to manipulate data during query evaluation. Queries can then either (i) output fresh Ids for the nodes of the result, or (ii) expose (preserve) in the result the original Ids from the document. The former case corresponds to what is called the copy semantics, under which the Ids of any document in DVd are disjoint from those of d and from those of any other document in DVd . Since one cannot know if nodes from results of different views are in fact copies of the same node in d, the only possible rewritings are those that access a single document from DVd and maybe navigate inside it. A rewriting qr ∈ TP will thus be of the form doc(v)/lbl(v)[p1 ]/p2 or doc(v)/lbl(v)[p1 ]//p2 , where v ∈ V and the (possibly empty) TP-queries p1 , p2 represent the compensation of v. In the latter case, every document in DVd preserves the original Ids, which will identify nodes across different documents in DVd . One can thus formulate and exploit more complex rewritings, as node Ids can be used to intersect (join by Id) results of different views over the same input data d. TP∩ -rewritings qr extend TP-rewritings in that they can access several DVd documents at once, by first navigating in individual T documents and then intersecting the result. Thus the form of TP∩ -rewritings is i,j uij , where each uij is a TP-rewriting. Probabilistic view-based rewriting. We generalize the definition of view extension to bv is a p-document rooted at a node labeled doc(v) whose con- the probabilistic case: P tents is constructed as follows: (i) plug an unique ind-child below root(P bv ), (ii) for Challenges for View-Based Query Answering over Probabilistic XML 7 each pair (α, β) in the set {(Pb0 , p) | P b0 ⊆ P b and (root(P b0 ), p) ∈ q(P)}, b add α as subtree of this ind-node with the probability β. A set of p-documents DVP for the set of b views V and unfolding of a query over DVP is defined as in the deterministic case. b Example 10. Continuing with Example 9, extensions (P bPER )v1 (PbPER )v2 of RBON RBON b the views over PPER are p-documents with three nodes: the roots are labeled respec- 1 tively doc(vRBON ) and doc(vRBON 2 ), with one child labeled ind, that in turns has one child with the id n8 labeled Rick. The edge between the ind-node and its child is la- beled 0.75 in both cases. The extension (PbPER )v has the root labeled v, with one child labeled ind, and two p-subdocuments of P bPER rooted under the ind node: one is the p-subdocument rooted at n2 and the other one rooted at n3 . Probabilities on the edges to n2 and n3 are 1. Query answering using views in the probabilistic setting is more involved than in b is a set of node-probability pairs. Therefore, rewrite the deterministic one, since q(P) plans should deal with two sub-problems: (i) to find a query in terms of views, that retrieves the nodes N of q(P)b (this corresponds to deterministic rewriting plans) and (ii) to compute the probabilities for the nodes in N , using probabilities from DVP . Both b sub-problems require algorithms accessing p-documents DVP only. More formally: b Definition 11. Let q be a TP query, V be a set of TP views and Q ∈ {TP, TP∩ }. A probabilistic Q-rewriting Qr = (qr , fr ) of q using V is a pair of (i) a deterministic Q-rewriting qr of q using V , and (ii) a probability function fr such that for every p-documents P b and every node n of b P it holds that fr (n, DV ) = Pr(n ∈ q(P)). P b When DVP is clear from the context we will use fr (n) as short notation for fr (n, DVP ). b b For given q and V , a probabilistic rewriting problem is to find Qr . The main chal- lenge in solving this problem is to construct a probability function fr that, by definition, has access only to the p-documents in DVP . In Sections 4 and 5 we respectively show b that this is not always possible for TP and TP∩ -rewritings. 4 TP-rewrite Plans Here we discuss when probabilistic TP-rewrite plans do not exist and present cases for which they do exist and can be computed in polynomial time. We first introduce some auxiliary notation. By dn we denote the subdocument of d rooted at n. We denote the p-subdocument of P b rooted at a node n as P bn . For TP queries q1 and q2 , compensation of q1 with q2 , denoted comp(q1 , q2 ), is a TP-query obtained by deleting the first symbol from xpath(q2 ) and concatenating the rest to xpath(q1 ). For instance, the result of compensating q1 = a/b with q2 = b[c][d]/e is the concatenation of a/b and [c][d]/e, i.e., comp(q1 , q2 ) = a/b[c][d]/e. Intuitively, view’s compensation brings further navigation over the view’s result P bv , and a rewrite plan will be of the form qr = comp(doc(v)/lbl(v), q ) such that its unfolding comp(v, q k ) is equivalent k bv , its unfolding is over P to q. Note that qr is over P b while they yield the same result. We remind the reader the main result for deterministic compensations [18]: 8 Bogdan Cautis and Evgeny Kharlamov �1 : P a �2 : a P �3 : a P c mux b mux b1 0.65 0.3 1 b mux c b2 0.5 mux mux c b3 p1 p2 0.5 2 c b4 1 2 c Fig. 4. p-Documents to show non-existence of TP-rewritings (Examples 14 and 16) Fact 12. Let q and V be TP-queries. Then there exists a deterministic TP-rewriting of q over V if and only if there is v ∈ V and k ∈ N such that comp(v, q k ) ≡ q. This criterion can be verified in polynomial time [18]. Fact 12 says that, using just one view v from V , we can find all the nodes n ∈ q(d) by querying dv with q k , i.e., the data in dv suffices to extract all such n. This naturally extends to the probabilistic setting: we can find all the nodes n ∈ q(P) b by querying P bv with q k , i.e., the data in P bv it suffices to extract all ns. Note that n is in the query result q(P) iff Pr(n ∈ q(P)) > 0. Proposition 13. Let q and v be TP-queries and k ∈ N. Let qr = comp(doc(v)/lbl(v), q k ) b it holds be a deterministic TP-rewriting of q using v. Then for every p-document P Pr(n ∈ q(P)) > 0 if and only if Pr(n ∈ qr (Pv )) > 0. 4.1 Nonexistence of TP-rewrite Plans Is information in Pbv also sufficient to extract the probabilities Pr(n ∈ q(P)) for nodes n ∈ q(P)? It turns out that the answer is negative. There are q and v for which a deter- ministic rewriting qr exists but not the probabilistic one, i.e. the function fr such that for every Pb it holds that fr (n) = Pr(n ∈ q(P)) does not exist. Thus the probabilistic rewriting problem crucially different from the deterministic one. We now present two example that will give insides on this phenomenon. Example 14. Consider the query q = a/b[c] and the view v = a[.//c]/b. We now show that there is no probabilistic rewriting (qr , fr ) for q over {v}. One can see that comp(v, q 2 ) = a[.//c]/b[c] is equivalent to q, hence, qr = comp(doc(v)/lbl(v), q 2 ). Consider now two p-documents P b1 and Pb2 from Figure 4. Clearly, Pr(b ∈ q(P1 )) = 0.65 × 0.5 and Pr(b ∈ q(P2 )) = 0.5, and these probabilities are different. The func- tion fr should compute the first probability 0.325 on a p-document (P b1 )v and 0.5 on b (P2 )v , hence fr should distinguish these p-documents. While one can see that these p-documents are indistinguishable by v: (P b1 )v = (P b2 )v . Hence, fr does not exist. The problem raised in this example comes from the fact that in the unfolding of the rewriting a[.//c]//b[c] the predicate [.//c] coming from the view (i.e. located above the b-labeled node out(q)) and the predicate [c] coming from the compensation, (i.e. lo- cated below out(q)) may interact. Where the interaction is of the following kind: there is a document, e.g., d with the root a that has one child b, which in tern has one child c. Challenges for View-Based Query Answering over Probabilistic XML 9 that satisfies a[.//c]//b[c] but both c-nodes of the query should be mapped to the same c-node of d. In other words, the existence of a match for one predicate depends on the (non-)existence of a match of the other. We now introduce a condition of probabilistic independence that prevents such an interaction. This condition will be further used for both TP and TP∩ -rewritings. b and n ∈ P: TP-queries q1 and q2 are independent, denoted q1 ⊥q2 , if for every P b Pr(n ∈ (q1 ∩ q2 )(P)) = [Pr(n ∈ q1 (P)) × Pr(n ∈ q2 (P))] ÷ Pr(n ∈ P). Clearly, the view a[.//c]//b and the compensation b[c] from Example 14 are proba- bilistically dependent. Deciding probabilistic dependency is tractable: Proposition 15. For TP-queries the independence q1 ⊥q2 is decidable in polynomial time. Observe that probabilistic independence between a view and its compensation in a deterministic rewriting does not guarantee existence of a probabilistic rewriting. Example 16. Consider q = a//b[1]/b[2]/b//c and v = a//b[1]/b[2]/b. Clearly, q 2 is a compensation for v and qr = comp(v, q 2 ) is a deterministic TP-rewriting for q and v. We now present two p-documents P b3 and Pb4 that show the non-existence of a probabil- ity function fr , such that (qr , fr ) is a probabilistic TP-rewriting for q and v. Consider P b3 from Figure 4. where the upper index i on b indicates the i-th occurrence of a node la- b3 . Clearly, (P beled b in P b3 )v is a p-document with the root that has one ind child, under which two p-documents: P bb(3) with probability p2 and P bb(4) with p1 are rooted. Con- b b sider now P4 that is different from P3 in that it has an ind-node instead of the mux-node. Clearly, (Pb3 )v = (P b4 )v . Note that both P bb(3) and Pbb(4) are deterministic documents, b and in both (Pi )v there is no information on whether p1 and p2 are coming from the same distributional node or not, and if they come from the same node, then there is no information on what type of this node is. At the same time, Pr(c ∈ q(P b3 )) = p1 + p2 , while Pr(c ∈ q(P b4 )) = p1 + p2 − p1 × p2 . That is, the probabilities are different and the result depends on the kind of the probabilistic relations in which p1 and p2 are involved, i.e., on whether they are for associated with the children of a mux or ind distributional node. If a probability computation function fr for qr exists then it should be able com- pute the latter two different probabilities from the same views v(P b3 ) and v(Pb4 ), which is impossible since they have no information on the relationship between p1 and p2 . 4.2 Simple Queries and Restricted Compensations We now provide a class of queries and a class of compensated queries for which decid- ing existence of probability rewriting is based on probabilistic independence. A query q is simple if either its main branch has /-edges only, or all the nodes in mbn(q) \ out(q) reachable from (the first occurrence of) a //-edge have no predicates. Clearly, a query is not simple if it is of the form . . . // . . . [. . . ] . . . . The compensation of a view v with c is restricted if either v is simple, or mb(c) has no //-edges. Let n0 be the highest ancestor-or-self node of n occurring in v(P) b and let k = |mb(v)|. We show that under some conditions on qr and v, the probability Pr(n ∈ q(P)) 10 Bogdan Cautis and Evgeny Kharlamov INPUT : TP query q and views V OUTPUT: Set of TP-rewritings R R := ∅, Prefs := {(q1 , vi ) | vi ∈ V, q1 is a lossless prefix of q, mb(q1 ) ≡ mb(vi ), q1 v vi }; for each (q1 , vi ) ∈ Prefs do k := |mb(q1 )| = |mb(vi )| if comp(vi , q k ) ≡ q and restricted then vi0 := comp(mb(vi ), vik ) // vi w/o predicates of nodes at rank 1, . . . , k − 1 vi00 := comp(cut(vi , k − 1), mb(vi )k−1 ) // vi w/o predicates of node at rank k q 0 := comp(cut(mb(q), k), q k ) // q w/o predicates of nodes at rank 1, . . . , k − 1 if vi0 ⊥ vi00 and vi00 ⊥ q 0 then R := R ∪ {comp(doc(v)/lbl(v), q k )} Algorithm 1: TPrewrite for finding TP-rewritings qr for which fr are as in Eq. 1 that a node n occurs in q’s result is the probability Pr(n ∈ qr (Pv )) that n can be found by qr in qr (P bv ), divided by the probability Pr(n0 ∈ v k (Pvn0 )) that n0 verifies the predicates of v found on its output node out(v). The following theorem is the main result of this section, where for k = |mbn(v)| the query v 0 = comp(mb(v), v k ) is v without all predicates of (main branch) nodes of rank in {1, . . . , k − 1}, the query v 00 = comp(cut(v, k − 1), mb(v)k−1 ) is v without all predicates of the node at rank k (i.e., the output node), and q 0 = comp(cut(mb(q), k), q k ) is q without all predicates of main branch nodes of rank in {1, . . . , k − 1}. Theorem 17. Let v be a TP-view, q a TP-query, and k = |mb(v)|. Let qr be a restricted TP-rewriting of q over v. If v 0 ⊥ v 00 and v 00 ⊥ q 0 hold, then for every n ∈ q(P) and its highest ancestor-or-self node n0 ∈ v(P): 0 Pr(n ∈ q(P)) = Pr(n ∈ qr (Pv )) ÷ Pr(n0 ∈ v k (Pvn )). (1) We summarize this section with a polynomial time algorithm TPrewrite (see Al- gorithm 1), that takes an the input a TP query q and a set of views V and returns all possible TP-rewritings qr , for which the probability functions fr are as in Equation 1. In the algorithm we use so-called lossless prefixes: q1 is a lossless prefix of q, if q1 is a tree-pattern obtained from q by setting out(q1 ) as some node of mbn(q). 5 TP∩ -rewrite Plans First we discuss how to decide equivalence between TP and TP∩ queries and then pro- vide a restriction on views for which probability functions fr exist and tractable. 5.1 Equivalence and containment for TP∩ Since Q in TP∩ is a rewriting for q in TP iff unfoldV (Q) ≡ q, deciding whether a TP query q is equivalent to a TP∩ query Q is crucial in our setting. It is known [14] that one Challenges for View-Based Query Answering over Probabilistic XML 11 can rely on mappings to decide the equivalence: if q is first equivalently reformulated into the union of TP queries ∪i qi , called its possible interleavings, and can be expo- nentially large in |Q|. Interleavings capture all the possible ways to order or coalesce the main branch nodes of queries participating in the intersection. Testing q ≡ Q is coNP-hard and boils down to testing q ≡ ∪i qi , which in turn boils down to testing: if for some j: q v qj , and if for all i: qi v q. We conclude: Corollary 18. Deciding existence of a probabilistic TP∩ -rewriting for a TP query q and TP views V views is coNP-hard. The equivalence problem was however shown in [14] to be in PTIME when q be- longs to a restricted fragment of TP, called extended skeletons. As our focus in this paper is on polynomial time algorithms for view-based rewriting, it is thus natural to ask if view-based rewriting over probabilistic data remains tractable when input queries are extended skeletons and expose node Ids, while views are general TP queries. 5.2 Computing Probability Function fr for Mutually Independent TP∩ Views We start with the assumption that a deterministic rewriting qr has been found. Without loss of generality, let us assume that qr consists only of intersected views (no compen- sations of views). Let v1 , . . . , vk be the TP views of qr . Towards building the proba- bility component of the rewrite plan, fr , we give some intuition first: for a given node n ∈ q(P), since each view vi gives a probability, n ∈ vi (P), and since we are inter- ested in the probability of the intersection thereof, we might be tempted to try what is arguably the most intuitive definition for fr , the one which would simply combine by multiplication the probabilities Pr(n ∈ vi (P)). But there are two issues with this straightforward fr candidate. Regarding the first issue, a basic principle in probability theory is that the joint probability of several events equals the product of the individual probabilities only when these events are independent. For instance, for a given node n ∈ P, are Pr(n ∈ a[1](P)] and Pr(n ∈ a[2](P)] independent? The answer is obviously negative, and we can read- ily construct Pb instances over which both values are non-zero but the probability of Pr(n ∈ a[1][2](P)]) = 0. We have introduced the notion of query independence in the previous section, denoted vi ⊥vj , which guarantees that the existence of some embed- ding of vi in a given document does not depend on the existence (or non-existence) of some embedding of vj in this document. We will see now that for pairwise independent views a function fr is based on multiplication of these views’ probabilities. b and, consequently, Regarding the second issue, for each node n that appears in q(P) b b appears in each v1 (P), . . . , vk (P), we have k probability values Pr(n ∈ vi (P)). Fur- thermore, each value Pr(n ∈ vi (P)) can be seen as the product of two distinct terms: (i) the probability of n appearing in a possible world of P, b denoted Pr(n ∈ P), (ii) the probability of n being selected by vi in a possible world in which n is known to appear, denoted in the following Pr(n ∈ vi (P) | n ∈ nodes(P)). Note that the first term is independent of any particular view as it only depends on the document itself We can thus write for each vi and each pair (n, pi ) ∈ vi (P) that pi = Pr(n ∈ P) × Pr(n ∈ vi (P) | n ∈ nodes(P)). 12 Bogdan Cautis and Evgeny Kharlamov Given a deterministic rewriting qr of q formed by pairwise independent views v1 , . . . , vk , for a node n ∈ q(P)), we would thus have as the overall product the following: Y Y Pr(n ∈ vi (P)) = Pr(n ∈ P)k × Pr(n ∈ vi (P) | n ∈ nodes(P)). (2) i i Observe that in Equation 2 we account for the probability Pr(n ∈ P) too many times, once for each view that participates in the rewrite plan. But we should instead account for it only once. Hence, by dividing Equation 2 with Pr(n ∈ P)k−1 , we obtain fr : Y fr (n) := Pr(n ∈ P) × Pr(n ∈ vi (P) | n ∈ nodes(P)). (3) i Each independent view vi gives us Pr(n ∈ vi (P) | n ∈ nodes(P)), while there is now one missing ingredient in Equation 3: Pr(n ∈ P). We can compute this value only if there is a view vi ∈ V subsuming mb(q), i.e., mb(q) v vi . Summing up we conclude: Theorem 19. Let q be a TP-query, V a set of pairwise independent TP-views s.t. there is v ∈ V satisfying mb(q) v v. Let qr be a TP∩ -rewriting of q over V . Then (qr , fr ) with the fr as in Equation 3 is a probabilistic TP∩ -rewriting of q over V . The next theorem shows that for //-free q and V 0 it is hard to decide the existence of V ⊆ V 0 of pairwise independent views (by reduction from k-dimensional perfect matching). Thus, deciding whether the result of Theorem 19 is applicable even for //- free TP queries and views is intractable. This also implies that for extended skeletons it is hard to find TP∩ -rewritings using the probability function as in Equation 3. Theorem 20. Let q and views V be both of TP and without //-edges, deciding whether a TP∩ -rewriting of q using only pairwise independent views from V exists is NP-hard. 6 Conclusion This is the first study on answering queries using views over probabilistic XML. The main challenge of this problem is to find probability-retrieving functions that can access only view results, while able to compute the probabilities of XML answers. So far we studied two cases of TP and TP∩ -rewritings where these functions exist under the assumption of probabilistic independence for queries and views. In the TP case, our setting allows for polynomial time query answering in the size of both data and query. In the TP∩ case, our setting allows for polynomial time query answering in the size of data, while it is intractable in the number of views. Recall that (direct) query answering for probabilistic XML model considered here is also polynomial in data and intractable in query complexity [11]. Moreover, query answering techniques of [11] are based on an expensive dynamic programming approach. At the same time, in our TP∩ setting, probability computation is done by means of the fr function (Equation 3) that requires to compute probabilities that a node n occurs in p-documents. This computation is both conceptually and computationally easier than dynamic programming. It requires to collect the probabilities occurring on the way from the root of a p-document to the Challenges for View-Based Query Answering over Probabilistic XML 13 node n, and to multiply them. Hence, a practical implication of our study is that query answering using TP∩ -views in our restricted setting should be more efficient than direct query evaluation q(P).b As for the TP setting, reasoning over the views is tractable, while view-based query answering may require navigation in a view result P bv , which in b As for the future work, one extension is to practice is often considerably smaller than P. broaden the setting and to understand how one can cope with probabilistic dependences in queries and views. Another extension concerns data: p-documents studied in this paper have local probabilistic dependences, while there are models allowing for more complex probabilistic interactions between remote fragments of data [19]. For these types of XML data, query answering is intractable and we would like to see under which conditions we can gain tractability by relying on views. Acknowledgements. We are grateful to the anonymous reviewers for their comments. The second author is supported by the ERC FP7 grant Webdam (agreement n. 226513). References 1. Kharlamov, E., Nutt, W., Senellart, P.: Value joins are expensive over (probabilistic) XML. In: Proc. LID, Uppsala, Sweden (2011) 2. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of Web information extrac- tion systems. IEEE TKDE 18(10) (2006) 3. Rahm, E., Bernstein, P.: A survey of approaches to automatic schema matching. VLDBJ ’01 4. Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. ICML. (2001) 5. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDBJ 18(2) (2009) 6. Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: Diamonds in the dirt. CACM ’09 7. Widom, J.: Trio: A system for integrated management of data, accuracy, and lineage. In: Proc. CIDR, Online Proceedings (2005) 262–276 8. Dalvi, N., Suciu, D.: The dichotomy of conjunctive queries on probabilistic structures. In: Proc. PODS. (2007) 9. Koch, C.: MayBMS: A system for managing large uncertain and probabilistic databases. In Aggarwal, C., ed.: Managing and Mining Uncertain Data. Springer, New York, NY (2009) 10. Nierman, A., Jagadish, H.V.: ProTDB: Probabilistic data in XML. In: Proc. VLDB. (2002) 11. Kimelfeld, B., Kosharovsky, Y., Sagiv, Y.: Query evaluation over probabilistic XML. VLDBJ 18(5) (2009) 1117–1140 12. Abiteboul, S., Kimelfeld, B., Sagiv, Y., Senellart, P.: On the expressiveness of probabilistic XML models. VLDBJ 18(5) (2009) 1041–1064 13. Abiteboul, S., Chan, T.H.H., Kharlamov, E., Nutt, W., Senellart, P.: Aggregate queries for discrete and continuous probabilistic XML. In: Proc. ICDT. (2010) 14. Cautis, B., Deutsch, A., Onose, N., Vassalos, V.: Querying XML data sources that export very large sets of views. TODS (2011) 15. Benedikt, M., Koch, C.: XPath leashed. ACM Comput. Surv. 41(1) (2008) 16. Amer-Yahia, S., Cho, S., Lakshmanan, L.V.S., Srivastava, D.: Tree pattern query minimiza- tion. VLDBJ 11(4) (2002) 315–331 17. Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. J. ACM ’04 18. Xu, W., Özsoyoglu, Z.: Rewriting XPath queries using materialized views. In: Proc. VLDB. (2005) 121–132 19. Senellart, P., Abiteboul, S.: On the complexity of managing probabilistic XML data. In: Proc. PODS. (2007) 283–292