Challenges for View-Based Query Answering
                   over Probabilistic XML

                        Bogdan Cautis1 and Evgeny Kharlamov2
                        1
                      Institut Télécom; Télécom ParisTech, France
                       cautis@telecom-paristech.fr
             2
               KRDB Research Centre, Free University of Bozen-Bolzano, Italy
                           kharlamov@inf.unibz.it


       Abstract. This is the first and preliminary study on answering queries using
       views in a probabilistic XML setting. We formalize the problem and study it
       under the two possible semantics for XML query results: with node identifiers
       and in their absence. Accordingly, we consider rewrite plans that can exploit a
       single view, by means of compensation, and plans that can use multiple views, by
       means of intersection. Since in probabilistic settings queries return answers with
       probabilities, the problem of rewriting goes beyond the classical one of retrieving
       answers from views. For both semantics of XML queries, we show that, even if
       the XML answers can be retrieved, the computation of their probabilities might
       not be possible. We give restrictions that make probabilistic rewriting feasible in
       polynomial time, and present some initial hardness results for this problem.


1   Introduction
Uncertainty is ubiquitous in data and many applications must cope with this [1]: infor-
mation extraction from the World Wide Web [2], automatic schema matching in data
integration [3] are inherently imprecise. This uncertainty is sometimes represented as
the probability that the data is correct, as with conditional random fields [4] in informa-
tion extraction, or uncertain schema mappings in information integration [5]. In other
cases, only confidence in the information is provided by the system, which can be seen
after renormalization as an approximation of the probability. It is thus natural to manip-
ulate such probabilistic information in a probabilistic database management system [6].
     Recent work has proposed models for probabilistic data, both in the relational [7,8,9]
and XML [10,11,12] settings. We focus here on the latter, which is particularly adapted
for the Web. A number of studies on probabilistic XML have dealt with query answering
for a variety of models and query languages [13,10,11,12,11]. At the same time, query
optimization over probabilistic data has received little attention. In particular, the prob-
lem of answering queries using views, a key approach for optimization, has received
no attention so far in both the relational and the semistructured settings. Probabilistic
query evaluation could greatly benefit from such techniques, as it is often the case that
computing probabilistic results is harder than in the deterministic setting.
     Views over XML documents can be seen as fragments of data that may be available
along with the nodes selected by the query. Over a p-document, the data fragments come
2        Bogdan Cautis and Evgeny Kharlamov

      dPER :                              [1] IT- personnel
                           [2] person                           [3] person

               [4] name            [5] bonus             [6] name            [7] bonus

                [8] Rick    [22] pda     [24] laptop    [41] Mary            [51] pda

                             [23] 25 [25] 44      [26] 50                    [56] 15

                             Fig. 1. Example document dPER

together with their probability. In the general setting, we are given a document d and a
set of view queries v1 , . . . , vn . Given a query q, the goal is to understand whether one
can obtain q(d), the answers of q over d, by accessing view results v1 (d), . . . , vn (d)
only. For XML data, the problem was studied under the two possible semantics for
XML query results: with persistent node identifiers or in their absence. For the lat-
ter case, only rewrite plans that rely on a single view, by means of compensation, are
possible. For the former, plans using multiple views, by means of intersection and com-
pensation, are exploitable. We consider both settings and alternatives for rewritings.
    We present a preliminary study of the problem of answering queries using views
in the probabilistic XML setting. We formalize the problem in Section 3 and we give
a preliminary study of it for the two possible settings: in the absence of node Ids, in
Section 4, and in their presence, in Section 5. We show that, in the probabilistic setting,
the problem of answering queries using views becomes more complex and it does not
reduce to its deterministic version. The reason is that query results now involve not only
data trees, but also their probabilities. Hence probabilities should also be retrieved from
probabilistic view results, by means of a probabilistic function computing them.
    Even for the simpler setting (without Ids), the existence of the probabilistic func-
tion is not guaranteed by the existence of a data-retrieving rewriting. We present exam-
ples of views and queries for which such a function does not exist. Based on a notion
of probabilistic independence between queries, we also isolate a class of queries and
views for which this function, when it exists, can be found and computed efficiently.
For rewritings with intersection, we provide a sufficient condition (also based on prob-
abilistic independence) that guarantees that the probabilities of query answers can be
computed as a product-like formula over the probabilities of the views appearing in the
intersection. We also present an NP-hardness result for deciding whether a selection of
probabilistically independent views for a rewriting is possible.


2   Preliminaries
We briefly define here the data and query model. Details can be found in [12,14].

XML documents. We assume the existence of a countable set of labels L that subsumes
both XML tags and values. We consider an XML document as an unranked, unordered
rooted tree d modeled by a set of edges edges(d), a set of nodes nodes(d), a distin-
guished root node root(d) and a labeling function lbl, assigning to each node a label
from L. We assume that each node n ∈ nodes(d) has a unique identifier.
                     Challenges for View-Based Query Answering over Probabilistic XML                                            3

�PER :                                                                    [1] IT- personnel
P
                                                   [2] person                                           [3] person

         [4] name                                       [5] bonus                                 [6] name        [7] bonus

                                                [20]                0.6
  [11]    mux                                             mux                                    [41] Mary         [51] pda
                                                                            det     [21]
   0.75             0.25               0.3        0.1
              [13] John                                         [22] pda                                   [52]      mux
  [8] Rick                  [14] pda         [16] laptop                            [24] laptop
                                                                                                             0.7           0.3
                            [15] 25           [17] 44           [23] 25           [25] 44     [26] 50      [54] 10      [56] 15

                                                           bPER
                                Fig. 2. Example p-document P

Example 1. Consider the document dPER in Figure 1 (where PER stands for person-
nel) describing the personnel of an IT department and the bonuses distributed for dif-
ferent projects. The document dPER indicates that Rick worked under two projects
(laptop and pda) and got bonuses of 44 and 50 in the former project and 25 in the latter
one. Identifiers are written inside square brackets and labels are next to them, e.g., the
node n4 is labeled name, i.e., lbl(n4 ) = name.

     We define a finite probability space of XML documents, or px-space for short, as a
pair (D, Pr) with D a P set of documents and Pr mapping every document d to a proba-
bility Pr(d) such that {Pr(d) | d ∈ D} = 1.

Probabilistic documents. p-Documents give a general syntax for compactly represent-
ing px-spaces. Like a document, a p-document is a tree but with two kinds of nodes:
ordinary nodes, which have labels and are the same as in documents, and distributional,
which are used to define the probabilistic process for generating random documents. We
consider three kinds of distributional nodes: ind (for independent), mux (for mutually
exclusive), and det (for deterministic). Other kinds of distributional nodes are studied
in [12], but mux, det alone are enough to represent all px-spaces as p-documents [12].

Definition 2. A p-document P   b is an unranked, unordered tree with a set of edges
         b                  b
edges(P), nodes nodes(P), the root node root(P),        b and a labeling function lbl, as-
signing to each node v a label from L ∪ {ind(Pr), mux(Pr), det}. If lbl(v) is mux(Prv )
or ind(Prv ), then PrvP
                      assigns to each child v 0 of v a probability Prv (v 0 ), and if lbl(v) =
mux(Prv ), then also v0 Prv (v 0 ) ≤ 1. We require leaves and the root to be L-labeled.

Example 3. Figure 2 shows a p-document P bPER (where PER stands for personnel) that
has mux and det distributional nodes, shown on gray background. Node n52 is a mux
node with two children n54 and n56 , where Prn52 (n54 ) = 0.7 and Prn52 (n56 ) = 0.3.

    A p-document P  b has as associated semantics a px-space JPK b defined by the follow-
ing random process. Independently for each mux(Prv ) (resp. ind(Prv )) node, we select
at most one (resp. some) of its children v 0 and delete all other children along with their
descendants. We do not delete any of the children of det nodes. We then remove in turn
each distributional node, connecting ordinary children of deleted distributional nodes
with their lowest ordinary ancestors. The result of this process is a random document
P. The probability of P, Pr(P), is the product of all (i) Prv (v 0 ) for each chosen child
4         Bogdan Cautis and Evgeny Kharlamov

qRBON :    IT- personnel            1
                                   vBON :      IT- personnel          2
                                                                     vBON : IT- personnel
                person                               person                       person

       bonus Rick bonus                     bonus Rick                             Rick bonus

       laptop            pda                laptop                                         pda

                                                         1         2
       Fig. 3. Example TP query qRBON and two TP views: vRBON and vRBON

v 0 of a mux
          P or ind node v, (ii) 1 − Prv (v 0 ) for each not chosen child v 0 of a ind node v,
(iii) 1 − v0 Prv (v ) for all children of each mux node v for which no children were
                    0

chosen. Note that for any P ∈ JPK b there is a unique way to generate it.

Example 4. Looking again at Figures 1 and 2, one can obtain the document dPER from
bPER by choosing: the left child of the mux node n11 , the right child of the mux node
P
n20 , and the right one of child of the mux node n52 . The marginal probability of these
choices (and the probability of dPER ), is 0.135 = 0.75 × 0.6 × 0.3.

Tree-Pattern queries. The language of tree-pattern queries (TP) is roughly the subset
of navigational XPath with child, descendant navigation, predicates, without wildcards.
Definition 5. A tree-pattern q is a non-empty, unordered, unranked rooted tree, with a
set of nodes nodes(q) labeled with symbols from L, a node called the output node
out(q) (i.e., tree-patterns are unary queries), and two types of edges: child edges,
labeled by / and descendant edges, labeled by //. The root of q is denoted root(q).
    Due to space limitations, we will often write tree-patterns q in XPath notation [15],
and denote this notation with xpath(q). We use lbl(q) as short notation for lbl(out(q)).
We use the following graphical representation for tree-patterns: the main branch is the
vertical path starting from the root, the output node is the last node of this path, and
predicates are subtrees starting with side branches (see Figure 3).
Example 6. Consider the query qRBON in Figure 3 (left) (where RBON stands for
Rick’s bonuses) asking whether Rick has received a bonus from the project laptop and
a bonus from the project pda. In this representation, single lines denote child edges and
                                                          1            2
double lines descendant edges. The other two queries vRBON        and vRBON   in Figure 3
(center and right) ask whether Rick has received a bonus on either of these projects.
The output node in all the three queries is labeled with Rick.
     The semantics of tree-patterns is given using embeddings. An embedding e of a
TP query q into a document d is a function from nodes(q) to nodes(d) satisfying:
(i) e(root(q)) = root(d); (ii) for any n ∈ nodes(q), lbl(e(n)) = lbl(n); (iii) for
any /-edge (n1 , n2 ) in q, (e(n1 ), e(n2 )) is an edge in d; (iv) for any //-edge (n1 , n2 ) in
q, there is a path from e(n1 ) to e(n2 ) in d.
     The result q(d) of applying a tree-pattern q to a document d is the set:

                    q(d) := {e(out(q)) | e is an embedding of q into d}

Example 7. Continuing with the three queries in Figure 3, they all return {n8 } over the
document dPER , since Rick has got bonuses from both of the requested projects.
               Challenges for View-Based Query Answering over Probabilistic XML               5

Intersections of tree-patterns. We consider in this paper the extension TP∩ of TP with
respect to intersection, which denotes intersections of tree-pattern queries.

  TP∩ = {q1 ∩ · · · ∩ qk | k ∈ N, qi ∈ TP, and lbl(root(qi )) = lbl(root(qj )), and
                                  lbl(out(qi )) = lbl(out(qj )) for i, j ∈ {1, . . . , k}}.
                                                                                Tk
The result of a TP∩ query q1 ∩· · ·∩qk over a document d is the set of nodes       i=1 qi (d).

Query equivalence and containment. A pattern q1 is contained in a pattern q2 , denoted
q1 v q2 , if q1 (d) ⊆ q2 (d) for every d. Also q1 is equivalent to q2 , q1 ≡ q2 , if q1 v q2
and q2 v q1 . We discuss how to check containment of TP∩ queries in Section 5. For
TP queries, containment can be decided using containment mappings [16,17] which are
similar to embeddings. Intuitively, a containment mapping from q1 to q2 is a function
from nodes(q1 ) to nodes(q2 ) that respects the labels of nodes and maps any two
nodes connected with /-edges to nodes connected with /-edges, while nodes connected
with //-edges can be mapped to any connected nodes. Then for q1 and q2 in TP, q2 v q1
iff there is a containment mapping from q1 to q2 . Note that such a mapping can be
computed in polynomial time. For example, observe that qRBON is contained in both
  1            2
vRBON    and vRBON   , while none of the latter two is contained in each other.

Querying p-documents. Up to now, we have seen queries as functions over XML docu-
ments outputting sets of nodes. Over a p-document P,  b a query q (TP or TP∩ ) naturally
                                                                    b and p is the prob-
yields a set of node-probability pairs (n, p), where n is a node of P,
ability that q can be embedded into a random document P of P    b with some e such that
e(out(q)) = n; this value will also be written as Pr(n ∈ q(P)). Formally:
                                                         X
          b := {(n, p) | ∃ d ∈ JPK:
       q(P)                       b n ∈ q(d) and p =            b n∈q(d) Pr(d)}.
                                                             d∈JPK:


It is known [11] that TP queries can be evaluated over p-documents P b in time polyno-
         b                                                     ∩
mial in |P|, that is, in data-complexity. The same holds for TP queries.

Example 8. Evaluation of qRBON over P   bPER returns the node n8 iff the left child of the
node n11 and the right child of n20 are chosen. Hence, qRBON (PbPER ) = {(n8 , 0.75 ×
0.6)}. Evaluation of vRBON and vRBON over P
                      1            2          bPER returns {(n8 , 0.75 × (0.1 + 0.6))}
and {(n8 , 0.75 × (0.3 + 0.6)}, respectively.

Further notations. We introduce now some additional terminology for TP queries,
which will be used in Sections 4 and 5, and can be skipped until then.
    The main branch mb(q) of q is the path from root(q) to out(q), and the main
branch nodes of q, mbn(q), are the nodes of mb(q). A prefix qp of q is any tree-pattern
that can be build from q by setting root(q) as root(qp ), setting some node n ∈
mbn(q) as out(qp ), and removing from q all mbn(q) nodes below n along with their
descendants. A suffix qs of q is any subtree of q rooted at a node of mbn(q). The rank
of a main branch node is the distance from it to the root, i.e., the rank of root(q) is 1
and of out(q) is |mbn(q)|. The suffix of q rooted at the node of rank k is denoted q k .
For any rank k, cut(q, k) denotes the prefix of q with k main branch nodes.
6        Bogdan Cautis and Evgeny Kharlamov

3   Problem Definition
We assume an infinite set of view names V disjoint from the set of labels L. By a view
v we denote a tree-pattern query (that defines the view) together with its name v ∈ V.

Deterministic view-based rewriting. Let d be a document and v a view. A (determinis-
tic) view extension of v over d, denoted dv , is a document obtained by connecting to a
root node labeled doc(v) all the documents from the set {d0 | d0 ⊆ d and root(d0 ) ∈
v(d)}. Such a document can be queried by TP-queries of the form doc(v)/lbl(v)/ . . . .
If V is a set of views defined over a document d, then DVd = {dv | v ∈ V }. Let q be a
query in Q ∈ {TP, TP∩ } that may use doc(v)/lbl(v) for v ∈ V , then unfoldV (q) is the
query in Q obtained by replacing in q each doc(v)/lbl(v) with the definition of v.
                          1           2
Example 9. Two views vRBON      and vRBON  are in Figure 3. Their extensions over dPER
are, respectively, documents (dPER )vRBON
                                      1    and (dPER )vRBON
                                                        2     each with two nodes: the
                                       1
root labeled, respectively, with doc(vRBON  ) and doc(vRBON
                                                        2
                                                              ), and with one child n8
labeled Rick in both cases. Let v be vRBON
                                      2
                                            without the node labeled Rick and with the
output node person. Then (dPER )v has the root labeled doc(v) to which two subdocu-
ments of dPER are connected: a subdocument rooted at n2 and one at n3 .
    Let d be a document, q a TP-query, V a set of TP-views, and Q ∈ {TP, TP∩ }.
In the deterministic setting the problem of query answering using views is to find an
alternative query plan qr in Q, called a rewriting, that can be used to answer q. Formally,
a Q-rewriting qr of q using V is a query qr ∈ Q such that for every document d it holds
that qr (DVd ) = q(d). Clearly, this implies that unfoldV (qr ) ≡ q.
    The two alternatives, TP-rewritings and TP∩ -rewritings, are respectively motivated
by the two possible interpretations of XML query results. In an XML document, nodes
have unique Ids used by internal operators (selections, unions, joins, etc.) to manipulate
data during query evaluation. Queries can then either (i) output fresh Ids for the nodes of
the result, or (ii) expose (preserve) in the result the original Ids from the document. The
former case corresponds to what is called the copy semantics, under which the Ids of any
document in DVd are disjoint from those of d and from those of any other document in
DVd . Since one cannot know if nodes from results of different views are in fact copies of
the same node in d, the only possible rewritings are those that access a single document
from DVd and maybe navigate inside it. A rewriting qr ∈ TP will thus be of the form
doc(v)/lbl(v)[p1 ]/p2 or doc(v)/lbl(v)[p1 ]//p2 , where v ∈ V and the (possibly empty)
TP-queries p1 , p2 represent the compensation of v. In the latter case, every document
in DVd preserves the original Ids, which will identify nodes across different documents
in DVd . One can thus formulate and exploit more complex rewritings, as node Ids can
be used to intersect (join by Id) results of different views over the same input data d.
TP∩ -rewritings qr extend TP-rewritings in that they can access several DVd documents
at once, by first navigating in individual
                                       T       documents and then intersecting the result.
Thus the form of TP∩ -rewritings is i,j uij , where each uij is a TP-rewriting.

Probabilistic view-based rewriting. We generalize the definition of view extension to
                        bv is a p-document rooted at a node labeled doc(v) whose con-
the probabilistic case: P
tents is constructed as follows: (i) plug an unique ind-child below root(P bv ), (ii) for
               Challenges for View-Based Query Answering over Probabilistic XML             7

each pair (α, β) in the set {(Pb0 , p) | P
                                         b0 ⊆ P
                                              b and (root(P  b0 ), p) ∈ q(P)},
                                                                           b add α as
subtree of this ind-node with the probability β. A set of p-documents DVP for the set of
                                                                          b

views V and unfolding of a query over DVP is defined as in the deterministic case.
                                            b


Example 10. Continuing with Example 9, extensions (P     bPER )v1      (PbPER )v2     of
                                                                 RBON            RBON
                b
the views over PPER are p-documents with three nodes: the roots are labeled respec-
            1
tively doc(vRBON  ) and doc(vRBON
                              2
                                    ), with one child labeled ind, that in turns has one
child with the id n8 labeled Rick. The edge between the ind-node and its child is la-
beled 0.75 in both cases. The extension (PbPER )v has the root labeled v, with one child
labeled ind, and two p-subdocuments of P   bPER rooted under the ind node: one is the
p-subdocument rooted at n2 and the other one rooted at n3 . Probabilities on the edges
to n2 and n3 are 1.
     Query answering using views in the probabilistic setting is more involved than in
                                b is a set of node-probability pairs. Therefore, rewrite
the deterministic one, since q(P)
plans should deal with two sub-problems: (i) to find a query in terms of views, that
retrieves the nodes N of q(P)b (this corresponds to deterministic rewriting plans) and
(ii) to compute the probabilities for the nodes in N , using probabilities from DVP . Both
                                                                                  b

sub-problems require algorithms accessing p-documents DVP only. More formally:
                                                              b


Definition 11. Let q be a TP query, V be a set of TP views and Q ∈ {TP, TP∩ }. A
probabilistic Q-rewriting Qr = (qr , fr ) of q using V is a pair of
  (i) a deterministic Q-rewriting qr of q using V , and
 (ii) a probability function fr such that for every p-documents P   b and every node n of
      b
      P it holds that fr (n, DV ) = Pr(n ∈ q(P)).
                              P
                              b


When DVP is clear from the context we will use fr (n) as short notation for fr (n, DVP ).
         b                                                                              b

    For given q and V , a probabilistic rewriting problem is to find Qr . The main chal-
lenge in solving this problem is to construct a probability function fr that, by definition,
has access only to the p-documents in DVP . In Sections 4 and 5 we respectively show
                                           b

that this is not always possible for TP and TP∩ -rewritings.

4   TP-rewrite Plans
Here we discuss when probabilistic TP-rewrite plans do not exist and present cases for
which they do exist and can be computed in polynomial time.
    We first introduce some auxiliary notation. By dn we denote the subdocument of d
rooted at n. We denote the p-subdocument of P    b rooted at a node n as P
                                                                         bn . For TP queries
q1 and q2 , compensation of q1 with q2 , denoted comp(q1 , q2 ), is a TP-query obtained by
deleting the first symbol from xpath(q2 ) and concatenating the rest to xpath(q1 ). For
instance, the result of compensating q1 = a/b with q2 = b[c][d]/e is the concatenation
of a/b and [c][d]/e, i.e., comp(q1 , q2 ) = a/b[c][d]/e. Intuitively, view’s compensation
brings further navigation over the view’s result P   bv , and a rewrite plan will be of the
form qr = comp(doc(v)/lbl(v), q ) such that its unfolding comp(v, q k ) is equivalent
                                      k

                           bv , its unfolding is over P
to q. Note that qr is over P                          b while they yield the same result.
    We remind the reader the main result for deterministic compensations [18]:
8              Bogdan Cautis and Evgeny Kharlamov

�1 :
P          a                                  �2 : a
                                              P                                    �3 : a
                                                                                   P
      c          mux                             b       mux                              b1
               0.65                                     0.3                                         1
                      b                         mux           c                           b2
                                              0.5                                               mux
                 mux                                c                                     b3
                                                                                               p1       p2
                0.5                                                               2
                      c                                                                   b4    1       2

                                                                                          c
    Fig. 4. p-Documents to show non-existence of TP-rewritings (Examples 14 and 16)
Fact 12. Let q and V be TP-queries. Then there exists a deterministic TP-rewriting of
q over V if and only if there is v ∈ V and k ∈ N such that comp(v, q k ) ≡ q.

     This criterion can be verified in polynomial time [18]. Fact 12 says that, using just
one view v from V , we can find all the nodes n ∈ q(d) by querying dv with q k , i.e.,
the data in dv suffices to extract all such n. This naturally extends to the probabilistic
setting: we can find all the nodes n ∈ q(P)   b by querying P  bv with q k , i.e., the data in P
                                                                                               bv
it suffices to extract all ns. Note that n is in the query result q(P) iff Pr(n ∈ q(P)) > 0.

Proposition 13. Let q and v be TP-queries and k ∈ N. Let qr = comp(doc(v)/lbl(v), q k )
                                                                        b it holds
be a deterministic TP-rewriting of q using v. Then for every p-document P

                          Pr(n ∈ q(P)) > 0   if and only if       Pr(n ∈ qr (Pv )) > 0.

4.1       Nonexistence of TP-rewrite Plans

Is information in Pbv also sufficient to extract the probabilities Pr(n ∈ q(P)) for nodes
n ∈ q(P)? It turns out that the answer is negative. There are q and v for which a deter-
ministic rewriting qr exists but not the probabilistic one, i.e. the function fr such that
for every Pb it holds that fr (n) = Pr(n ∈ q(P)) does not exist. Thus the probabilistic
rewriting problem crucially different from the deterministic one. We now present two
example that will give insides on this phenomenon.

Example 14. Consider the query q = a/b[c] and the view v = a[.//c]/b. We now
show that there is no probabilistic rewriting (qr , fr ) for q over {v}. One can see that
comp(v, q 2 ) = a[.//c]/b[c] is equivalent to q, hence, qr = comp(doc(v)/lbl(v), q 2 ).
Consider now two p-documents P    b1 and Pb2 from Figure 4. Clearly, Pr(b ∈ q(P1 )) =
0.65 × 0.5 and Pr(b ∈ q(P2 )) = 0.5, and these probabilities are different. The func-
tion fr should compute the first probability 0.325 on a p-document (P     b1 )v and 0.5 on
  b
(P2 )v , hence fr should distinguish these p-documents. While one can see that these
p-documents are indistinguishable by v: (P b1 )v = (P b2 )v . Hence, fr does not exist.

     The problem raised in this example comes from the fact that in the unfolding of the
rewriting a[.//c]//b[c] the predicate [.//c] coming from the view (i.e. located above the
b-labeled node out(q)) and the predicate [c] coming from the compensation, (i.e. lo-
cated below out(q)) may interact. Where the interaction is of the following kind: there
is a document, e.g., d with the root a that has one child b, which in tern has one child c.
                Challenges for View-Based Query Answering over Probabilistic XML                  9

that satisfies a[.//c]//b[c] but both c-nodes of the query should be mapped to the same
c-node of d. In other words, the existence of a match for one predicate depends on the
(non-)existence of a match of the other. We now introduce a condition of probabilistic
independence that prevents such an interaction. This condition will be further used for
both TP and TP∩ -rewritings.
                                                                        b and n ∈ P:
    TP-queries q1 and q2 are independent, denoted q1 ⊥q2 , if for every P           b

      Pr(n ∈ (q1 ∩ q2 )(P)) = [Pr(n ∈ q1 (P)) × Pr(n ∈ q2 (P))] ÷ Pr(n ∈ P).

    Clearly, the view a[.//c]//b and the compensation b[c] from Example 14 are proba-
bilistically dependent. Deciding probabilistic dependency is tractable:

Proposition 15. For TP-queries the independence q1 ⊥q2 is decidable in polynomial time.

    Observe that probabilistic independence between a view and its compensation in a
deterministic rewriting does not guarantee existence of a probabilistic rewriting.

Example 16. Consider q = a//b[1]/b[2]/b//c and v = a//b[1]/b[2]/b. Clearly, q 2 is a
compensation for v and qr = comp(v, q 2 ) is a deterministic TP-rewriting for q and v.
We now present two p-documents P         b3 and Pb4 that show the non-existence of a probabil-
ity function fr , such that (qr , fr ) is a probabilistic TP-rewriting for q and v. Consider P b3
from Figure 4. where the upper index i on b indicates the i-th occurrence of a node la-
           b3 . Clearly, (P
beled b in P                b3 )v is a p-document with the root that has one ind child, under
which two p-documents: P       bb(3) with probability p2 and P   bb(4) with p1 are rooted. Con-
           b                              b
sider now P4 that is different from P3 in that it has an ind-node instead of the mux-node.
Clearly, (Pb3 )v = (P  b4 )v . Note that both P  bb(3) and Pbb(4) are deterministic documents,
               b
and in both (Pi )v there is no information on whether p1 and p2 are coming from the
same distributional node or not, and if they come from the same node, then there is no
information on what type of this node is. At the same time, Pr(c ∈ q(P          b3 )) = p1 + p2 ,
while Pr(c ∈ q(P   b4 )) = p1 + p2 − p1 × p2 . That is, the probabilities are different and the
result depends on the kind of the probabilistic relations in which p1 and p2 are involved,
i.e., on whether they are for associated with the children of a mux or ind distributional
node. If a probability computation function fr for qr exists then it should be able com-
pute the latter two different probabilities from the same views v(P       b3 ) and v(Pb4 ), which
is impossible since they have no information on the relationship between p1 and p2 .


4.2   Simple Queries and Restricted Compensations

We now provide a class of queries and a class of compensated queries for which decid-
ing existence of probability rewriting is based on probabilistic independence.
    A query q is simple if either its main branch has /-edges only, or all the nodes in
mbn(q) \ out(q) reachable from (the first occurrence of) a //-edge have no predicates.
Clearly, a query is not simple if it is of the form . . . // . . . [. . . ] . . . . The compensation
of a view v with c is restricted if either v is simple, or mb(c) has no //-edges.
    Let n0 be the highest ancestor-or-self node of n occurring in v(P)                b and let k =
|mb(v)|. We show that under some conditions on qr and v, the probability Pr(n ∈ q(P))
10         Bogdan Cautis and Evgeny Kharlamov


      INPUT : TP query q and views V
      OUTPUT: Set of TP-rewritings R
      R := ∅,
      Prefs := {(q1 , vi ) | vi ∈ V, q1 is a lossless prefix of q, mb(q1 ) ≡ mb(vi ), q1 v vi };
      for each (q1 , vi ) ∈ Prefs do
          k := |mb(q1 )| = |mb(vi )|
          if comp(vi , q k ) ≡ q and restricted then
              vi0 := comp(mb(vi ), vik )
                              // vi w/o predicates of nodes at rank 1, . . . , k − 1
              vi00 := comp(cut(vi , k − 1), mb(vi )k−1 )
                                            // vi w/o predicates of node at rank k
              q 0 := comp(cut(mb(q), k), q k )
                               // q w/o predicates of nodes at rank 1, . . . , k − 1
              if vi0 ⊥ vi00 and vi00 ⊥ q 0 then R := R ∪ {comp(doc(v)/lbl(v), q k )}

Algorithm 1: TPrewrite for finding TP-rewritings qr for which fr are as in Eq. 1


that a node n occurs in q’s result is the probability Pr(n ∈ qr (Pv )) that n can be
found by qr in qr (P bv ), divided by the probability Pr(n0 ∈ v k (Pvn0 )) that n0 verifies
the predicates of v found on its output node out(v). The following theorem is the
main result of this section, where for k = |mbn(v)| the query v 0 = comp(mb(v), v k )
is v without all predicates of (main branch) nodes of rank in {1, . . . , k − 1}, the query
v 00 = comp(cut(v, k − 1), mb(v)k−1 ) is v without all predicates of the node at rank k
(i.e., the output node), and q 0 = comp(cut(mb(q), k), q k ) is q without all predicates of
main branch nodes of rank in {1, . . . , k − 1}.
Theorem 17. Let v be a TP-view, q a TP-query, and k = |mb(v)|. Let qr be a restricted
TP-rewriting of q over v. If v 0 ⊥ v 00 and v 00 ⊥ q 0 hold, then for every n ∈ q(P) and its
highest ancestor-or-self node n0 ∈ v(P):
                                                                             0
                   Pr(n ∈ q(P)) = Pr(n ∈ qr (Pv )) ÷ Pr(n0 ∈ v k (Pvn )).                      (1)
    We summarize this section with a polynomial time algorithm TPrewrite (see Al-
gorithm 1), that takes an the input a TP query q and a set of views V and returns all
possible TP-rewritings qr , for which the probability functions fr are as in Equation 1.
In the algorithm we use so-called lossless prefixes: q1 is a lossless prefix of q, if q1 is a
tree-pattern obtained from q by setting out(q1 ) as some node of mbn(q).

5 TP∩ -rewrite Plans
First we discuss how to decide equivalence between TP and TP∩ queries and then pro-
vide a restriction on views for which probability functions fr exist and tractable.

5.1     Equivalence and containment for TP∩
Since Q in TP∩ is a rewriting for q in TP iff unfoldV (Q) ≡ q, deciding whether a TP
query q is equivalent to a TP∩ query Q is crucial in our setting. It is known [14] that one
               Challenges for View-Based Query Answering over Probabilistic XML        11

can rely on mappings to decide the equivalence: if q is first equivalently reformulated
into the union of TP queries ∪i qi , called its possible interleavings, and can be expo-
nentially large in |Q|. Interleavings capture all the possible ways to order or coalesce
the main branch nodes of queries participating in the intersection. Testing q ≡ Q is
coNP-hard and boils down to testing q ≡ ∪i qi , which in turn boils down to testing: if
for some j: q v qj , and if for all i: qi v q. We conclude:
Corollary 18. Deciding existence of a probabilistic TP∩ -rewriting for a TP query q
and TP views V views is coNP-hard.
    The equivalence problem was however shown in [14] to be in PTIME when q be-
longs to a restricted fragment of TP, called extended skeletons. As our focus in this
paper is on polynomial time algorithms for view-based rewriting, it is thus natural to
ask if view-based rewriting over probabilistic data remains tractable when input queries
are extended skeletons and expose node Ids, while views are general TP queries.

5.2   Computing Probability Function fr for Mutually Independent TP∩ Views
We start with the assumption that a deterministic rewriting qr has been found. Without
loss of generality, let us assume that qr consists only of intersected views (no compen-
sations of views). Let v1 , . . . , vk be the TP views of qr . Towards building the proba-
bility component of the rewrite plan, fr , we give some intuition first: for a given node
n ∈ q(P), since each view vi gives a probability, n ∈ vi (P), and since we are inter-
ested in the probability of the intersection thereof, we might be tempted to try what
is arguably the most intuitive definition for fr , the one which would simply combine
by multiplication the probabilities Pr(n ∈ vi (P)). But there are two issues with this
straightforward fr candidate.
      Regarding the first issue, a basic principle in probability theory is that the joint
probability of several events equals the product of the individual probabilities only when
these events are independent. For instance, for a given node n ∈ P, are Pr(n ∈ a[1](P)]
and Pr(n ∈ a[2](P)] independent? The answer is obviously negative, and we can read-
ily construct Pb instances over which both values are non-zero but the probability of
Pr(n ∈ a[1][2](P)]) = 0. We have introduced the notion of query independence in the
previous section, denoted vi ⊥vj , which guarantees that the existence of some embed-
ding of vi in a given document does not depend on the existence (or non-existence) of
some embedding of vj in this document. We will see now that for pairwise independent
views a function fr is based on multiplication of these views’ probabilities.
                                                                     b and, consequently,
      Regarding the second issue, for each node n that appears in q(P)
                      b              b
appears in each v1 (P), . . . , vk (P), we have k probability values Pr(n ∈ vi (P)). Fur-
thermore, each value Pr(n ∈ vi (P)) can be seen as the product of two distinct terms:
   (i) the probability of n appearing in a possible world of P, b denoted Pr(n ∈ P),
  (ii) the probability of n being selected by vi in a possible world in which n is known
       to appear, denoted in the following Pr(n ∈ vi (P) | n ∈ nodes(P)).
Note that the first term is independent of any particular view as it only depends on the
document itself We can thus write for each vi and each pair (n, pi ) ∈ vi (P) that

                 pi = Pr(n ∈ P) × Pr(n ∈ vi (P) | n ∈ nodes(P)).
12          Bogdan Cautis and Evgeny Kharlamov

Given a deterministic rewriting qr of q formed by pairwise independent views v1 , . . . , vk ,
for a node n ∈ q(P)), we would thus have as the overall product the following:
       Y                                    Y
          Pr(n ∈ vi (P)) = Pr(n ∈ P)k ×         Pr(n ∈ vi (P) | n ∈ nodes(P)). (2)
        i                                        i

Observe that in Equation 2 we account for the probability Pr(n ∈ P) too many times,
once for each view that participates in the rewrite plan. But we should instead account
for it only once. Hence, by dividing Equation 2 with Pr(n ∈ P)k−1 , we obtain fr :
                                     Y
              fr (n) := Pr(n ∈ P) ×      Pr(n ∈ vi (P) | n ∈ nodes(P)).             (3)
                                        i

     Each independent view vi gives us Pr(n ∈ vi (P) | n ∈ nodes(P)), while there is
now one missing ingredient in Equation 3: Pr(n ∈ P). We can compute this value only
if there is a view vi ∈ V subsuming mb(q), i.e., mb(q) v vi . Summing up we conclude:

Theorem 19. Let q be a TP-query, V a set of pairwise independent TP-views s.t. there
is v ∈ V satisfying mb(q) v v. Let qr be a TP∩ -rewriting of q over V . Then (qr , fr )
with the fr as in Equation 3 is a probabilistic TP∩ -rewriting of q over V .

    The next theorem shows that for //-free q and V 0 it is hard to decide the existence
of V ⊆ V 0 of pairwise independent views (by reduction from k-dimensional perfect
matching). Thus, deciding whether the result of Theorem 19 is applicable even for //-
free TP queries and views is intractable. This also implies that for extended skeletons it
is hard to find TP∩ -rewritings using the probability function as in Equation 3.

Theorem 20. Let q and views V be both of TP and without //-edges, deciding whether
a TP∩ -rewriting of q using only pairwise independent views from V exists is NP-hard.

6    Conclusion

This is the first study on answering queries using views over probabilistic XML. The
main challenge of this problem is to find probability-retrieving functions that can access
only view results, while able to compute the probabilities of XML answers. So far
we studied two cases of TP and TP∩ -rewritings where these functions exist under the
assumption of probabilistic independence for queries and views. In the TP case, our
setting allows for polynomial time query answering in the size of both data and query.
In the TP∩ case, our setting allows for polynomial time query answering in the size of
data, while it is intractable in the number of views. Recall that (direct) query answering
for probabilistic XML model considered here is also polynomial in data and intractable
in query complexity [11]. Moreover, query answering techniques of [11] are based on
an expensive dynamic programming approach. At the same time, in our TP∩ setting,
probability computation is done by means of the fr function (Equation 3) that requires
to compute probabilities that a node n occurs in p-documents. This computation is
both conceptually and computationally easier than dynamic programming. It requires
to collect the probabilities occurring on the way from the root of a p-document to the
                Challenges for View-Based Query Answering over Probabilistic XML               13

node n, and to multiply them. Hence, a practical implication of our study is that query
answering using TP∩ -views in our restricted setting should be more efficient than direct
query evaluation q(P).b As for the TP setting, reasoning over the views is tractable,
while view-based query answering may require navigation in a view result P  bv , which in
                                            b As for the future work, one extension is to
practice is often considerably smaller than P.
broaden the setting and to understand how one can cope with probabilistic dependences
in queries and views. Another extension concerns data: p-documents studied in this
paper have local probabilistic dependences, while there are models allowing for more
complex probabilistic interactions between remote fragments of data [19]. For these
types of XML data, query answering is intractable and we would like to see under
which conditions we can gain tractability by relying on views.
Acknowledgements. We are grateful to the anonymous reviewers for their comments.
The second author is supported by the ERC FP7 grant Webdam (agreement n. 226513).

References
 1. Kharlamov, E., Nutt, W., Senellart, P.: Value joins are expensive over (probabilistic) XML.
    In: Proc. LID, Uppsala, Sweden (2011)
 2. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of Web information extrac-
    tion systems. IEEE TKDE 18(10) (2006)
 3. Rahm, E., Bernstein, P.: A survey of approaches to automatic schema matching. VLDBJ ’01
 4. Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic models for
    segmenting and labeling sequence data. In: Proc. ICML. (2001)
 5. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDBJ 18(2) (2009)
 6. Dalvi, N., Ré, C., Suciu, D.: Probabilistic databases: Diamonds in the dirt. CACM ’09
 7. Widom, J.: Trio: A system for integrated management of data, accuracy, and lineage. In:
    Proc. CIDR, Online Proceedings (2005) 262–276
 8. Dalvi, N., Suciu, D.: The dichotomy of conjunctive queries on probabilistic structures. In:
    Proc. PODS. (2007)
 9. Koch, C.: MayBMS: A system for managing large uncertain and probabilistic databases. In
    Aggarwal, C., ed.: Managing and Mining Uncertain Data. Springer, New York, NY (2009)
10. Nierman, A., Jagadish, H.V.: ProTDB: Probabilistic data in XML. In: Proc. VLDB. (2002)
11. Kimelfeld, B., Kosharovsky, Y., Sagiv, Y.: Query evaluation over probabilistic XML. VLDBJ
    18(5) (2009) 1117–1140
12. Abiteboul, S., Kimelfeld, B., Sagiv, Y., Senellart, P.: On the expressiveness of probabilistic
    XML models. VLDBJ 18(5) (2009) 1041–1064
13. Abiteboul, S., Chan, T.H.H., Kharlamov, E., Nutt, W., Senellart, P.: Aggregate queries for
    discrete and continuous probabilistic XML. In: Proc. ICDT. (2010)
14. Cautis, B., Deutsch, A., Onose, N., Vassalos, V.: Querying XML data sources that export
    very large sets of views. TODS (2011)
15. Benedikt, M., Koch, C.: XPath leashed. ACM Comput. Surv. 41(1) (2008)
16. Amer-Yahia, S., Cho, S., Lakshmanan, L.V.S., Srivastava, D.: Tree pattern query minimiza-
    tion. VLDBJ 11(4) (2002) 315–331
17. Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. J. ACM ’04
18. Xu, W., Özsoyoglu, Z.: Rewriting XPath queries using materialized views. In: Proc. VLDB.
    (2005) 121–132
19. Senellart, P., Abiteboul, S.: On the complexity of managing probabilistic XML data. In:
    Proc. PODS. (2007) 283–292