The Possibility Problem for Probabilistic XML Antoine Amarilli Télécom ParisTech; Institut Mines-Télécom; CNRS LTCI Abstract. We consider the possibility problem of determining if a document is a possible world of a probabilistic document, in the setting of probabilistic XML. This basic question is a special case of query answering or tree automata evaluation, but it has specific practical uses, such as checking whether an user- provided probabilistic document outcome is possible or sufficiently likely. In this paper, we study the complexity of the possibility problem for probabilistic XML models of varying expressiveness. We show that the decision problem is often tractable in the absence of long-distance dependencies, but that its computa- tion variant is intractable on unordered documents. We also introduce an explicit matches variant to generalize practical situations where node labels are unambigu- ous; this ensures tractability of the possibility problem, even under long-distance dependencies, provided event conjunctions are disallowed. Our results entirely classify the tractability boundary over all considered problem variants. 1 Introduction Probabilistic representations are a way to represent incomplete knowledge through a concise description of a large set of possible worlds annotated with their probability. Such models can then be used, e.g., to run a query efficiently over all possible worlds and determine the overall probability that the query holds. Probabilistic representations have been successfully used both for the relational model [11] and for XML documents [9]. Many problems, such as query answering [8], have been studied over such repre- sentations; however, to our knowledge, the possibility problem (P OSS) has not been specifically studied: given a probabilistic document D and a deterministic document W , decide if W is a possible world of D, and optionally compute its probability according to D. This can be asked both of relational and XML probabilistic representations, but we focus on XML documents1 because they pose many challenges: they are hierarchical so some probabilistic choices appear dependent2 ; documents may be ordered; bag semantics must be used to count multiple sibling nodes with the same label. In addition, in the XML setting, the P OSS problem is a natural question that arises in practical scenarios. As a first example, when using probabilistic XML to represent a set D of possible versions [4] of an XML document, one may want to determine if a version W , obtained from a user or from an external source, is one of the known possible versions represented as a probabilistic XML document D. For instance, assume that a probabilistic XML version control system asks a user to resolve a conflict [3], whose uncertain set of possible outcomes is represented by D. When the user provides a candidate merge W , the system must then check if the document W is indeed a possible way to solve the conflict. This 1 Translations between the probabilistic relational and XML models [2] can be used to translate our results to complexity bounds for the P OSS problem on probabilistic relational databases. 2 In fact, we will see that our hardness results always hold even for shallow documents. may be hard to determine, because D may, in general, have many ways to generate W , through a possibly intractable number of different valuations of its uncertainty events. As a second practical example, assume that a user is editing a probabilistic XML document D. The user notices that choosing a certain valuation of the probabilistic events yields a certain deterministic document W , and asks whether the same document could have been obtained by making different choices. Indeed, maybe W is considered improbable under D following this particular valuation, but is likely overall because the same document can be obtained through different choices. What is the probability, over all valuations, of the user’s chosen outcome W according to D? On the face of it, P OSS seems related to query evaluation: we wish to evaluate on D a query qW which is, informally, “is the input document exactly W ”? However, there are three reasons why query evaluation cannot give good complexity bounds for P OSS. First, because qW depends on the possibly large W , we are not performing query answering for a fixed query, so we can only use the unfavorable combined complexity bounds where both the input document D and the query qW are part of the input. Second, because we want to obtain exactly W , the match of qW should never map two query variables to the same node of D, so the query language must allow inequalities on node identifiers. Third, once again because we require an exact match, we need to assert the absence of the nodes which are not in W , so we need negation in the language. To our knowledge, then, the only upper bound for P OSS from query answering is the combined complexity bound for the (expressive) monadic second-order logic over trees whose evaluation on deterministic (not even probabilistic) XML trees is already PSPACE-hard [10]. A second related approach is that of tree automata on probabilistic XML documents. Indeed, we can encode the possible world W to a deterministic tree automaton AW and compute the probability that AW accepts the probabilistic document D. The decision and computation variants of P OSS under local uncertainty models are thus special cases of the “relevancy” and “p-acceptance” problems of [6]. However, their work only considers ordered trees, and an unordered W cannot easily be translated to their deterministic tree automata, because of possible label ambiguity: we cannot impose an arbitrary order on D and W , as this also chooses how nodes must be disambiguated. In fact, we will show that P OSS is hard in some settings that are tractable for ordered documents. This paper specifically focuses on the P OSS problem to study the precise complexity of its different formulations. Our probabilistic XML representation is the PrXML model of [9], noting that some results are known for the P OSS problem (called the “membership problem”) in the incomparable and substantially different “open-world” incomplete XML model of [5] (whose documents have an infinite set of possible worlds, instead of a possibly exponential but finite set as in PrXML). We start by defining the required preliminaries in Section 2 and the different variants of P OSS in Section 3, establishing its overall NP-completeness and reviewing the results of [6]. We then study local uncertainty models in Section 4 and show that the absence of order impacts tractability, with a different picture for the decision and computation variants of P OSS. Last, in Section 5, we show that P OSS can be made tractable under long- distance event correlations, by disallowing event conjunctions and imposing an “explicit matches” condition which generalizes, e.g., unique node labels. We then conclude in Section 6. For lack of space, all proofs are deferred to the extended version [1]. 2 Preliminaries We start by formally defining XML documents and probability distributions over them: Definition 1. An unordered XML document is an unordered tree whose nodes carry a label from a set Λ of labels. Ordered XML documents are defined in the same way but with ordered trees, that is, there is a total order over the children of every node. A probability distribution is a function P mapping every XML document x from a finite set supp(P) to a rational number P(x), its probability according to P, with the condition that ∑D∈supp(P ) P(D) = 1. For any x ∈ / supp(P) we write P(x) = 0. As it is unwieldy to manipulate explicit probability distributions over large sets of documents, we use the language of probabilistic XML [9] to write extended XML documents (with so-called probabilistic nodes) and give them a semantics which is a (possibly exponentially larger) probability distribution over XML documents. Definition 2. A PrXML probabilistic XML document D is an XML document over Λ t {det, ind, mux, cie, fie}, where every edge from a mux or ind node to a child node is labeled with some rational number3 0 < x < 1 (the sum of the labels of the children of every mux node being ≤ 1), and every edge from a cie (resp. fie) node to a child node is labeled with a conjunction (resp. a Boolean formula) of events from a set E of events (and their negations), with a mapping π : E → [0, 1] attributing a rational probability to every event. The nodes with labels from Λ are called regular nodes, by opposition to probabilistic nodes. We assume that the root of D is a regular node. For any subset L ⊆ {det, ind, mux, cie, fie}, we call PrXMLL the language of proba- bilistic XML documents containing only nodes with labels in Λ t L. The semantics of a PrXML document D is the probability distribution over XML documents defined by the following sampling process (see [9] for more details): Definition 3. A deterministic XML document W is obtained from a PrXML document D as follows. First, choose a valuation ν : E → {t, f} of the events from E, with probability ∏e s.t.ν(e)=t π(e) × ∏e s.t.ν(e)=f (1 − π(e)). Evaluate cie and fie nodes by keeping only the child edges whose Boolean formula is true under ν. Evaluate ind nodes by choosing to keep or delete every child edge according to the probability indicated on its edge label. Evaluate mux nodes by removing all of their children edges, except one chosen according to its probability (possibly keep none if the probabilities sum up to less than 1). Finally, evaluate det nodes by replacing them by the collection of their children. All probabilistic choices are performed independently, so the overall probability of an outcome is the product of the probabilities at each step. Whenever an edge is removed, all of the descendant nodes and edges are removed. The probability of a document W according to D, written D(W ), is the total probability of all outcomes4 leading to W . Of course, the expressiveness and compactness of PrXML frameworks depend on which probabilistic nodes are allowed: we say that PrXMLC is more general than PrXMLD if there is a polynomial time algorithm to rewrite any PrXMLD document to a PrXMLC document representing the same probability distribution. Fig. 1 (adapted from [7]) represents this hierarchy on the PrXML classes that we consider. 3 The non-standard constraint x < 1 means that ind does not subsume det (see Thm. 3 and 4). 4 Note that in general there may be multiple outcomes that lead to the same document W . Problem Complexity > fie P OSS > fie NP (Prop. 1) < 6< cie #P OSS > fie FP#P (Prop. 1) #P OSS < mux, ind, det PTIME (Thm. 1) #P OSS 6< ind or mux #P-hard (Thm. 2) ⊥ mux, ind, det mie P OSS > ind or mux PTIME (Thm. 3) #P OSS ind, det mux, ind mux, det P OSS 6< 2 of mux, ind, det NP-hard (Thm. 4) #EP OSS > mux, ind, det PTIME (Thm. 5) P OSS #EP OSS mux ind EP OSS ⊥ cie NP-hard (Thm. 6) P OSS ⊥ mie NP-hard (Thm. 7) EP OSS 0/ #EP OSS > mie PTIME (Thm. 8) Table 1: Summary of results Fig. 1: Variants and PTIME reductions 3 Problem and general bounds We now define the P OSS problem formally, in its decision and computation variants. Definition 4. Given a class PrXMLC , the possibility problem for unordered documents P OSSC6< is to determine, given as input an unordered PrXMLC document D and an unordered XML document W , whether W is a possible world of D, namely, D(W ) > 0. The possibility problem for ordered documents P OSSC< is the same problem except that both D and W are ordered. For o ∈ {6<, <}, the #P OSSCo problem is the counting variant of P OSSCo , whose output is D(W ). For brevity, we write P OSSC⊥ and P OSSC> when describing lower or upper complexity bounds that apply to both P OSSC< and P OSSC6< . We start by giving straightforward bounds on the most general problem variants: Prop. 1. P OSSfie fie > is in NP and #P OSS > is in FP . #P cie Prop. 2. P OSS⊥ is NP-complete, even when D has height 3. Local models on ordered documents are known to be tractable using tree automata: mux,ind,det Theorem 1 ([6]). #P OSS< can be solved in polynomial time. 4 Local models We now complete the picture for the local model PrXMLmux,ind,det on unordered docu- ments. The results of [6] cannot be applied to this setting, as the ambiguity of node labels imply that we cannot impose an arbitrary order on document nodes; indeed, a reduction from perfect matching counting on bipartite graphs shows that the computation variant is hard even on the most inexpressive classes: Theorem 2. #P OSSind mux 6< and #P OSS 6< are #P-hard, even when D has height 4. By contrast, the decision variant is tractable for PrXMLind and PrXMLmux , using a dynamic algorithm. However, allowing both ind and mux, or allowing det nodes, leads to intractability (by reductions from set cover and Boolean satisfiability). Theorem 3. P OSSind mux > and P OSS > can be decided in PTIME. ,det ,det ,ind Theorem 4. P OSSind 6< , P OSSmux 6< and P OSSmux 6< are NP-complete, even when D has height 4. 5 Explicit matches We now attempt to understand how the overall hardness of P OSS is caused by the difficulty of finding how the possible world W can be matched to D. Definition 5. A candidate match of W in D is an injective mapping f from the nodes of W to the regular nodes of D such that, if r is the root of W then f (r) is the root of D, and if n is a child of n0 in W then there is a descending path from f (n) to f (n0 ) going only through probabilistic nodes. Intuitively, candidate matches are possible ways to generate W from D, ignoring probabilistic annotations, assuming we can keep exactly the regular nodes of D that are in the image of f . There are exponentially many candidate matches in general, so it is natural to ask whether P OSS is tractable if all matches are explicitly provided as input: Definition 6. Given a class PrXMLC and o ∈ {⊥, 6<, <, >}, the P OSS problem with explicit matches EP OSSCo is the same as the P OSSCo problem except that the set of the candidate matches of W in D is provided as input (in addition to D and W ). The explicit matches variant generalizes many practical cases where all possible matches can be computed in polynomial time; for instance, when node labels are assumed to be unique, or unambiguous in the sense that no two sibling nodes carry the same label. We first note that explicit matches ensure tractability of all local dependency models, by reduction to deterministic tree automata [6], this time also for unordered documents. Intuitively, we can consider all candidate matches separately and compute the probability of each one, in which case no label ambiguity remains so any order can be imposed: ,ind,det Theorem 5. #EP OSSmux > can be solved in polynomial time. For long-distance dependencies, however, it is easily seen that P OSS is still hard with conjunction of events, even if explicit matches are provided: Theorem 6. EP OSScie ⊥ is NP-complete, even when D has height 3. This being said, it turns out that the hardness is really caused by event conjunctions. To see this, we introduce the PrXMLmie class, which allows only individual events: Definition 7. The PrXMLmie class features multivalued independent events taking their values from a finite set V (beyond t and f, with probabilities summing to 1), and proba- bilistic mie nodes whose child edges are annotated by a single event e and a value x ∈ V . A mie node cannot be the child of a mie node. When evaluating D under a valuation ν, child edges of mie nodes labeled (e, x) should be kept if and only if ν(e) = x. Note that mie hierarchies are forbidden (because they can straightforwardly encode conjunctions), so that PrXMLmie does not capture ind hierarchies. However, as we introduced it with multivalued (not just Boolean) events, it captures PrXMLmux : Prop. 3. We can rewrite PrXMLmux to PrXMLmie and PrXMLmie to PrXMLcie in PTIME. In the PrXMLmie class, the P OSS problem is still NP-hard, by reduction to exact cover; however, with explicit matches, the #P OSS problem is tractable, both in the ordered and unordered setting, despite the long-distance dependencies. Intuitively, the candidate matches are mutually exclusive, and each match’s probability can be computed as that of a conjunction of equalities and inequalities on the events at the frontier. Theorem 7. P OSSmie ⊥ is NP-complete, even when D has height 3 and events are Boolean. Theorem 8. #EP OSSmie > can be solved in polynomial time. 6 Conclusion We have characterized the complexity of the counting and decision variants of P OSS for unordered or ordered XML documents, and various PrXML classes. With explicit matches, #P OSS is tractable unless event conjunctions are allowed. Without explicit matches, P OSS is hard unless dependencies are local; in this case, if the documents are ordered, #P OSS is tractable, otherwise #P OSS is hard and P OSS is tractable only with ind or mux nodes (and hard if both types, or det nodes, are allowed). Our results are summarized in Table 1 on page 4. Further work could study more precisely the effect of det nodes and ind hierarchies, for instance by attempting to extend the PrXMLmie class to capture them, or try to understand whether there is a connection between the algorithms of [6] and the proof of Thm. 3. It would also be interesting to determine under which conditions (beyond unique labels) can candidate matches be enumerated in polynomial time, so that the P OSS problem reduces to the explicit matches variant. Last but not least, another natural problem setting is to allow the order on sibling nodes of D to be partly specified. This question is already covered in [6], but only when all of the possible orderings are explicitly enumerated: investigating the tractability of P OSS for more compact representations, such as partial orders, is an intriguing problem. Acknowledgements. The author thanks Pierre Senellart for careful proofreading, useful suggestions, and insightful feedback, the anonymous referees for their valuable com- ments, and M. Lamine Ba and Tang Ruiming for helpful early discussion. This work has been partly funded by the French government under the X-Data project. References 1. A. Amarilli. The possibility problem for probabilistic XML (extended version). CoRR, 2014. http://arxiv.org/abs/1404.3131. 2. A. Amarilli and P. Senellart. On the connections between relational and XML probabilistic data models. In Proc. BNCOD, pages 121–134, Oxford, United Kingdom, 2013. 3. M. L. Ba, T. Abdessalem, and P. Senellart. Merging uncertain multi-version XML documents. Proc. DChanges, 2013. 4. M. L. Ba, T. Abdessalem, and P. Senellart. Uncertain version control in open collaborative editing of tree-structured documents. In Proc. DocEng, pages 27–36, 2013. 5. P. Barceló, L. Libkin, A. Poggi, and C. Sirangelo. XML with incomplete information. JACM, 58(1):4, 2010. 6. S. Cohen, B. Kimelfeld, and Y. Sagiv. Running tree automata on probabilistic XML. In Proc. PODS, pages 227–236. ACM, 2009. 7. E. Kharlamov, W. Nutt, and P. Senellart. Updating probabilistic XML. In Proc. Updates in XML, Lausanne, Switzerland, 2010. 8. B. Kimelfeld, Y. Kosharovsky, and Y. Sagiv. Query evaluation over probabilistic XML. VLDB Journal, 18(5):1117–1140, 2009. 9. B. Kimelfeld and P. Senellart. Probabilistic XML: Models and complexity. In Z. Ma and L. Yan, editors, Advances in Probabilistic Databases for Uncertain Information Management, pages 39–66. Springer-Verlag, 2013. 10. L. Libkin. Elements of Finite Model Theory. Springer, 2004. 11. D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Morgan & Claypool, 2011.