Towards Deterministic Decomposable Circuits for Safe Queries Mikaël Monet1 and Dan Olteanu2 1 LTCI, Télécom ParisTech, Université Paris-Saclay, France, Inria Paris; Paris, France mikael.monet@telecom-paristech.fr 2 University of Oxford dan.olteanu@cs.ox.ac.uk Abstract. There exist two approaches for exact probabilistic inference of UCQs on tuple-independent databases. In the extensional approach, query evaluation is performed within a DBMS by exploiting the structure of the query. In the intensional approach, one first builds a representa- tion of the lineage of the query on the database, then computes the probability of the lineage. In this paper we propose a new technique to construct lineage representations as deterministic decomposable circuits in PTIME. The technique can apply to a class of UCQs that has been conjectured to separate the complexity of the two approaches. We test our technique experimentally, and show that it succeeds on all the queries of this class up to a certain size parameter, i.e., over 20 million queries. 1 Introduction Probabilistic databases [1] have been introduced in answer to the need to cap- ture data uncertainty and reasoning about it. This uncertainty can come from various angles: imperfect sensor precision of scientific data, imprecise automatic processes (e.g., natural language processing, rule mining in knowledge bases), untrusted data sources (e.g., web crawling), etc. In their simplest and most com- mon form, probabilistic databases consist of a relational database where each tuple is annotated with a probability value that is supposed to represent how confident we are about having this tuple in the database. While a traditional (deterministic) database can only satisfy or violate a Boolean query, a proba- bilistic database has a certain probability of satisfying it. Given a Boolean query Q the probabilistic query evaluation problem for Q (PQE(Q)) then asks for the probability that the query holds on an input probabilistic database. We measure the complexity of PQE(Q) as a function of the input database, hence consider- ing that the Boolean query Q is fixed. This is known as data complexity, and is motivated by the fact that the queries are usually much smaller than the data. Unfortunately, even for very simple queries, PQE(Q) can be intractable. When Q is a union of conjunctive queries (UCQ), a dichotomy result is provided by the work of Dalvi and Suciu [2]: either Q is safe and PQE(Q) is PTIME, or Q is not safe and PQE(Q) is #P-hard. The algorithm to compute the probability of a safe UCQ exploits the first order structure of the query to find a so called safe query plan (using extended relational operators that can manipulate prob- abilities) and can be implemented within a DBMS. This approach is referred to as extensional query evaluation, or lifted inference. A second approach to PQE is intensional query evaluation or grounded in- ference, and consists of two steps. First, compute a representation of the lineage of the query Q on the database D, which is a Boolean formula intuitively repre- senting which tuples of D suffice to satisfy Q. Second, perform weighted model counting on the lineage to obtain the probability. To ensure that model counting is tractable, we use the structure of the query to represent the lineage in tractable formalisms from the field of knowledge compilation, such as read once Boolean formulas, free or ordered binary decision diagrams (OBDDs, FBDDs), deter- ministic decomposable normal forms (d-DNNFs), decision decomposable normal forms (dec-DNNFs), deterministic decomposable circuits (d-Ds), etc. The main advantage of this approach compared to lifted inference is that the lineage can help explain the query answer. Moreover, having the lineage in a good knowledge compilation formalism can be useful for other applications: we could for instance change the tuples’ probabilities and compute the new result easily, or compute the most probable state of the database that satisfies the query. What we call the q9 conjecture, formulated by Dalvi, Jha, and Suciu [2, 3], states that for safe queries, extensional query evaluation is strictly more powerful than the knowledge compilation approach. Or in other words, that there exists a query which is safe (i.e., can be handled by the extensional approach) whose lineages on arbitrary databases cannot be computed in PTIME in a knowledge compilation formalism that allows tractable weighted model counting (i.e., can- not be handled by the intensional approach). Note that the conjecture depends on the tractable formalism that we consider. The conjecture has recently been shown by Beame, Li, Roy, and Suciu [4] to hold for the formalism of dec-DNNFs (including OBDDs and FBDDs), which captures the traces of modern model counting algorithms. Another independent result by Bova and Szeider [5] shows that the conjecture also holds when we consider the class of deterministic struc- tured negation normal forms (d-SDNNFs), which are d-DNNFs that follow the structure of a v-tree [6]. However the question is still open for more expressive formalisms, namely, d-DNNFs and d-Ds. Maybe the conjecture fails for such ex- pressive formalisms, i.e., maybe the reason why PQE is PTIME for safe queries is because we can build deterministic decomposable circuits in PTIME for them? In this paper we focus on a class of queries (the H-queries) that was conjec- tured in [2, 3] to separate the two approaches and that was used to prove the conjecture for dec-DNNFs [4] and d-SDNNFs [5]. Our first contribution is to develop a new technique to build d-DNNFs and d-Ds in polynomial time for the H-queries, based on what we call nice Boolean functions. Because we were not able to prove that this technique works for all the safe H-queries, our second contribution is to test this technique with the help of the SAT solver Glucose [7] on all the H queries up to a certain size parameter, that we generated automat- ically. We found no query on which it does not work. Interestingly, we found a few queries for which we can build d-Ds with a single internal negation at the very top, whereas we do not know if we can build d-DNNFs (could these queries separate UCQ(d-DNNF) and UCQ(d-D)?). We conjecture that this technique can build d-Ds for all safe H-queries. To do this analysis, we had to solve a task of independent interest, namely, computing explicitly the list of all inequivalent monotone Boolean functions on 7 variables. This task had previously been undertaken by Cazé, Humphries, and Gutkin [8] and by Stephen and Yusun [9]. We reused parts of the code from [8] and confirmed the number of such functions: 490, 013, 148. Paper structure We start our presentation with preliminaries in Section 2. We then define the H-queries in Section 3 and review what is known about them. In Section 4 we introduce our technique, and we experimentally demonstrate its effectiveness in Section 5. Our code and all the functions are available online [10]. 2 Preliminaries We will consider in this work the most commonly used model for probabilistic databases: the tuple-independent model, where each tuple is annotated with a probability of being present or absent, assuming independence across tuples: Definition 1. A tuple-independent (TID) database is a pair (D, π) consisting of a relational instance D and a function π mapping each tuple t ∈ D to a rational probability π(t) ∈ [0; 1]. A TID instance (D, π)Qdefines a probability dis- tribution Pr on D0 ⊆ D, where Pr(D0 ) := t∈D0 π(t)× t∈D\D0 (1−π(t)). Given Q a Boolean query Q, the probabilistic query evaluation problem for Q (PQE(Q)) asks, given as input a TID instance (D, π), the probability P that Q is satisfied in the distribution Pr. That is, formally, Pr(Q, (D, π)) := D0 ⊆D s.t. D0 |=Q Pr(D0 ). Dalvi and Suciu [2] have shown a dichotomy result on UCQs for PQE: either Q is safe and PQE(Q) is PTIME, or Q is not safe and PQE(Q) is #P-hard. Moreover they show that all the safe queries can be handled by the extensional approach, i.e., by using the structure of the query to compute the probability. Due to space constraints, we point to [2] for a presentation of their algorithm to compute the probability of a safe query, though it is not strictly necessary to understand the current paper. We denote by UCQ(P) the set of safe UCQs (hence which corresponds to the set of tractable UCQs if P 6= #P). By contrast, in the intentional approach, one first computes a representation of the lineage Lin(Q, D) of the query Q on the instance D: Definition 2. The lineage of a Boolean query Q over D is a Boolean formula Lin(Q, D) on the tuples of D mapping each Boolean valuation ν : D → {0, 1} to 1 or 0 depending on whether Dν satisfies Q or not, where Dν := {t ∈ D | ν(t) = 1}. The lineage can be represented with any formalism that represents Boolean functions (Boolean formulas, BDDs, Boolean circuits, etc), but the crucial idea is to use a formalism that allows tractable probability computation. In this work we will specifically focus on deterministic decomposable circuits (d-Ds) and deterministic decomposable normal forms [11] (d-DNNFs). Definition 3. Let C be a Boolean circuit (featuring and, or, not, and variable gates). An and-gate g of C is decomposable if for every two input gates g1 6= g2 of g we have Vars(g1 ) ∩ Vars(g2 ) = ∅, where Vars(g) denotes the set of variable gates that have a directed path to g in C. We call C decomposable if each and- gate is. An or-gate g of C is deterministic if there is no pair g1 6= g2 of input gates of g and valuation ν of the variables such that g1 and g2 both evaluate to 1 under ν. We call C deterministic if each or-gate is. A negation normal form (NNF) is a circuit in which the inputs of not-gates are always variable gates. Probability computation is in linear time for d-Ds (hence, for d-DNNFs): to compute the probability of a d-D, compute by a bottom-up pass the probability of each gate, where and gates are evaluated using ×, or gates using +, and not gates using 1−x. While there does not seem to be any interest in using d-DNNFs rather that d-Ds for probabilistic databases, we are also interested by d-DNNFs from a knowledge compilation point of view, as it is currently not known if d-Ds are strictly more succinct than d-DNNFs. We write UCQ(d-DNNF) (resp., UCQ(d-D)) to denote the set of UCQs Q such that for any database instance D, we can compute in polynomial time (in data complexity) a d-DNNF (resp., d-D) representation of Lin(Q, D). For a study of the intensional approach using weaker formalisms for Boolean functions (read once formulas, ordered and free binary decision diagrams), see [3]. Hence we have: UCQ(d-DNNF) ⊆ UCQ(d-D) ⊆ UCQ(P) (1) Dalvi, Jha, and Suciu [2, 3] conjectured that the inclusion UCQ(d-D) ⊆ UCQ(P) is strict, i.e., that the extensional approach is strictly more power- ful than the intensional approach, and proposed a candidate query to separate these classes (named q9 and that we define in the next section). The purpose of this paper is to study this conjecture. 3 The H-queries We define in this section the H-queries and review what is known about them. The building blocks of these queries are the queries hki , which were first defined in the work of Dalvi and Suciu to show the hardness of UCQs that are not safe: Definition 4. Let k ∈ N, k ≥ 1. The queries hki for 0 ≤ i ≤ k are defined by: – hk0 = ∃x∃y R(x) ∧ S1 (x, y); – hki = ∃x∃y Si (x, y) ∧ Si+1 (x, y) for 1 ≤ i < k; – hkk = ∃x∃y Sk (x, y) ∧ T (y). We define the H-queries to be combinations of queries hki , as in [4]: Definition 5. For k ≥ 1, we define the set of variables hki := {0, . . . , k}. Given a Boolean function φ on variables hki, we define the Boolean query Qφk to be the query represented by the first order formula φ[0 7→ hk0 , . . . , k 7→ hkk ], i.e., φ where we substituted each variable i ∈ hki by the formula hki . The query class Hk (resp., Hk+ ) is then the set of queries Qφk when φ ranges over all Boolean functions (resp., monotone Boolean functions) on variables hki. ∞ ∞ We finally define H (resp., H+ ) to be Hk+ ). Observe that the S S Hk (resp., k=1 k=1 queries in H+ are in particular UCQs. Example 1. Let k = 3, and φ9 be the monotone Boolean function (2 ∨ 3) ∧ (0 ∨ 3) ∧ (1 ∨ 3) ∧ (0 ∨ 1 ∨ 2). Then Qφ3 9 represents the query q9 = (h32 ∨ h33 ) ∧ (h30 ∨ h33 ) ∧ (h31 ∨ h33 ) ∧ (h30 ∨ h31 ∨ h32 ) ∈ H3+ , which is safe and was conjectured in [2, 3] not to be in UCQ(d-D). To study the H-queries, we need the following notions on Boolean functions: Definition 6. Let φ be a Boolean function on variables hki. We will always consider a valuation ν of hki simply as the set of variables that ν maps to 1. We write SAT(φ) the set of satisfying valuations of φ. We say that φ depends on variable l ∈ hki if there exists a valuation ν ⊆ hki such that φ(ν∪{l}) 6= φ(ν\{l}). We write DEP(φ) ⊆ hki for the set of variables on which φ depends. We call φ and Qφk nondegenerate if DEP(φ) = hki (and degenerate otherwise). Then, if φ is degenerate (i.e., does not depend on all its k + 1 variables), Qφk is safe and is in UCQ(d-DNNF) (in fact, even in UCQ(OBDD)): Proposition 1 (Theorem 3.12 of [4], or Lemma 3.8 of [12]). Let k ≥ 1, and Qφk ∈ Hk with DEP(φ) ( hki. Then Qφk ∈ UCQ(d-DNNF). This is in contrast to when φ is nondegenerate. Indeed, Beame, Li, Roy, and Suciu then show [4] that Qφk do not admit polynomial sized decision decompos- able NNFs (dec-DNNF). A dec-DNNF is a d-DNNF in which the determinism of or gates is restricted to simply choosing the value of a variable [13, 14]. That is, each or gate is of the form (v ∧ g) ∨ (¬v ∧ g 0 ) for some variable v. In fact, they show a lower bound for more general representations than dec-DNNFs, namely, for what they called Decomposable Logic Decision Diagrams (DLDDs), which generalise dec-DNNFs in that they allow negations at arbitrary places and also allow decomposable binary operator gates. When φ is monotone, another independent lower bound by Bova and Szeider [5] tells us than we cannot im- pose structuredness either (i.e., use d-SDNNFs) when φ is nondegenerate. These results mean that for such queries, one cannot restrict too much the expres- sivity of determinism. The question is then: do the nondegenerate queries have polynomial sized d-DNNFs (or d-Ds)? Let us first see what the dichotomy theorem tells us about H-queries that are nondegenerate. We shall restrict our attention to monotone functions now, i.e., to queries in H+ , because the dichotomy theorem applies only to UCQs, and Qφk is not a UCQ when φ is not monotone. We need to define the CNF lattice of φ: Definition 7. Let φ be a monotone Boolean function on variables hki such that DEP(φ) = hki, and let FCNF = C0 ∧ . . . ∧ Cn be the (unique) minimized CNF representing φ, where we see each clause S simply as the set of variables that it contains. For s ⊆ hni, we define ds := Ci . Note that d∅ is ∅, and that we can i∈s have ds = ds0 for s 6= s0 . The CNF lattice of φ is the lattice (L, ≤), where L is {ds | s ⊆ hni}, and where ≤ is reversed set inclusion. In particular, the top element 1̂ of LCNF is ∅, while its bottom element 0̂ is hki (because φ depends on all the variables, hence each variable is in at least one clause). Example 2. The Hasse diagram of the CNF lattice of φ9 is shown in Figure 1 (ignore for now the values at the right inside the nodes). ∅:1 {2, 3} : -1 {1, 3} : -1 {0, 3} : -1 {1, 2, 3} : 1 {0, 2, 3} : 1 {0, 1, 3} : 1 {0, 1, 2} : -1 {0, 1, 2, 3} : 0 Fig. 1. CNF lattice of φ9 with the values µ(n, 1̂) for each node n. The criterion of Dalvi and Suciu is based on the Mobius function [15] of the CNF lattice L of φ. The Mobius function µ : L × L → PZ on L is defined on pairs (u, v) with u ≤ v by µ(u, u) = 1 and µ(u, v) = − µ(w, v) for u < v. The u