Set-Valued Probabilistic Sentential Decision Diagrams

Alessandro Antonucci and Alessandro Facchini
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale, Manno-Lugano (Switzerland)
{alessandro,alessandro.facchini}@idsia.ch

Abstract. Probabilistic sentential decision diagrams are a class of arithmetic circuits locally annotated by probability mass functions. This allows for a compact representation of joint mass functions in the presence of logical constraints. Here we propose a set-valued generalisation of the probabilistic quantification in these models, which allows the sharp specification of the local probabilities to be replaced by linear constraints over them. This induces a (convex) set of joint probability mass functions, all consistent with the assigned logical constraints. Such models support a cautious learning of the local parameters when only small amounts of data are available. Algorithmic strategies to compute the lower and upper bounds of marginal and conditional queries with respect to these sets of joint mass functions are sketched. The task can be achieved in time linear in the diagram size for marginal queries and, if the diagram is singly connected, the same holds for conditional queries.

A simple example showing the need for coherently combining logic and probability is considered in [7]. Assume we are given a report of the results of 100 students in four subjects, namely logic (L), knowledge representation (K), probability (P) and AI (A). From these data, shown in Table 1, we want to learn a probabilistic model such as a Bayesian network (BN) [8]. The white nodes in Figure 1 show a BN model that assumes conditional independence between K and P given A and L. To learn the BN parameters we use a Laplace prior of strength one, which assigns probability Pr(X = x) = [n(X = x) + |X|^{-1}] / [n(·) + 1] to state x of variable X, where |X| is the cardinality of X and n is a counting function for the data set (e.g., Pr(L = 0) = 70.5/101). Out of the sixteen possible combinations, Table 1 reports no observations for eight cases (grey rows). Yet, seven of these eight cases (dark grey) are not observed because of logical constraints: L ∨ P (it is compulsory for a student to take either logic or probability), A → P (probability is a prerequisite for AI), and K → A ∨ L (either AI or logic is a prerequisite for knowledge representation). The light grey row corresponds instead to a possible configuration for which we simply have no observations. Yet, the BN cannot distinguish between these two cases, and non-zero probabilities can therefore be assigned to impossible events (e.g., Pr(A = 1, P = 0) ≈ 0.005).

The logical constraints can be embedded in the BN by adding auxiliary nodes (grey nodes in Figure 1). Each constraint induces a Boolean variable which is a child of all the variables involved in the constraint and whose conditional probability table has zero/one values, so that the variable is true if and only if the constraint is satisfied. The model is then queried given that the auxiliary variables implementing the constraints are in their true state. Such a BN approach to the learning of probability mass functions subject to domain constraints might increase the treewidth of the original graphical model, thus potentially slowing down inference (exact inference is exponential in this parameter).
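The Laplace quantification above can be made concrete with a minimal sketch (for exposition only; the helper name laplace_estimate is not from the cited works). It computes [n(X = x) + |X|^{-1}] / [n(·) + 1] from a table of counts and, with n(L = 0) = 70 and n(·) = 100 as in Table 1, reproduces Pr(L = 0) = 70.5/101.

```python
from fractions import Fraction

def laplace_estimate(counts):
    """Laplace prior of strength one: each of the |X| states of X gets a
    pseudo-count of 1/|X|, so Pr(X = x) = [n(x) + |X|^(-1)] / [n(.) + 1]."""
    total = sum(counts.values())        # n(.)
    pseudo = Fraction(1, len(counts))   # |X|^(-1)
    return {x: (n + pseudo) / (total + 1) for x, n in counts.items()}

# Binary variable L with n(L=0) = 70 and n(L=1) = 30 (Table 1):
theta = laplace_estimate({0: 70, 1: 30})
print(theta[0])   # 141/202, i.e. 70.5/101 as in the text
```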
Table 1. Results of 100 students in four subjects (one row per configuration of L, K, P, A; the eight zero-count rows are the grey rows discussed in the text).

L K P A    #
0 0 0 0    0
0 0 0 1    0
0 0 1 0    6
0 0 1 1   54
0 1 0 0    0
0 1 0 1    0
0 1 1 0    0
0 1 1 1   10
1 0 0 0    5
1 0 0 1    0
1 0 1 0    1
1 0 1 1    0
1 1 0 0   13
1 1 0 1    0
1 1 1 0    8
1 1 1 1    3

Fig. 1. A Bayesian network over the four variables L, K, P, A (white nodes). Grey nodes (P ∨ L, A → P, K → A ∨ L) implement the logical constraints.

Probabilistic sentential decision diagrams (PSDDs for short) [7] offer an elegant alternative approach to the learning of probabilistic models from data subject to logical constraints. A PSDD is a circuit representation of a joint probability mass function assigning zero probability to the impossible states of the logical constraints. More specifically, it is a parametrised directed acyclic graph. Each non-root node is either a logical AND gate with two inputs, or a logical OR gate with an arbitrary number of inputs whose incoming wires are annotated with the probabilities of a conditional mass function (which can be learned from the data). These two types of nodes alternate. Each root node is a univariate mass function, assigning probability one to X when X is always true, probability one to ¬X when X is always false, or probability θ to X (written θ : X) when X is true with probability θ (conditional on a certain context encoded by the circuit). E.g., the circuit in Figure 2, with sharp numbers instead of intervals associated with the wires, defines a PSDD consistent with the logical constraints of the example. Inference in PSDDs, intended here as the computation of the probability of marginal or conditional queries on single variables, takes time linear in the size of the graph. Dedicated algorithms have been developed to identify the smallest circuit modelling a given formula [2]. This makes inferences based on a PSDD typically faster than those based on a BN with auxiliary nodes implementing the same constraints [7]. Yet, PSDDs embed context-specific independence relations, and the problem of trading off the search for a small circuit against the learning of proper dependence relations from the data is still relatively unexplored [9], while many techniques have been proposed for the structural learning of BNs [8].

From sharp to interval probabilities. Both BN and PSDD parameters are probabilities to be learned from the data set. These are sharp numerical values corresponding to the mean of a posterior distribution or the maximum of a likelihood. Yet, standard inference algorithms adopt only the sharp estimators, while propagating the whole distribution might increase the inferential complexity (e.g., [6]). This might be an issue especially if few data are available [13] or if some data are missing according to an asystematic incompleteness process [4]. In these cases, coping with sets instead of single probability mass functions might offer a more cautious, and hence more reliable, approach to probabilistic modelling.^1 In particular, the imprecise Dirichlet model (IDM) [13] replaces the above estimate for Pr(X = x) with the interval [n(x), n(x) + 1]/(n(·) + 1). Thus, for instance, Pr(L = 0) ∈ [70, 71]/101. The IDM has often been applied to BN quantification from data (e.g., [1]). The approach transforms the BN into a credal network [3]: the local conditional probability mass functions, associated with each variable and each configuration of the parent variables, are replaced by sets of probability mass functions consistent with such interval constraints.
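A minimal sketch of the IDM quantification (again for exposition only; the helper name idm_interval is ours) computes the interval [n(x), n(x) + s]/(n(·) + s) with strength s = 1 and reproduces Pr(L = 0) ∈ [70, 71]/101.

```python
from fractions import Fraction

def idm_interval(n_x, n_total, s=1):
    """Imprecise Dirichlet model interval for Pr(X = x):
    [n(x) / (n(.) + s), (n(x) + s) / (n(.) + s)], here with strength s = 1."""
    return (Fraction(n_x, n_total + s), Fraction(n_x + s, n_total + s))

# n(L = 0) = 70 out of n(.) = 100 records (Table 1):
low, up = idm_interval(70, 100)
print(low, up)   # 70/101 71/101, i.e. Pr(L = 0) in [70, 71]/101
```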
The same can be done with PSDDs by replacing each probability mass function annotating the incoming wires of an OR gate or a root node with an IDM-based set of probability mass functions.^2 This might be valuable, as the local parameters of a PSDD are conditional probabilities based on a context: the higher the number of variables in the context, the less data might be available to learn the probabilities, and the IDM therefore appears more reliable from a modelling perspective. As an example, Figure 2 depicts a PSDD with interval-valued weights obtained in this way. We call such a model a credal PSDD (CPSDD). Exactly as a PSDD defines a joint probability mass function assigning non-zero probability only to the joint states satisfying the logical constraints, a CPSDD defines a (credal) set of joint probability mass functions assigning non-zero probability only to the non-impossible configurations. The model semantics is as expected: each element of this set of joint mass functions corresponds to a PSDD whose (sharp) parameters have values consistent with the intervals in the CPSDD. Inferences in CPSDDs are then intended as the computation of the lower and upper bounds of a query with respect to such a set of joint probability mass functions. Notably, it seems possible to apply to PSDDs the same strategy proposed by [10] for sum-product networks (SPNs) with set-valued specifications of the parameters. Yet, compared to SPNs, PSDDs allow one to cope with logical constraints. For joint or marginal queries, the inference corresponds to a simple bottom-up propagation which can be easily achieved by interval arithmetic. E.g., the query Pr(A = 0, P = 0) corresponds to a product of IDM intervals and has value [540/3131, 19/101]. For conditional queries the situation might be more involved, as local (linear) optimisation problems must be solved during the intermediate steps of the propagation. By a derivation analogous to that in [10], it is possible to argue that for CPSDDs with singly connected diagrams, i.e., such that each node has exactly one child, inference can be computed in time linear in the size of the circuit.

^1 Other extensions of classical probability theory have been proposed. Among others, we mention evidence theory [11], possibility theory [5], and fuzzy set theory [15]. The set-valued approach we consider here is instead based on imprecise probability theory [12] and can be regarded as a generalisation of all the previous ones [14], with a direct sensitivity-analysis interpretation. To the best of our knowledge, none of the other formalisms has been applied to PSDDs so far.

^2 If the application of the IDM involves zero counts because of the logical constraints, the IDM interval associated with the impossible event is replaced by a sharp zero.

Fig. 2. A credal PSDD. [Circuit diagram not reproduced: AND/OR gates over the literals of L, K, P, A, with IDM intervals annotating the OR-gate wires and root nodes, e.g. [10, 11]/101, [30, 31]/101 and [60, 61]/101 at the top OR gate, [18, 19]/31 and [12, 13]/31 on inner wires, and [24, 25]/31, [3, 4]/13 and [54, 55]/61 at root nodes.]
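The bottom-up propagation for marginal queries can be sketched with exact interval arithmetic. Assuming, as suggested by Figure 2, that the query Pr(A = 0, P = 0) is obtained as the product of the wire intervals [30, 31]/101 and [18, 19]/31, the following illustrative fragment (helper names are ours) recovers the bounds [540/3131, 19/101] stated above.

```python
from fractions import Fraction

def interval_product(a, b):
    """Product of two intervals of non-negative numbers:
    multiply lower bound with lower bound, upper with upper."""
    return (a[0] * b[0], a[1] * b[1])

def interval_sum(a, b):
    """Sum of two intervals, as used when an OR gate adds up the
    (already weighted) contributions of its inputs."""
    return (a[0] + b[0], a[1] + b[1])

# Wire intervals read off Figure 2 (assumed to be the ones relevant
# to the query Pr(A = 0, P = 0)):
w_or   = (Fraction(30, 101), Fraction(31, 101))   # top OR-gate wire
w_cond = (Fraction(18, 31),  Fraction(19, 31))    # inner wire on the same branch

low, up = interval_product(w_or, w_cond)
print(low, up)   # 540/3131 19/101, the bounds reported in the text
```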
References

1. Alessandro Antonucci and Giorgio Corani. The multilabel naive credal classifier. International Journal of Approximate Reasoning, 83:320–336, 2016.
2. Arthur Choi and Adnan Darwiche. Dynamic minimization of sentential decision diagrams. In AAAI, 2013.
3. Fabio G. Cozman. Credal networks. Artificial Intelligence, 120:199–233, 2000.
4. Gert de Cooman and Marco Zaffalon. Updating beliefs with incomplete observations. Artificial Intelligence, 159(1–2):75–125, 2004.
5. Didier Dubois and Henri Prade. Possibility theory. In Computational Complexity, pages 2240–2252. Springer, 2012.
6. Lance Kaplan and Magdalena Ivanovska. Efficient belief propagation in second-order Bayesian networks for singly-connected graphs. International Journal of Approximate Reasoning, 93:132–152, 2018.
7. Doga Kisa, Guy Van den Broeck, Arthur Choi, and Adnan Darwiche. Probabilistic sentential decision diagrams. In KR, 2014.
8. Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
9. Yitao Liang, Jessa Bekker, and Guy Van den Broeck. Learning the structure of probabilistic sentential decision diagrams. In UAI, 2017.
10. Denis Deratani Mauá, Diarmaid Conaty, Fabio Gagliardi Cozman, Katja Poppenhaeger, and Cassio Polpo de Campos. Robustifying sum-product networks. International Journal of Approximate Reasoning, 2018.
11. Glenn Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
12. Peter Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, 1991.
13. Peter Walley. Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society, Series B (Methodological), pages 3–57, 1996.
14. Peter Walley. Towards a unified theory of imprecise probability. International Journal of Approximate Reasoning, 24:125–148, 2000.
15. Lotfi A. Zadeh. A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications, 9(1):149–184, 1983.