Probabilistic Abductive Logic Programming using Dirichlet Priors

Calin Rares Turliuc (1), Luke Dickens (2), Alessandra Russo (1), and Krysia Broda (1)

(1) Department of Computing, Imperial College London, United Kingdom
(2) Department of Information Studies, University College London
{ct1810, a.russo, k.broda}@imperial.ac.uk, l.dickens@ucl.ac.uk

Abstract. Probabilistic logic programming has traditionally focused on languages where probabilities or weights are specified or inferred directly, rather than through Bayesian priors. To address this limitation, we propose a probabilistic logic programming language that bridges the gap between logical and probabilistic inference in categorical models with Dirichlet priors. The language is described in terms of its general plate model, its syntax and semantics, and the relation between the three. A prototype implementation is evaluated on two case studies: latent Dirichlet allocation (LDA) on synthetic data, where we compare it with collapsed Gibbs sampling, and the repeated insertion model (RIM) on real data. Universal probabilistic programming does not always scale beyond toy examples on such models; our results are promising, however, showing that our inference yields results similar to the state-of-the-art, model-specific implementations reported in the literature.

Keywords: probabilistic programming, Bayesian inference, abductive logic programming, latent Dirichlet allocation, repeated insertion model

1 Introduction

Probabilistic programming is an area of research that aims to generalize inference in probabilistic models specified as inputs to a programming language. In this context, evaluating the program corresponds to prediction with, or inference on, the described model. A wide range of probabilistic programming languages (PPLs) have been developed, based on different programming languages and expressing a variety of classes of probabilistic models. Examples of PPLs include Church [8], Anglican [18], BUGS [15], Stan [22] and Figaro [19] (for a more comprehensive list cf. http://probabilistic-programming.org/). While some PPLs, such as Church, typically enrich a functional programming language with exchangeable random primitives, there also exist logic-based PPLs that add probabilistic annotations or primitives to a logical encoding of the model. This encoding usually relates either to first-order logic, e.g. Alchemy [6] and BLOG [17], or to logic programming, e.g. PRiSM [21] and ProbLog [7].

Typical PPLs based on functional programming can express a wide range of probabilistic models, and inference is based on general sampling algorithms. Existing logic-based PPLs mostly focus on discrete probabilistic models and generally do not consider Bayesian inference with prior distributions. For instance, Alchemy is a PPL which implements Markov logic, encoding a first-order knowledge base into a Markov random field. Here, uncertainty is expressed by weights on the logical formulae, and one cannot specify prior distributions on the weights. ProbLog is a PPL that primarily targets the inference of conditional probabilities and the most probable explanation (maximum likelihood solution), and it does not feature the specification of prior distributions.
PRiSM is a PPL which introduces conjugate Dirichlet priors over categorical distributions; however, it is limited to probabilistic models described at the abductive level by non-overlapping explanations, such as hidden Markov models and probabilistic context-free grammars.

These observations motivate the present paper: we develop a logic programming based PPL specialized in probabilistic models involving categorical variables with conjugate Dirichlet priors that can be encoded as abductive logic programs with overlapping explanations. The programs evaluated by our PPL are abductive logic programs [11] enriched with probabilistic definitions and inference queries. We consider as case studies latent Dirichlet allocation (LDA, [2]) and the repeated insertion model (RIM, [5]). The contributions of this paper are:

– the design of peircebayes, a logic programming based PPL for inference in discrete models with categorical variables and Dirichlet priors.
– the description of the class of probabilistic models that can be expressed in the PPL, and their relation to the language.
– a prototype implementation of the language. For probabilistic inference, we adapt the Gibbs sampling algorithm described in [10].
– the formulation of RIM [5] as a probabilistic program.
– the evaluation of the PPL on an LDA task with synthetic data and on a RIM task with real data.

The rest of the paper is organized as follows. In Section 2 we describe the class of probabilistic models supported by our PPL. Section 3 explains the key features of the syntax and semantics of the PPL. We present the results of two experiments with our PPL in Section 4. Finally, in Section 5 we relate our PPL to other PPLs and methods, and we conclude.

2 The Probabilistic Model

This section introduces peircebayes (named, in Church style, after Charles Sanders Peirce, the father of logical abduction, and Thomas Bayes, the father of Bayesian reasoning; pronounced [ˈpɜrsˈbeɪz]), referred to in the rest of the paper as PB. PB is a probabilistic logic programming language designed for inference in a subclass of the class of models called "propositional logic-based probabilistic models", described in [10].

Fig. 1. The PB plate model, with nodes α_a, θ_ai, x_nai, v_naij (j = 1, ..., k_a − 1) and f_n, on plates indexed by a = 1, ..., A, i = 1, ..., I_a and n = 1, ..., N. Unbounded nodes are constants, circle nodes are latent variables, shaded nodes are observed variables, diamond nodes are deterministic variables. A, I_a, N, and k_a are positive integers, k_a ≥ 2.

The general plate notation [4] of the models expressible in PB is given in Figure 1. The plate model encodes the following (joint) probability distribution:

$$P(f, v, x, \theta; \alpha) = \left( \prod_{a=1}^{A} \prod_{i=1}^{I_a} \left( P(\theta_{ai}; \alpha_a) \prod_{n=1}^{N} P(x_{nai} \mid \theta_{ai}) \, P(v_{nai*} \mid x_{nai}) \right) \right) \prod_{n=1}^{N} P(f_n \mid v_{n*}) \quad (1)$$

We use * to denote the set of variables obtained by iterating over the missing indexes, e.g. v_nai* is the set of all the variables v_naij, for j = 1, ..., k_a − 1, and v_n* is the set of all the variables v_naij, for a = 1, ..., A, i = 1, ..., I_a, j = 1, ..., k_a − 1. Unindexed variables are implicitly such sets, e.g. x = x_*.

In the model, each α_a, for a = 1, ..., A, is a vector of finite length k_a ≥ 2 of positive real numbers. Each α_a may have a different length, and it represents the parameters of a Dirichlet distribution. From each such distribution I_a samples are drawn, i.e.:

$$\theta_{ai} \sim \mathrm{Dirichlet}(\alpha_a) \qquad a = 1, \ldots, A, \; i = 1, \ldots, I_a$$

The samples θ are parameters of categorical distributions.
Sampling N times from the latter yields:

$$x_{nai} \sim \mathrm{Categorical}(\theta_{ai}) \qquad a = 1, \ldots, A, \; i = 1, \ldots, I_a, \; n = 1, \ldots, N$$

Each x_nai ∈ {1, ..., k_a} is encoded, similarly to [20], as a set of propositional variables v_naij ∈ {0, 1}, for j = 1, ..., k_a − 1, in the following manner:

$$P(v_{nai*} \mid x_{nai} = l) = \begin{cases} \overline{v}_{nai1} \cdots \overline{v}_{nai(l-1)} \, v_{nail}, & \text{if } l < k_a \\ \overline{v}_{nai1} \cdots \overline{v}_{nai(k_a-1)}, & \text{if } l = k_a \end{cases}$$

where $\overline{v}$ denotes boolean negation. Finally, the observed variables of the model, f_n ∈ {0, 1}, represent the output of boolean functions of v, such that:

$$P(f_n \mid v_{n*}) = [f_n = \mathrm{Bool}_n(v_{n*})] \qquad n = 1, \ldots, N$$

Bool_n(v) denotes an arbitrary boolean function of variables v, and [i = j] is the Kronecker delta function δ_ij. The observed value for each f_n is 1 (or true), as we will explain in the following paragraph.

Inference in PB can be described in direct relation to a general schema of probabilistic inference, i.e. the characterization of P(θ|Δ; α), where θ are parameters of interest, α are constants (hyper-parameters) and Δ is some observed data. In PB, the parameters and the hyper-parameters correspond to θ and α, respectively. The observed data is captured by f and is assumed to be a set of N data points or observations. By convention, the realization f_n = 1 ensures that the n-th observation is included in the model, and, as such, we assume this is always the case. Furthermore, f_n is independent of the other observations given x (since x determines v), as implied by the joint distribution in Equation 1. The various ways in which a data point can be generated, as well as the distributions involved in this process, are encoded through the boolean function Bool_n(v_n*) corresponding to the n-th data point. It is important to note that the data Δ can take any finite number of values, and Bool_n(v_n*) encodes the process of generating a single realization thereof.

Example. We illustrate the encoding of a popular probabilistic model for topic modelling, latent Dirichlet allocation (LDA) [2], as a PB model. This will also serve as a running example throughout Section 3. LDA can be summarized as follows: given a corpus of D documents, each document is a list of tokens, the set of all tokens in the corpus is the vocabulary, of size V, and we assume there exist T topics. There are two sets of categorical distributions: D distributions over T categories, each denoted µ_d, and T distributions over V categories, each denoted φ_t. The words of a document d are produced independently by sampling a topic t from µ_d, then sampling a word from φ_t. Furthermore, each distribution in µ is sampled using the same Dirichlet prior with parameters γ, and, similarly, each distribution in φ is sampled using β. Note that µ and φ correspond to the parameters θ in the general model, and γ and β correspond to α.

Assume that there is a corpus with 3 documents, 2 topics and a vocabulary of 4 words. The plate notation of the PB model of LDA is given in Figure 2. Let the first data point be the observation of the second word of the vocabulary in document 3. Then the associated boolean function is:

$$\mathrm{Bool}_1(v_{1*}) = v_{15}\,\overline{v}_{111}\,v_{112} + \overline{v}_{15}\,\overline{v}_{121}\,v_{122}$$

The literals $v_{15}$ and $\overline{v}_{15}$ denote the choice, in document 3, of topic 1 and 2, respectively, and the conjunctions $\overline{v}_{111} v_{112}$ and $\overline{v}_{121} v_{122}$ denote the choice of the second word from topic 1 and 2, respectively.
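To make the generative semantics of the plate model concrete, the following Python sketch (purely illustrative, not part of the PB implementation; all names are hypothetical) samples θ and x for a single outer-plate index a and shows how a categorical outcome is mapped to the propositional variables v.

import numpy as np

def encode_categorical(l, k):
    # Encode outcome l in {1, ..., k} as the k-1 booleans v_1, ..., v_{k-1}:
    # v_1, ..., v_{l-1} are false and v_l is true (when l = k, all are false).
    # Variables after position l are irrelevant to the encoding; we leave
    # them false here for simplicity.
    v = [False] * (k - 1)
    if l < k:
        v[l - 1] = True
    return v

def sample_plate(alpha, I, N, rng):
    # theta_{a,i} ~ Dirichlet(alpha_a), then x_{n,a,i} ~ Categorical(theta_{a,i})
    # for each of the N observations; categories are 1-based as in the text.
    k = len(alpha)
    theta = rng.dirichlet(alpha, size=I)
    x = [[int(rng.choice(k, p=theta[i])) + 1 for i in range(I)] for _ in range(N)]
    v = [[encode_categorical(x[n][i], k) for i in range(I)] for n in range(N)]
    return theta, x, v

rng = np.random.default_rng(0)
theta, x, v = sample_plate(alpha=[1.0, 1.0, 1.0, 1.0], I=2, N=5, rng=rng)

An observation f_n is then asserted by requiring that the boolean function Bool_n over the corresponding v evaluates to true.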
Note that, in Figure 2, even though all possible edges between deterministic nodes and f_n are drawn, not all the variables must affect the probability of f_n; for instance, the value of Bool_1(v_1*) does not depend on the value of v_13.

Fig. 2. The PB model for the LDA example (Dirichlet parameters β and γ; categorical parameters φ_1, φ_2 and µ_3, µ_4, µ_5; categorical variables y_n1, y_n2 and z_n3, z_n4, z_n5; propositional variables v_n11, v_n12, v_n13, v_n21, v_n22, v_n23 and v_n3, v_n4, v_n5; observations f_n, for n = 1, ..., N).

3 Syntax and Semantics

Having established the semantics of the PB model, we proceed to describe the syntax and semantics of PB programs, and show how they relate to the probabilistic model.

A PB program is an abductive logic program [11] enhanced with probabilistic predicates. The abductive logic program encodes the generative story of the model, as well as the observed data. The most important probabilistic predicates are pb_dirichlet and pb_plate. The former provides a way to declare the probability distributions of the model; the latter is a query mechanism: it iterates through the data and computes all the possible ways it could have been generated according to the model, enabling the application of probabilistic inference algorithms.

The probabilistic predicate pb_dirichlet specifies a set of categorical distributions with the same Dirichlet prior, i.e. the elements on the outer plate indexed by a in Figure 1. Therefore, a set of such predicates expresses the whole outer plate. The syntax of the predicate is pb_dirichlet(Alpha_a, Name, K_a, I_a). The first argument, Alpha_a, corresponds to α_a in the model, and can be either a list of k_a positive scalars specifying the parameters of the Dirichlet, or a positive scalar that specifies a symmetric prior. The second argument, Name, is an atom that will be used as a functor when calling a predicate that represents a realization of a categorical random variable on the a-th plate. The third argument, K_a, corresponds to k_a, and the fourth, I_a, corresponds to I_a, i.e. the number of categorical distributions having the same prior. The semantics of the predicate is that Name can be called in the program as a predicate, with the first argument denoting a category from 1, ..., k_a, and the second argument a distribution from 1, ..., I_a. In this paper, Name(K_a, I_a) is assumed to be a ground atom when called.

The probabilistic predicate pb_plate(OuterQuery, Count, InnerQuery) is the querying mechanism of PB. Informally, the first argument, OuterQuery, is a usual Prolog query that iterates through the data. It must not call any predicates defined by pb_dirichlet. The argument Count is a positive integer that indicates that a particular observation is observed Count times. The final argument, InnerQuery, is an abductive query that computes the explanation of the observations iterated in the OuterQuery. The formal semantics of pb_plate will be discussed after we introduce additional notation.

Example. Consider the LDA example from the previous section. Suppose we observe more data and we encode it in a PB program as:

observe(d(1),[(w(1),4),(w(4),2)]).
observe(d(2),[(w(3),1),(w(4),5)]).
observe(d(3),[(w(2),2)]).

Each observe fact encodes a document, indexed by an id using the first argument, and consisting of a bag-of-words in the second argument. The bag-of-words is a list of pairs: word index and its (positive) count per document.
The next part of the PB program specifies the probability distributions as the following facts: pb_dirichlet(1.0, mu, 2, 3) and pb_dirichlet(1.0, phi, 4, 2). The pb_plate query iterates through document and word indexes and "explains" each such pair using the generate predicate. Note that observe and generate are not keywords, but descriptive conventional names.

pb_plate(
    [observe(d(Doc), TokenList),
     member((w(Token), Count), TokenList)],
    Count,
    [generate(Doc, Token)]
).

generate(Doc, Token) :-
    Topic in 1..2,
    mu(Topic, Doc),
    phi(Token, Topic).

Having described the core syntax of PB programs and its relation to the probabilistic model, we explain what the result of executing a PB program is and how probabilistic inference is performed to estimate P(θ|f, x, v; α), or, in more typical applications, to produce an estimate θ̂.

In a traditional abductive logic programming setting [11,12,16], the result of evaluating a query is a list of abductive solutions, and an abductive solution is a list of abducibles. An abducible is a predicate that has no definition in the program. Some systems [20,23] represent an abductive solution as a pair of lists: a list of positive abducibles, i.e. abducibles that must be true, and a list of negative abducibles, i.e. abducibles that must be false. Since PB queries are abductive queries, an identical representation is obtained if the predicates defined by pb_dirichlet are parsed as annotated disjunctions [20].

Example. Consider the LDA example, more specifically the probabilistic predicate defining φ: pb_dirichlet(1.0, phi, 4, 2). Let idx denote a positive integer such that pa(Idx) is a new abducible w.r.t. previously parsed pb_dirichlet predicates, i.e. the one defining µ. Abusing notation, idx + incr is denoted by Idx+incr. The corresponding annotated disjunction is shown below:

phi(1,1) :- pa(Idx).
phi(2,1) :- \+pa(Idx), pa(Idx+1).
phi(3,1) :- \+pa(Idx), \+pa(Idx+1), pa(Idx+2).
phi(4,1) :- \+pa(Idx), \+pa(Idx+1), \+pa(Idx+2).

A similar annotated disjunction is produced for phi(Token,2). Note that pa(Idx), pa(Idx+1) and pa(Idx+2) represent v_n11, v_n12 and v_n13 in Figure 2.
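Note that the probabilities attached to the abducibles pa(Idx), pa(Idx+1), ... are not the categorical parameters θ themselves: for the sequential encoding above to reproduce Categorical(θ), each abducible must carry the probability of its category conditioned on all earlier categories having been rejected. The Python sketch below (our own illustration, not PB syntax) shows this conversion, which is one natural reading of the re-parametrization referred to in the description of the sampler below.

def categorical_to_sequential(theta):
    # Convert categorical parameters theta_1, ..., theta_k into the k-1
    # probabilities of the sequential choices pa(Idx), pa(Idx+1), ... so that
    # P(category l) = theta_l under the annotated-disjunction encoding.
    probs, remaining = [], 1.0
    for theta_l in theta[:-1]:
        probs.append(theta_l / remaining)  # P(pick l | categories < l rejected)
        remaining -= theta_l
    return probs

print(categorical_to_sequential([0.2, 0.3, 0.4, 0.1]))
# [0.2, 0.375, 0.8]: P(1) = 0.2, P(2) = 0.8 * 0.375 = 0.3,
# P(3) = 0.8 * 0.625 * 0.8 = 0.4, P(4) = 0.8 * 0.625 * 0.2 = 0.1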
This encoding generates abductive representations of linear size in the number of categories of the distribution. In common LDA tasks, the size of the vocabulary of a corpus is frequently (much) more than 10,000 words, making it difficult to represent abductive solutions efficiently in the traditional way. For the above reasons, PB uses a different representation of an abductive solution: a list of tuples (a, i, l), where a and i index the distribution as in the previous section, and l is a category, 1 ≤ l ≤ k_a. In the rest of the paper, the term "abductive solution" denotes this representation. The result of a PB query is a list of abductive solutions, and the result of calling pb_plate is a list of PB query results, one for each solution to OuterQuery. For the moment, consider programs with exactly one pb_plate definition. A generalization is presented once the necessary notation has been introduced.

Example. In the LDA program, the OuterQuery simply grounds Doc and Token in the order they are specified, e.g. the first grounding is (1, 1), the second is (1, 4), the third (2, 3), etc. Assuming that square brackets represent lists, the result of pb_plate is shown below:

[
  [ [(1,1,1),(2,1,1)], [(1,1,2),(2,2,1)] ],
  [ [(1,1,1),(2,1,4)], [(1,1,2),(2,2,4)] ],
  [ [(1,2,1),(2,1,3)], [(1,2,2),(2,2,3)] ],
  [ [(1,2,1),(2,1,4)], [(1,2,2),(2,2,4)] ],
  [ [(1,3,1),(2,1,2)], [(1,3,2),(2,2,2)] ]
]

The observation of word 2 in document 3 is the last element of the big list, and it can be produced using either topic 1 or topic 2, hence the two lists representing abductive solutions. The tuple (1, 3, 1) means we choose topic 1 (last element) in document 3 (first two elements), and (2, 1, 2) means we choose word 2 in topic 1 (with the same remarks).

Each result of a PB query is then parsed into a boolean formula which corresponds to Bool_n(v_n*) from the previous section. The key feature of PB is that it implicitly assumes that every result of a PB query on a pb_plate produces the same formula. This allows more concise query definitions, as well as improved time and memory performance, as a trade-off with the user's expertise in PB. If the user were agnostic, she would write a pb_plate predicate for each data point (thus making the OuterQuery trivial). In future work we plan to investigate how to automatically partition the dataset into pb_plate definitions in an efficient way.

Example. The formula for the LDA example is $v_0 v_1 + \overline{v}_0 v_2$. Notice it is different from the one in Section 2, because the indexes have no semantic meaning w.r.t. the plate model, and the annotated disjunction is compiled w.r.t. the choices present in the same PB query, rather than the whole sample space.

The formula is compiled into a reduced ordered binary decision diagram (ROBDD; in the rest of the paper the RO attributes are implicit) [1,3], with the variables in ascending order according to their index. This means that the order of pb_dirichlet predicates matters, and that it should correspond to the sampling order in the generative story of the model. The BDD is the key data structure that is used for probabilistic inference. The inference algorithm we use is an adaptation of Ishihata and Sato's Gibbs sampling for PLP models [10]. The algorithm is uncollapsed Gibbs sampling along two dimensions (θ and x); by contrast, the original Gibbs sampling for LDA [9] is collapsed Gibbs sampling along a number of dimensions equal to the number of words in the corpus. The difference between the original inference algorithm and the PB one is that instead of sampling from a Bernoulli, we sample from a multinomial with Count trials. We also sample all the (identical) BDDs for a pb_plate at once, using a single BDD, making sure to stop sampling a node when it is not sampled in any of the implicit BDDs. Furthermore, sampling θ is followed by a re-parametrization such that the probabilities of the boolean variables in the BDD correspond to the new θ. The generalization to multiple pb_plate predicates is straightforward: we sample each BDD, corresponding to one pb_plate, in turn; all the samples update a common data structure representing x, and the probabilities of the boolean variables in each BDD are re-parametrized to adjust to the sampled θ.

In principle, it is possible to use the learned θ, or, to be more Bayesian, the posterior parameters of the Dirichlet α′, to perform "forward" inference in a PB program: if we freeze θ, then the backward probability of the BDD yields the estimated parameter of a new observation. Otherwise, we sample x and θ using α′ as priors, record the backward probabilities of the BDD, then output the average thereof.
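To illustrate the probabilistic evaluation step, the sketch below (a simplified, hypothetical representation in Python; the actual implementation relies on PyCUDD/CUDD, see Appendix A) computes the backward probability of a BDD, i.e. the probability that the compiled boolean formula is satisfied under the current probabilities of its boolean variables. Roughly speaking, the Gibbs step for x samples a path through the diagram in proportion to these quantities, cf. [10].

# A node is the terminal True/False or a tuple (var, low, high), where
# 'low'/'high' are the sub-diagrams for var = 0 and var = 1 respectively.
def backward_prob(node, p):
    # Probability that the formula rooted at `node` evaluates to true,
    # where p[var] is the current probability of boolean variable `var`.
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    var, low, high = node
    return (1.0 - p[var]) * backward_prob(low, p) + p[var] * backward_prob(high, p)

# The formula v0 v1 + ~v0 v2 from the LDA example, as an ordered BDD:
bdd = (0, (2, False, True), (1, False, True))
print(backward_prob(bdd, {0: 0.5, 1: 0.25, 2: 0.25}))  # 0.5*0.25 + 0.5*0.25 = 0.25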
4 Evaluation

In this section we present experiments with PB (see Appendix A for details on implementation and software availability). No burn-in or lag was used in the experiments.

PB and collapsed Gibbs sampling (CGS) for LDA on synthetic data. We run a variation of the experiment performed in [9,10] (for details on the likelihood formulation and a comparison with the Church PPL, see Appendices B and C). A synthetic corpus is generated from an LDA model with the following parameters: 25 words in the vocabulary, 10 topics, 1000 documents, 100 words per document, and a symmetric prior γ = 1 on the mixture of topics µ. The topics used as ground truth specify uniform probabilities over 5 words, cf. [9,10]. We evaluate the convergence of PB and of a traditional collapsed Gibbs sampling implementation (we use the topicmodels R package). The hyper-parameters are β = γ = 1, and we run 200 iterations of the samplers. The experiments are run 10 times over each corpus from a set of 10 identically sampled corpora, yielding 100 values of the log likelihood per iteration. The average and 95% confidence interval (under a normal distribution) per iteration are shown in Figure 3.

Fig. 3. Comparison between PB and collapsed Gibbs sampling on 10 sampled synthetic corpora (10 runs per corpus). [Plot of average log likelihood and 95% confidence interval (normal) against iterations, for PB and Gibbs.]

The experiment confirms the conclusions of the LDA experiment in [10]: both sampling algorithms converge, albeit PB converges more slowly than CGS. However, we report different values for the log likelihood, and note that PB takes 200 iterations rather than 100 to converge to a value that is close, under usual statistical assumptions, to the one produced by CGS.
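For reference, the shape of the synthetic data can be conveyed by the following Python sketch. The precise ground-truth topics follow [9,10]; here we only assume the usual construction in which the 25 words are arranged on a 5 × 5 grid and each of the 10 topics is uniform over one row or one column, so treat the details as an assumption rather than the exact setup.

import numpy as np

def synthetic_lda_corpus(n_docs=1000, doc_len=100, gamma=1.0, seed=0):
    # Ground-truth topics: rows and columns of a 5x5 word grid, each uniform
    # over its 5 words; per-document topic mixtures from a symmetric Dirichlet.
    rng = np.random.default_rng(seed)
    grid = np.arange(25).reshape(5, 5)
    rows_and_cols = [grid[r, :] for r in range(5)] + [grid[:, c] for c in range(5)]
    phi = np.zeros((10, 25))
    for t, words in enumerate(rows_and_cols):
        phi[t, words] = 0.2
    docs = []
    for _ in range(n_docs):
        mu = rng.dirichlet([gamma] * 10)            # gamma = 1 in the experiment
        z = rng.choice(10, size=doc_len, p=mu)      # one topic per token
        docs.append([int(rng.choice(25, p=phi[t])) for t in z])
    return docs, phi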
PB for RIM on the Sushi dataset. A repeated insertion model (RIM, [5]) provides a recursive and compact representation of K probability distributions, called preference profiles, over the set of all permutations of M items. Intuitively, this captures K different types of people with similar preferences. We evaluate a variant of the repeated insertion model in an experiment inspired by [14], on a dataset published in [13]. The data consists of 5000 permutations over M = 10 sushi ingredients, each permutation expressing the preferences of a surveyed person. Following [14], we use K = 6 preference profiles; however, we use the RIM rather than a Mallows model, and we train on the whole dataset. The parameters of the model are a symmetric prior of 50/K for the mixture of profiles, and a symmetric prior of 0.1 for all categorical distributions in all profiles. We run PB 10 times with 100 iterations and average the parameters. For each categorical distribution in a profile, we select its maximum likelihood realization to build the corresponding maximum likelihood preference profile, shown in Table 1.

Table 1. Maximum likelihood preference profiles and mixture parameters on the Sushi dataset (items listed from most to least preferred).

Profile 1 (π1 = 0.144): fatty tuna, shrimp, sea eel, squid, tuna, tuna roll, salmon roe, sea urchin, egg, cucumber roll
Profile 2 (π2 = 0.191): fatty tuna, tuna, shrimp, squid, sea eel, tuna roll, egg, cucumber roll, salmon roe, sea urchin
Profile 3 (π3 = 0.153): fatty tuna, sea urchin, salmon roe, sea eel, tuna, shrimp, tuna roll, squid, egg, cucumber roll
Profile 4 (π4 = 0.185): fatty tuna, sea urchin, salmon roe, shrimp, sea eel, squid, tuna roll, tuna, egg, cucumber roll
Profile 5 (π5 = 0.187): fatty tuna, tuna, shrimp, sea eel, squid, tuna roll, egg, salmon roe, cucumber roll, sea urchin
Profile 6 (π6 = 0.138): fatty tuna, tuna, salmon roe, shrimp, squid, sea eel, tuna roll, sea urchin, egg, cucumber roll

The inference yields similar conclusions to [14]: there is a strong preference for fatty tuna, a strong dislike of cucumber roll, and a strong positive correlation between salmon roe and sea urchin. We show the PB program used in Table 2, noting that pb_call/1 is a special PB predicate that allows the evaluation of its argument as a predicate defined by pb_dirichlet.

Table 2. PB program for a RIM with K = 6 preference profiles.

observe([5,0,3,4,6,9,8,1,7,2]).
observe([0,9,6,3,7,2,8,1,5,4]).
% ... 4998 'observe' facts omitted

pb_dirichlet(8.33333333333, pi, 6, 1).
pb_dirichlet(0.1, p2, 2, 6).
pb_dirichlet(0.1, p3, 3, 6).
pb_dirichlet(0.1, p4, 4, 6).
pb_dirichlet(0.1, p5, 5, 6).
pb_dirichlet(0.1, p6, 6, 6).
pb_dirichlet(0.1, p7, 7, 6).
pb_dirichlet(0.1, p8, 8, 6).
pb_dirichlet(0.1, p9, 9, 6).
pb_dirichlet(0.1, p10, 10, 6).

pb_plate(
    [observe(Sample)],
    1,
    [generate([0,1,2,3,4,5,6,7,8,9], Sample)]
).

generate([H|T], Sample) :-
    K in 1..6,
    pi(K, 1),
    generate(T, Sample, [H], 2, K).

generate([], Sample, Sample, _Idx, _K).
generate([ToIns|T], Sample, Ins, Idx, K) :-
    % insert next element at Pos yielding a new list Ins1
    append(_, [ToIns|Rest], Sample),
    insert_rim(Rest, ToIns, Ins, Pos, Ins1),
    % build prob predicate in Pred
    number_chars(Idx, LIdx),
    append(['p'], LIdx, LF),
    atom_chars(F, LF),
    Pred =.. [F, Pos, K],
    % call prob predicate and recurse
    pb_call(Pred),
    Idx1 is Idx+1,
    generate(T, Sample, Ins1, Idx1, K).

insert_rim([], ToIns, Ins, Pos, Ins1) :-
    append(Ins, [ToIns], Ins1),
    length(Ins1, Pos).
insert_rim([H|_T], ToIns, Ins, Pos, Ins1) :-
    nth1(Pos, Ins, H),
    nth1(Pos, Ins1, ToIns, Ins).
insert_rim([H|T], ToIns, Ins, Pos, Ins1) :-
    \+member(H, Ins),
    insert_rim(T, ToIns, Ins, Pos, Ins1).

We are not aware of any other implementation of RIM in a PPL, therefore we briefly describe the program. The mixture of profiles is characterized by π, a distribution over the K profiles, and for each profile there are M − 1 categorical distributions that together specify the probabilities over the set of permutations of M elements. An observed permutation is produced by selecting a latent profile, then generating that permutation by consecutively inserting elements from an insertion order, e.g. [0, 1, ..., 9], at the right position, according to the distributions in that profile. The right position is chosen using the insert_rim predicate, as naively generating all the possible permutations is intractable.
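To clarify the generative story that the program encodes, here is a small Python sketch of the repeated insertion process for one observation (illustrative only, independent of the PB implementation): a profile is drawn from π, the first item of the reference order is placed, and each subsequent item is inserted at a position drawn from the profile's corresponding insertion distribution (the predicates p2, ..., p10 above). Positions are 0-based in the sketch, whereas the program uses 1-based positions.

import numpy as np

def sample_rim_ranking(insertion_probs, reference, rng=None):
    # insertion_probs[m] has length m + 2 (for m = 0, ..., M-2): probabilities
    # of inserting item reference[m+1] at each currently available position.
    rng = rng or np.random.default_rng()
    ranking = [reference[0]]
    for m, item in enumerate(reference[1:]):
        pos = int(rng.choice(len(ranking) + 1, p=insertion_probs[m]))
        ranking.insert(pos, item)
    return ranking

# A hypothetical profile over 3 items: both insertions are uniform at random.
probs = [np.array([0.5, 0.5]), np.array([1/3, 1/3, 1/3])]
print(sample_rim_ranking(probs, reference=[0, 1, 2]))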
5 Related Work and Conclusions

In this paper we introduced PB, a probabilistic logic programming language for categorical models with Dirichlet priors. The idea of PB was sparked by [10], which defines a similar class of probabilistic models and provides a Gibbs sampling algorithm for BDDs. However, BDDs are not a programming language, nor are they an intuitive representation for a non-expert. This paper bridges the gap between logical and probabilistic inference in the considered class of models, and addresses issues concerning the representation of abductive solutions and inference on "syntactically" identical BDDs. Similarly to ProbLog [7], the pipeline of PB can be described as logical inference, followed by knowledge compilation, followed by probabilistic evaluation. Unlike ProbLog, the most difficult task in PB is probabilistic evaluation, rather than knowledge compilation, though for complex programs PB could benefit from using more compact decision diagrams. In relation to Church [8] and many other related PPLs, PB is similar in that it uses a Turing-complete declarative language, but the set of probabilistic primitives available in PB is very restricted compared to Church. PB uses a different probabilistic model than Alchemy [6], and by using abductive logic programming instead of a first-order knowledge base, it can easily encode recursive generative models, such as RIM. Although in this paper we present well-studied models, they can easily be adapted to include various constraints, e.g. seed words in LDA. In future work, we hope to explore more probabilistic models that fit the PB paradigm, and to design, implement, and compare efficient algorithms for generalized probabilistic inference.

References

1. Akers, S.B.: Binary decision diagrams. IEEE Transactions on Computers 27(6), 509–516 (Jun 1978), http://dx.doi.org/10.1109/TC.1978.1675141
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
3. Bryant, R.: Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers C-35(8), 677–691 (Aug 1986)
4. Buntine, W.L.: Operations for learning with graphical models. Journal of Artificial Intelligence Research (JAIR) 2, 159–225 (1994), http://dx.doi.org/10.1613/jair.62
5. Doignon, J.P., Pekeč, A., Regenwetter, M.: The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika 69(1), 33–54 (2004), http://dx.doi.org/10.1007/BF02295838
6. Domingos, P., Kok, S., Poon, H., Richardson, M., Singla, P.: Unifying logical and statistical AI. In: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1. pp. 2–7. AAAI'06, AAAI Press (2006), http://dl.acm.org/citation.cfm?id=1597538.1597540
7. Fierens, D., Van den Broeck, G., Renkens, J., Shterionov, D.S., Gutmann, B., Thon, I., Janssens, G., De Raedt, L.: Inference and learning in probabilistic logic programs using weighted boolean formulas. CoRR abs/1304.6810 (2013), http://arxiv.org/abs/1304.6810
8. Goodman, N.D., Mansinghka, V.K., Roy, D.M., Bonawitz, K., Tenenbaum, J.B.: Church: a language for generative models. In: Uncertainty in Artificial Intelligence 2008 (2008), http://danroy.org/papers/church_GooManRoyBonTen-UAI-2008.pdf
9. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1), 5228–5235 (2004), http://www.pnas.org/content/101/suppl_1/5228.abstract
10. Ishihata, M., Sato, T.: Bayesian inference for statistical abduction using Markov chain Monte Carlo. In: Proceedings of the 3rd Asian Conference on Machine Learning, ACML 2011, Taoyuan, Taiwan, November 13-15, 2011. pp. 81–96 (2011), http://www.jmlr.org/proceedings/papers/v20/ishihata11/ishihata11.pdf
11. Kakas, A.C., Kowalski, R.A., Toni, F.: Abductive logic programming (1993)
12. Kakas, A.C., Van Nuffelen, B., Denecker, M.: A-System: Problem solving through abduction. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. vol. 1, pp. 591–596. Morgan Kaufmann Publishers, Inc. (2001), http://www.cs.kuleuven.ac.be/cgi-bin-dtai/publ_info.pl?id=34862
13. Kamishima, T., Kazawa, H., Akaho, S.: Supervised ordering - an empirical survey. In: Fifth IEEE International Conference on Data Mining (Nov 2005)
14. Lu, T., Boutilier, C.: Effective sampling and learning for Mallows models with pairwise-preference data. Journal of Machine Learning Research 15, 3783–3829 (2014), http://jmlr.org/papers/v15/lu14a.html
15. Lunn, D., Spiegelhalter, D., Thomas, A., Best, N.: The BUGS project: Evolution, critique and future directions. Statistics in Medicine 28(25), 3049–3067 (2009), http://dx.doi.org/10.1002/sim.3680
16. Ma, J.: Abductive reasoning module for SICStus Prolog (2012), http://www-dse.doc.ic.ac.uk/cgi-bin/moin.cgi/abduction
17. Milch, B., Marthi, B., Russell, S.: BLOG: Relational modeling with unknown objects. In: ICML 2004 Workshop on Statistical Relational Learning and Its Connections. pp. 67–73 (2004)
18. Paige, B., Wood, F.: A compilation target for probabilistic programming languages. CoRR abs/1403.0504 (2014), http://arxiv.org/abs/1403.0504
19. Pfeffer, A.: Figaro: An object-oriented probabilistic programming language (2009)
20. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its application in link discovery. In: Veloso, M.M. (ed.) IJCAI. pp. 2462–2467 (2007), http://dblp.uni-trier.de/db/conf/ijcai/ijcai2007.html#RaedtKT07
21. Sato, T., Kameya, Y.: New advances in logic-based probabilistic modeling by PRISM. In: De Raedt, L., Frasconi, P., Kersting, K., Muggleton, S. (eds.) Probabilistic Inductive Logic Programming, Lecture Notes in Computer Science, vol. 4911, pp. 118–155. Springer Berlin Heidelberg (2008), http://dx.doi.org/10.1007/978-3-540-78652-8_5
22. Stan Development Team: Stan Modeling Language Users Guide and Reference Manual, Version 2.5.0 (2014), http://mc-stan.org/
23. Turliuc, C., Maimari, N., Russo, A., Broda, K.: On minimality and integrity constraints in probabilistic abduction. In: Logic for Programming, Artificial Intelligence, and Reasoning - 19th International Conference, LPAR-19, Stellenbosch, South Africa, December 14-19, 2013. Proceedings. pp. 759–775 (2013), http://dx.doi.org/10.1007/978-3-642-45221-5_51

A Implementation

PB is implemented in YAP and Python (2.7), and is currently available as a command-line script. YAP is used to parse input files and produce files for probabilistic inference (e.g. solutions to each pb_plate query, information on the probability distributions). PyCUDD is used to compile ROBDDs, and computationally intensive parts of the sampling algorithm are implemented in Cython. This prototype implementation and any additional files are released under a GNU General Public License (GPL3).

For more information and documentation see: http://raresct.github.io/peircebayes
To access the source code see: http://www.github.com/raresct/peircebayes
To reproduce the experiments, more concretely Figure 3 and Table 1, see: http://www.github.com/raresct/peircebayes_experiments

On an Intel® Core™ i7-4710HQ CPU @ 2.50GHz × 8, the LDA experiment took ≈ 265 minutes for PB and ≈ 4 minutes for CGS, and the RIM experiment took ≈ 10 minutes.
Note that there is significant overhead for PB because we do not measure only sampling time, but also logical inference and knowledge compilation.

B A Note on the Joint Distribution of the PB Model and Likelihood for LDA

The joint distribution of collapsed PB models is:

$$P(f, v, x; \alpha) = P(f \mid v)\, P(v \mid x)\, P(x; \alpha)$$

If x is the result of sampling the appropriate BDDs, then P(f|v) P(v|x) = 1, and the joint distribution reduces to P(x; α). This type of distribution has been well studied, cf. equations 2 and 3 in [9] for LDA, and in the case of PB models it is:

$$P(x; \alpha) = \prod_{a=1}^{A} \left( \frac{\Gamma\!\left(\sum_{l=1}^{k_a} \alpha_{al}\right)}{\prod_{l=1}^{k_a} \Gamma(\alpha_{al})} \right)^{I_a} \prod_{i=1}^{I_a} \frac{\prod_{l=1}^{k_a} \Gamma\!\left(\sum_{n=1}^{N} [x_{nai} = l] + \alpha_{al}\right)}{\Gamma\!\left(\sum_{n=1}^{N} \sum_{l=1}^{k_a} [x_{nai} = l] + \sum_{l=1}^{k_a} \alpha_{al}\right)}$$

The likelihood for LDA is recovered by using the factors of the joint distribution involving only β, φ, y.

C PB and Church on LDA

In Section 4 we compare PB with CGS on a synthetic LDA task. We add to the comparison a much more expressive, universal PPL called Church [8]. The experimental setting differs from the LDA experiment in Section 4 in that we sample only one synthetic corpus of 100 documents, due to the fact that the Church implementation of LDA is slow. Furthermore, we take 300 samples (no lag, no burn-in for PB; 10 lag, no burn-in for Church). We use two implementations of LDA in Church (adaptations of the two programs shown at http://forestdb.org/models/lda.html, run as command-line scripts with webchurch, https://github.com/probmods/webchurch), and report the results in Figure 4. Note that we use the uncollapsed likelihood for the Church models (which is more "optimistic" than the collapsed one), mainly because we were unable to find an implementation of the log Γ function in Church.

Fig. 4. Comparison between PB and Church on one sampled synthetic corpus (10 runs per corpus). [Plot of average log likelihood ± standard error against iterations, for PB, Church1 and Church2.]

Neither Church LDA program seems to converge, while PB behaves consistently with the previous experiment. The average time per run is ≈ 0.36 minutes for PB, ≈ 13.5 minutes for Church1 and ≈ 16.7 minutes for Church2.