EM over Binary Decision Diagrams for Probabilistic Logic Programs

Elena Bellodi and Fabrizio Riguzzi
ENDIF – Università di Ferrara – Via Saragat, 1 – 44122 Ferrara, Italy.
{elena.bellodi,fabrizio.riguzzi}@unife.it

Abstract. Recently much work in Machine Learning has concentrated on representation languages able to combine aspects of logic and probability, leading to the birth of a whole field called Statistical Relational Learning. In this paper we present a technique for parameter learning targeted to a family of formalisms where uncertainty is represented using Logic Programming techniques - the so-called Probabilistic Logic Programs such as ICL, PRISM, ProbLog and LPAD. Since their equivalent Bayesian networks contain hidden variables, an EM algorithm is adopted. In order to speed up the computation, expectations are computed directly on the Binary Decision Diagrams that are built for inference. The resulting system, called EMBLEM for “EM over Bdds for probabilistic Logic programs Efficient Mining”, has been applied to a number of datasets and showed good performance both in terms of speed and memory usage.

Keywords: Statistical Relational Learning, Probabilistic Logic Programming, Distribution Semantics, Logic Programs with Annotated Disjunctions, Expectation Maximization

1 Introduction

Machine Learning has seen the development of the field of Statistical Relational Learning (SRL), where logical-statistical languages are used in order to learn effectively in complex domains involving relations and uncertainty. These techniques have been successfully applied to social network analysis, entity recognition, collective classification and information extraction, to name a few.

Similarly, a large number of works in Logic Programming have attempted to combine logic and probability, among which the distribution semantics [21] is a prominent approach. This semantics underlies, for example, PRISM [21], the Independent Choice Logic [14], Logic Programs with Annotated Disjunctions (LPADs) [29], ProbLog [4] and CP-logic [27]. The approach is particularly appealing because efficient inference algorithms have appeared [4,17], which adopt Binary Decision Diagrams (BDDs).

In this paper we present the EMBLEM system for “EM over Bdds for probabilistic Logic programs Efficient Mining” [1], which learns the parameters of probabilistic logic programs under the distribution semantics by using an Expectation Maximization (EM) algorithm. Such an algorithm is a popular tool in statistical estimation problems involving incomplete data: it is an iterative method to estimate some unknown parameters Θ of a model, given a dataset where some of the data is missing. The aim is to find maximum likelihood or maximum a posteriori (MAP) estimates of Θ [13]. EM alternates between performing an expectation (E) step, where the missing data are estimated given the observed data and the current estimate of the model parameters, and a maximization (M) step, which computes the parameters maximizing the likelihood of the data given the sufficient statistics computed in the E step. The translation of the probabilistic programs into graphical models requires the use of hidden variables (see Section 3) and therefore of EM: the main characteristic of our system is the computation of the values of the expectations using BDDs. Since there are transformations with linear complexity that can convert a program in a language under the distribution semantics into the others [28], we will use LPADs for their general syntax.
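Schematically, the EM scheme recalled above amounts to the following loop. This is a generic, purely illustrative sketch (all function names are ours); it is not EMBLEM's implementation, which is described in Section 3.

# Generic EM skeleton (illustrative only). theta holds the unknown parameters;
# e_step estimates the expected sufficient statistics of the hidden variables
# given the data and the current theta; m_step turns those statistics into new
# parameter values; loglik is the log-likelihood used as stopping criterion.
def em(data, theta, e_step, m_step, loglik, epsilon=1e-4):
    prev_ll = float('-inf')
    while True:
        stats = e_step(data, theta)    # E step: expected counts
        theta = m_step(stats)          # M step: maximize given the counts
        ll = loglik(data, theta)
        if ll - prev_ll < epsilon:     # converged to a (local) maximum
            return theta
        prev_ll = ll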
EMBLEM has been tested on the IMDB, Cora and UW-CSE datasets and compared with RIB [20], LeProbLog [4], Alchemy [15] and CEM, an implementation of EM based on [17].

The paper is organized as follows. Section 2 presents LPADs and Section 3 describes EMBLEM. Section 4 discusses related works. Section 5 shows the results of the experiments performed and Section 6 concludes the paper.

2 Logic Programs with Annotated Disjunctions

Formally a Logic Program with Annotated Disjunctions [29] consists of a finite set of annotated disjunctive clauses. An annotated disjunctive clause $C_i$ is of the form

$h_{i1} : \Pi_{i1} ; \ldots ; h_{in_i} : \Pi_{in_i} :- b_{i1}, \ldots, b_{im_i}.$

In such a clause $h_{i1}, \ldots, h_{in_i}$ are logical atoms and $b_{i1}, \ldots, b_{im_i}$ are logical literals; $\Pi_{i1}, \ldots, \Pi_{in_i}$ are real numbers in the interval $[0,1]$ such that $\sum_{k=1}^{n_i} \Pi_{ik} \leq 1$. $b_{i1}, \ldots, b_{im_i}$ is called the body and is indicated with $body(C_i)$. Note that if $n_i = 1$ and $\Pi_{i1} = 1$ the clause corresponds to a non-disjunctive clause. If $\sum_{k=1}^{n_i} \Pi_{ik} < 1$, the head of the annotated disjunctive clause implicitly contains an extra atom $null$ that does not appear in the body of any clause and whose annotation is $1 - \sum_{k=1}^{n_i} \Pi_{ik}$. We denote by $ground(T)$ the grounding of an LPAD $T$.

An atomic choice is a triple $(C_i, \theta_j, k)$ where $C_i \in T$, $\theta_j$ is a substitution that grounds $C_i$ and $k \in \{1, \ldots, n_i\}$. $(C_i, \theta_j, k)$ means that, for the ground clause $C_i\theta_j$, the head $h_{ik}$ was chosen. In practice $C_i\theta_j$ corresponds to a random variable $X_{ij}$ and an atomic choice $(C_i, \theta_j, k)$ to an assignment $X_{ij} = k$. A set of atomic choices $\kappa$ is consistent if $(C, \theta, i) \in \kappa, (C, \theta, j) \in \kappa \Rightarrow i = j$, i.e., only one head is selected for a ground clause. A composite choice $\kappa$ is a consistent set of atomic choices. The probability $P(\kappa)$ of a composite choice $\kappa$ is the product of the probabilities of the individual atomic choices, i.e. $P(\kappa) = \prod_{(C_i, \theta_j, k) \in \kappa} \Pi_{ik}$.

A selection $\sigma$ is a composite choice that, for each clause $C_i\theta_j$ in $ground(T)$, contains an atomic choice $(C_i, \theta_j, k)$. We denote the set of all selections $\sigma$ of a program $T$ by $S_T$. A selection $\sigma$ identifies a normal logic program $w_\sigma$ defined as $w_\sigma = \{(h_{ik} \leftarrow body(C_i))\theta_j \mid (C_i, \theta_j, k) \in \sigma\}$. $w_\sigma$ is called a world of $T$. Since selections are composite choices we can assign a probability to possible worlds: $P(w_\sigma) = P(\sigma) = \prod_{(C_i, \theta_j, k) \in \sigma} \Pi_{ik}$.

We consider only sound LPADs, in which every possible world has a total well-founded model. In the following we write $w_\sigma \models Q$ to mean that the query $Q$ is true in the well-founded model of the program $w_\sigma$. The probability of a query $Q$ according to an LPAD $T$ is given by

$P(Q) = \sum_{\sigma \in E(Q)} P(\sigma) \qquad (1)$

where we define $E(Q)$ as $\{\sigma \in S_T \mid w_\sigma \models Q\}$, the set of selections corresponding to worlds where the query is true.

To reduce the computational cost of answering queries in our experiments, random variables can be directly associated to clauses rather than to their ground instantiations: atomic choices then take the form $(C_i, k)$, meaning that head $h_{ik}$ is selected from program clause $C_i$, i.e., that $X_i = k$.

Example 1. The following LPAD T encodes a very simple model of the development of an epidemic or pandemic:

C1 = epidemic : 0.6 ; pandemic : 0.3 :- flu(X), cold.
C2 = cold : 0.7.
C3 = flu(david).
C4 = flu(robert).

Clause C1 has two groundings, $C_1\theta_1$ with $\theta_1 = \{X/david\}$ and $C_1\theta_2$ with $\theta_2 = \{X/robert\}$, so there are two random variables $X_{11}$ and $X_{12}$; $C_2$ has only one grounding, associated to the variable $X_{21}$.
$X_{11}$ and $X_{12}$ have three values since $C_1$ has three head atoms (epidemic, pandemic, null); similarly $X_{21}$ has two values since $C_2$ has two head atoms (cold, null).

The worlds in which a query is true can be represented using a Multivalued Decision Diagram (MDD) [25]. An MDD represents a function $f(\mathbf{X})$ taking Boolean values on a set of multivalued variables $\mathbf{X}$ by means of a rooted graph that has one level for each variable. Each node is associated to the variable of its level and has one child for each possible value of the variable. The leaves store either 0 or 1. Given values for all the variables $\mathbf{X}$, we can compute the value of $f(\mathbf{X})$ by traversing the graph starting from the root and returning the value associated to the leaf that is reached. An MDD can be used to represent the set $E(Q)$ by considering the multivalued variables $X_{ij}$ associated to the ground clauses $C_i\theta_j$ of $ground(T)$. $X_{ij}$ has values $\{1, \ldots, n_i\}$ and the atomic choice $(C_i, \theta_j, k)$ corresponds to the propositional equation $X_{ij} = k$. If we represent with an MDD the function

$f(\mathbf{X}) = \bigvee_{\sigma \in E(Q)} \bigwedge_{(C_i, \theta_j, k) \in \sigma} X_{ij} = k$

then the MDD will have a path to a 1-leaf for each world where $Q$ is true. While building MDDs, simplification operations can be applied that delete or merge nodes. Merging is performed when the diagram contains two identical sub-diagrams, while deletion is performed when all arcs from a node point to the same node. In this way a reduced MDD is obtained with respect to a Multivalued Decision Tree (MDT), i.e., an MDD in which every node has a single parent, all the children belong to the level immediately below and all the variables have at least one node. For example, the reduced MDD corresponding to the query epidemic from Example 1 is shown in Figure 1(a). The labels on the edges represent the values of the variable associated to the node.

[Fig. 1. Decision diagrams for Example 1: (a) MDD; (b) BDD.]

It is often unfeasible to find all the worlds where the query is true, so inference algorithms find instead explanations for it, i.e. composite choices such that the query is true in all the worlds whose selections are a superset of them. Explanations, differently from possible worlds, are not necessarily mutually exclusive with respect to each other. However, MDDs split paths on the basis of the values of a variable, so the branches are mutually disjoint and a dynamic programming algorithm can be applied for computing the probability.

Most packages for the manipulation of decision diagrams are however restricted to work on Binary Decision Diagrams, i.e., decision diagrams where all the variables are Boolean. A node $n$ in a BDD has two children: the 1-child, indicated with $child_1(n)$, and the 0-child, indicated with $child_0(n)$. The 0-branch, the one going to the 0-child, is drawn with a dashed line.

To work on MDDs with a BDD package we must represent multivalued variables by means of binary variables. For a multivalued variable $X_{ij}$, corresponding to ground clause $C_i\theta_j$ and having $n_i$ values, we use $n_i - 1$ Boolean variables $X_{ij1}, \ldots, X_{ijn_i-1}$ and we represent the equation $X_{ij} = k$ for $k = 1, \ldots, n_i - 1$ by means of the conjunction $\overline{X_{ij1}} \wedge \overline{X_{ij2}} \wedge \ldots \wedge \overline{X_{ijk-1}} \wedge X_{ijk}$, and the equation $X_{ij} = n_i$ by means of the conjunction $\overline{X_{ij1}} \wedge \overline{X_{ij2}} \wedge \ldots \wedge \overline{X_{ijn_i-1}}$. Figure 1(b) shows the reduced BDD corresponding to the MDD in Figure 1(a).
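As a concrete check of Equation (1) on Example 1, the following brute-force sketch (ours, purely illustrative and unrelated to EMBLEM's code) enumerates all selections over $X_{11}$, $X_{12}$ and $X_{21}$ and sums the probabilities of the worlds where the query epidemic is true:

from itertools import product

# Head annotations of Example 1: C1 has heads 1=epidemic, 2=pandemic, 3=null;
# C2 has heads 1=cold, 2=null.
heads_c1 = {1: 0.6, 2: 0.3, 3: 0.1}
heads_c2 = {1: 0.7, 2: 0.3}

p_query = 0.0
for x11, x12, x21 in product(heads_c1, heads_c1, heads_c2):
    p_world = heads_c1[x11] * heads_c1[x12] * heads_c2[x21]
    # epidemic holds in a world iff cold holds and at least one grounding
    # of C1 (for flu(david) or flu(robert)) selects the head epidemic
    if x21 == 1 and (x11 == 1 or x12 == 1):
        p_query += p_world
print(p_query)   # 0.7 * (1 - 0.4 * 0.4) = 0.588, up to rounding

Traversing the BDD of Figure 1(b) with the parameters $\pi_{ik}$ introduced below yields the same value.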
BDDs can be used for computing the probability of queries by associating to each Boolean variable $X_{ijk}$ a parameter $\pi_{ik}$ that represents $P(X_{ijk} = 1)$. If we define $g(i) = \{j \mid \theta_j$ is a substitution grounding $C_i\}$ then $P(X_{ijk} = 1) = \pi_{ik}$ for all $j \in g(i)$. The parameters are obtained from those of the multivalued variables in this way:

$\pi_{i1} = \Pi_{i1}, \quad \ldots, \quad \pi_{ik} = \frac{\Pi_{ik}}{\prod_{j=1}^{k-1}(1 - \pi_{ij})}, \quad \ldots$

up to $k = n_i - 1$.

3 EMBLEM

EMBLEM applies the algorithm for performing EM over BDDs, proposed in [26,9,10,8], to the problem of learning the parameters of an LPAD. EMBLEM takes as input a number of goals that represent the examples and for each one generates the BDD encoding its explanations. The examples are organized in a set of interpretations (sets of ground facts), each describing a portion of the domain of interest. The queries correspond to ground atoms in an interpretation whose predicate has been indicated as “target” by the user. The predicates can be treated as closed-world or open-world. In the first case the body of clauses is resolved only with facts in the interpretation; in the second case it is resolved both with facts in the interpretation and with clauses in the theory. If the last option is set and the theory is cyclic, we use a depth bound on SLD-derivations to avoid going into infinite loops, as proposed by [6].

Given the program containing only the clauses C1 and C2 from Example 1 and the interpretation {epidemic, flu(david), flu(robert)}, we obtain the BDD in Figure 1(b) that represents the query epidemic. A value of 1 for the Boolean variables $X_{111}$ and $X_{121}$ means that, for the ground clauses $C_1\theta_1$ and $C_1\theta_2$, the head $h_{11} = epidemic$ is chosen, regardless of the other variables for the clause ($X_{112}$, $X_{122}$), which are in fact omitted from the diagram.

Then EMBLEM enters the EM cycle, in which the steps of expectation and maximization are repeated until the log-likelihood of the examples reaches a local maximum. The need for EM stems from the fact that, to determine the parameters $\Pi_{ik}$, the number of times a head $h_{ik}$ has been chosen is required. The information about which selection was used in the derivation of a goal is unknown, so the random variables are hidden and we compute expected counts. For a single example $Q$:

– Expectation: computes $E[c_{ik0}|Q]$ and $E[c_{ik1}|Q]$ for all rules $C_i$ and $k = 1, \ldots, n_i - 1$, where $c_{ikx}$ is the number of times a variable $X_{ijk}$ takes value $x$ for $x \in \{0,1\}$, with $j$ in $g(i)$. $E[c_{ikx}|Q]$ is given by $\sum_{j \in g(i)} P(X_{ijk} = x | Q)$.
– Maximization: computes $\pi_{ik}$ for all rules $C_i$ and $k = 1, \ldots, n_i - 1$:

$\pi_{ik} = \frac{E[c_{ik1}|Q]}{E[c_{ik0}|Q] + E[c_{ik1}|Q]} \qquad (2)$

If we have more than one example, the contributions of each example simply sum up when computing $E[c_{ikx}]$.

$P(X_{ijk} = x | Q)$ is given by $P(X_{ijk} = x | Q) = \frac{P(X_{ijk} = x, Q)}{P(Q)}$ with

$P(X_{ijk} = x, Q) = \sum_{\sigma \in S_T} P(Q, X_{ijk} = x, \sigma) = \sum_{\sigma \in S_T} P(Q|\sigma) P(X_{ijk} = x|\sigma) P(\sigma) = \sum_{\sigma \in E(Q)} P(X_{ijk} = x|\sigma) P(\sigma)$

where $P(X_{ijk} = 1|\sigma) = 1$ if $(C_i, \theta_j, k) \in \sigma$ for $k = 1, \ldots, n_i - 1$ and 0 otherwise. Since there is a one to one correspondence between the possible worlds where $Q$ is true and the paths to a 1 leaf in a Binary Decision Tree (an MDT with binary variables),

$P(X_{ijk} = x, Q) = \sum_{\rho \in R(Q)} P(X_{ijk} = x|\rho) \prod_{d \in \rho} \pi(d)$

where $\rho$ is a path, and if $\sigma$ corresponds to $\rho$, then $P(X_{ijk} = x|\sigma) = P(X_{ijk} = x|\rho)$.
$R(Q)$ is the set of paths in the BDD for query $Q$ that lead to a 1 leaf, $d$ is an edge of $\rho$ and $\pi(d)$ is the probability associated to the edge: if $d$ is the 1-branch from a node associated to a variable $X_{ijk}$, then $\pi(d) = \pi_{ik}$; if $d$ is the 0-branch from a node associated to a variable $X_{ijk}$, then $\pi(d) = 1 - \pi_{ik}$.

Now consider a BDT in which only the merge rule is applied, fusing together identical sub-diagrams. The resulting diagram, that we call Complete Binary Decision Diagram (CBDD), is such that every path contains a node for every level. For a CBDD, $P(X_{ijk} = x, Q)$ can be further expanded as

$P(X_{ijk} = x, Q) = \sum_{\rho \in R(Q), (X_{ijk} = x) \in \rho} \; \prod_{d \in \rho} \pi(d)$

where $(X_{ijk} = x) \in \rho$ means that $\rho$ contains an $x$-edge from a node associated to $X_{ijk}$. We can then write

$P(X_{ijk} = x, Q) = \sum_{n \in N(Q), v(n) = X_{ijk}} \; \sum_{\rho_n \in R_n(Q)} \; \sum_{\rho^n \in R^n(Q,x)} \; \prod_{d \in \rho_n} \pi(d) \prod_{d \in \rho^n} \pi(d)$

where $N(Q)$ is the set of nodes of the CBDD, $v(n)$ is the variable associated to node $n$, $R_n(Q)$ is the set containing the paths from the root to $n$ and $R^n(Q,x)$ is the set of paths from $n$ to the 1 leaf through its $x$-child. Hence

$P(X_{ijk} = x, Q) = \sum_{n \in N(Q), v(n) = X_{ijk}} \left( \sum_{\rho_n \in R_n(Q)} \prod_{d \in \rho_n} \pi(d) \right) \left( \sum_{\rho^n \in R^n(Q,x)} \prod_{d \in \rho^n} \pi(d) \right) = \sum_{n \in N(Q), v(n) = X_{ijk}} F(n) B(child_x(n)) \pi_{ikx}$

where $\pi_{ikx}$ is $\pi_{ik}$ if $x = 1$ and $1 - \pi_{ik}$ if $x = 0$. $F(n)$ is the forward probability [10], the probability mass of the paths from the root to $n$, while $B(n)$ is the backward probability [10], the probability mass of the paths from $n$ to the 1 leaf. If $root$ is the root of a tree for a query $Q$ then $B(root) = P(Q)$. The expression $F(n) B(child_x(n)) \pi_{ikx}$ represents the sum of the probabilities of all the paths passing through the $x$-edge of node $n$ and is indicated with $e_x(n)$. Thus

$P(X_{ijk} = x, Q) = \sum_{n \in N(Q), v(n) = X_{ijk}} e_x(n) \qquad (3)$

For the case of a BDD, i.e., a diagram obtained by applying also the deletion rule, Formula (3) is no longer valid since paths where there is no node associated to $X_{ijk}$ can also contribute to $P(X_{ijk} = x, Q)$. These paths might have been obtained from a diagram having a node $m$ associated to variable $X_{ijk}$ that is a descendant of node $n$ along the 0-branch and whose outgoing edges both point to $child_0(n)$. The correction of Formula (3) that takes this aspect into account is applied in the Expectation step.

EMBLEM's main procedure consists of a cycle in which the procedures Expectation and Maximization are repeatedly called. Procedure Expectation returns the log likelihood of the data, which is used in the stopping criterion: EMBLEM stops when the difference between the log likelihood of the current iteration and that of the previous iteration drops below a threshold ε or when this difference is below a fraction δ of the current log likelihood.

Procedure Expectation takes as input a list of BDDs, one for each example, and computes the expectations for each one, i.e. $P(X_{ijk} = x, Q)$ for all variables $X_{ijk}$ in the BDD and values $x \in \{0,1\}$. In the procedure we use $\eta^x(i,k)$ to indicate $\sum_{j \in g(i)} P(X_{ijk} = x, Q)$. Expectation first calls GetForward and GetBackward, which compute the forward and backward probabilities of nodes and $\eta^x(i,k)$ for non-deleted paths only. Then it updates $\eta^x(i,k)$ to take into account deleted paths. The expectations are then updated as follows: for all rules $i$ and for $k = 1$ to $n_i - 1$, $E[c_{ikx}] = E[c_{ikx}] + \eta^x(i,k)/P(Q)$. Procedure Maximization computes the parameter values for the next EM iteration, as specified in (2).
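To make Formula (3) concrete, the following sketch (ours, purely illustrative, not EMBLEM's GetForward/GetBackward code) computes the forward and backward probabilities and the $e_x(n)$ values on the CBDD for the query epidemic of Example 1. The node names n1, n2a, n2b, n3 and n3b are ours; with respect to the BDD of Figure 1(b), the CBDD keeps the $X_{121}$ node that the deletion rule removes along the 1-branch of the root (and a vacuous $X_{211}$ node), so Formula (3) applies without the correction for deleted paths. The parameters are $\pi_{11} = \Pi_{11} = 0.6$ (shared by $X_{111}$ and $X_{121}$) and $\pi_{21} = \Pi_{21} = 0.7$.

# Each node has a variable, a 1-child and a 0-child; leaves are '1' and '0'.
pi = {'X111': 0.6, 'X121': 0.6, 'X211': 0.7}   # P(X_ijk = 1)
var = {'n1': 'X111', 'n2a': 'X121', 'n2b': 'X121', 'n3': 'X211', 'n3b': 'X211'}
child1 = {'n1': 'n2a', 'n2a': 'n3', 'n2b': 'n3', 'n3': '1', 'n3b': '0'}
child0 = {'n1': 'n2b', 'n2a': 'n3', 'n2b': 'n3b', 'n3': '0', 'n3b': '0'}
order = ['n1', 'n2a', 'n2b', 'n3', 'n3b']      # root first, level by level

# Forward pass: probability mass of the paths from the root to each node.
F = {n: 0.0 for n in order}
F['n1'] = 1.0
for n in order:
    p = pi[var[n]]
    for c, w in ((child1[n], p), (child0[n], 1 - p)):
        if c in F:
            F[c] += F[n] * w

# Backward pass: probability mass of the paths from each node to the 1 leaf.
B = {'1': 1.0, '0': 0.0}
for n in reversed(order):
    p = pi[var[n]]
    B[n] = p * B[child1[n]] + (1 - p) * B[child0[n]]

print(B['n1'])         # B(root) = P(Q) = 0.588, as in the world enumeration

# e_x(n) = F(n) * B(child_x(n)) * pi_ikx, accumulated per Boolean variable.
eta1, eta0 = {}, {}
for n in order:
    v, p = var[n], pi[var[n]]
    eta1[v] = eta1.get(v, 0.0) + F[n] * B[child1[n]] * p
    eta0[v] = eta0.get(v, 0.0) + F[n] * B[child0[n]] * (1 - p)

# Expected counts for rule C1, head 1 (E step, summing over both groundings),
# followed by the M-step update of Equation (2).
pQ = B['n1']
c1 = (eta1['X111'] + eta1['X121']) / pQ
c0 = (eta0['X111'] + eta0['X121']) / pQ
print(c1 / (c0 + c1))  # new value of pi_11 after one EM iteration

Running it prints $P(Q) = 0.588$ and the updated $\pi_{11}$ after one EM iteration on this single example (about 0.714). EMBLEM computes the same quantities level by level with the procedures GetForward and GetBackward described next, adding the correction for deleted paths.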
Procedure GetForward traverses the diagram one level at a time starting from the root level, where $F(root) = 1$, and for each node $n$ it computes its contribution to the forward probabilities of its children. The forward probabilities of both children are updated in this way: $F(child_x(node)) = F(child_x(node)) + F(node) \cdot \pi_{ikx}$. Function GetBackward computes the backward probability of nodes by traversing the tree recursively from the leaves to the root. When the calls of GetBackward for both children of a node $n$ return, we have all the information needed to compute the $e_x$ values and the value of $\eta^x(i,k)$ for non-deleted paths. An array $\varsigma$ is used here to store the contributions of the deleted paths, starting from the root level and accumulating $\varsigma(l)$ for the various levels $l$. A fully detailed description of EMBLEM, together with an example of its execution, can be found in [1].

4 Related Works

Our work has close connections with various other works. [9,10] proposed an EM algorithm for learning the parameters of Boolean random variables given observations of the values of a Boolean function over them, represented by a BDD. EMBLEM is an application of that algorithm to probabilistic logic programs. Independently, [26] also proposed an EM algorithm over BDDs to learn parameters for the CPT-L language. [7] presented the CoPrEM algorithm, which performs EM over BDDs for the ProbLog language.

Approaches for learning probabilistic logic programs can be classified into three categories: those that employ constraint techniques (such as [16,18]), those that use EM and those that adopt gradient descent.

Among the approaches that use EM, [12] first proposed to use it to induce parameters, and the Structural EM algorithm to induce ground LPAD structures. Their EM algorithm, however, works on the underlying Bayesian network. RIB [20] performs parameter learning using the information bottleneck approach, which is an extension of EM targeted especially towards hidden variables. The PRISM system [21,22] is one of the first learning algorithms based on EM.

Among the works that use a gradient descent technique, LeProbLog [5,6] finds the parameters of a ProbLog program that minimize the Mean Squared Error of the probability of queries and uses BDDs to compute the gradient.

Alchemy [15] is a state of the art SRL system that offers various tools for inference, weight learning and structure learning of Markov Logic Networks (MLNs). MLNs differ significantly from the languages under the distribution semantics since they extend first-order logic by attaching weights to logical formulas, reflecting “how strong” they are, but they do not allow the exploitation of logic programming techniques.

5 Experiments

EMBLEM has been tested over three real world datasets: IMDB (http://alchemy.cs.washington.edu/data/imdb), UW-CSE (http://alchemy.cs.washington.edu/data/uw-cse) [23] and Cora (http://alchemy.cs.washington.edu/data/cora) [23]. We implemented EMBLEM in Yap Prolog (http://www.dcc.fc.up.pt/~vsc/Yap) and we compared it with RIB [20]; CEM, an implementation of EM based on the cplint inference library [17,19]; LeProbLog [5,6] and Alchemy [15]. All experiments were performed on Linux machines with an Intel Core 2 Duo E6550 (2333 MHz) processor and 4 GB of RAM.

To compare our results with LeProbLog we exploited the translation of LPADs into ProbLog proposed in [3]; for Alchemy we exploited the translation between LPADs and MLNs used in [20]. For the probabilistic logic programming systems (EMBLEM, RIB, CEM and LeProbLog) we consider various options.
The first option consists in choosing between associating a distinct random variable to each grounding of a probabilistic clause or a single random variable to a non-ground probabilistic clause, expressing whether the clause is used or not. The latter case makes the problem easier. The second option concerns imposing a limit on the depth of derivations as done in [6], thus eliminating explanations associated to derivations exceeding the depth limit. This is necessary for problems that contain cyclic clauses, such as transitive closure clauses. The third option involves setting the number of restarts for EM-based algorithms. All experiments for probabilistic logic programming systems have been performed using open-world predicates.

IMDB regards movies, actors, directors and movie genres and is divided into five mega-examples. We performed training on four mega-examples and testing on the remaining one. Then we drew a Precision-Recall curve and computed the Area Under the Curve (AUCPR) using the method reported in [2]. We defined 4 different LPADs, two for predicting the target predicate sameperson/2 and two for predicting samemovie/2. We had one positive example for each fact that is true in the data, while we sampled from the complete set of false facts three times the number of true instances in order to generate negative examples.

For predicting sameperson/2 we used the same LPAD of [20]:

sameperson(X,Y):p:- movie(M,X),movie(M,Y).
sameperson(X,Y):p:- actor(X),actor(Y),workedunder(X,Z),workedunder(Y,Z).
sameperson(X,Y):p:- gender(X,Z),gender(Y,Z).
sameperson(X,Y):p:- director(X),director(Y),genre(X,Z),genre(Y,Z).

where p is a placeholder meaning that the parameter must be learned. We ran EMBLEM on it with the following settings: no depth bound, random variables associated to instantiations of clauses and a number of restarts chosen to match the execution time of EMBLEM with that of the fastest other algorithm.

The queries that LeProbLog takes as input are obtained by annotating with 1.0 each positive example for sameperson/2 and with 0.0 each negative example for sameperson/2. We ran LeProbLog for a maximum of 100 iterations or until the difference in Mean Squared Error (MSE) between two iterations got smaller than $10^{-5}$; this setting was used in all the following experiments as well. For Alchemy we always used the preconditioned rescaled conjugate gradient discriminative algorithm [11]. For this experiment we specified sameperson/2 as the only non-evidence predicate.

A second LPAD has been created to evaluate the performance of the algorithms when some atoms are unseen:

sameperson_pos(X,Y):p:- movie(M,X),movie(M,Y).
sameperson_pos(X,Y):p:- actor(X),actor(Y),workedunder(X,Z),workedunder(Y,Z).
sameperson_pos(X,Y):p:- director(X),director(Y),genre(X,Z),genre(Y,Z).
sameperson_neg(X,Y):p:- movie(M,X),movie(M,Y).
sameperson_neg(X,Y):p:- actor(X),actor(Y),workedunder(X,Z),workedunder(Y,Z).
sameperson_neg(X,Y):p:- director(X),director(Y),genre(X,Z),genre(Y,Z).
sameperson(X,Y):p:- \+ sameperson_pos(X,Y), sameperson_neg(X,Y).
sameperson(X,Y):p:- \+ sameperson_pos(X,Y), \+ sameperson_neg(X,Y).
sameperson(X,Y):p:- sameperson_pos(X,Y), sameperson_neg(X,Y).
sameperson(X,Y):p:- sameperson_pos(X,Y), \+ sameperson_neg(X,Y).

The sameperson_pos/2 and sameperson_neg/2 predicates are unseen in the data.
Settings are the same as the ones for the previous LPAD. In this experiment Alchemy was run with the -withEM option, which turns on EM learning. Table 1 shows the AUCPR averaged over the five folds for EMBLEM, RIB, LeProbLog, CEM and Alchemy. Results for the two LPADs are shown respectively in the IMDB-SP and IMDBu-SP rows. Table 2 shows the learning times in hours.

For predicting samemovie/2 we used the LPAD:

samemovie(X,Y):p:- movie(X,M),movie(Y,M),actor(M).
samemovie(X,Y):p:- movie(X,M),movie(Y,M),director(M).
samemovie(X,Y):p:- movie(X,A),movie(Y,B),actor(A),director(B),workedunder(A,B).
samemovie(X,Y):p:- movie(X,A),movie(Y,B),director(A),director(B),genre(A,G),genre(B,G).

To test the behaviour when unseen predicates are present, we transformed the program for samemovie/2 as we did for sameperson/2, thus introducing the unseen predicates samemovie_pos/2 and samemovie_neg/2. We ran EMBLEM on them with no depth bound, one variable for each instantiation of the rules and one random restart. With regard to LeProbLog and Alchemy, we ran them with the same settings as IMDB-SP and IMDBu-SP, replacing sameperson with samemovie. Table 1 shows, in the IMDB-SM and IMDBu-SM rows, the average AUCPR for EMBLEM, LeProbLog and Alchemy. For RIB and CEM we obtained a lack of memory error (indicated with “me”); Table 2 shows the learning times in hours.

The Cora database contains citations to computer science research papers. For each citation we know the title, authors, venue and the words that appear in them. The task is to determine which citations refer to the same paper, by predicting the predicate samebib(cit1,cit2). From the MLN proposed in [24] (http://alchemy.cs.washington.edu/mlns/er) we obtained two LPADs.

The first contains 559 rules and differs from the direct translation of the MLN because rules involving words are instantiated with the different constants, only positive literals for the hasword predicates are used and transitive rules are not included:

samebib(B,C):p:- author(B,D),author(C,E),sameauthor(D,E).
samebib(B,C):p:- title(B,D),title(C,E),sametitle(D,E).
samebib(B,C):p:- venue(B,D),venue(C,E),samevenue(D,E).
samevenue(B,C):p:- haswordvenue(B,word_06),haswordvenue(C,word_06).
...
sametitle(B,C):p:- haswordtitle(B,word_10),haswordtitle(C,word_10).
...
sameauthor(B,C):p:- haswordauthor(B,word_a),haswordauthor(C,word_a).
...

The dots stand for the rules for all the possible words. The Cora dataset comprises five mega-examples, each containing facts for the four predicates samebib/2, samevenue/2, sametitle/2 and sameauthor/2, which have been set as target predicates. We used as negative examples those contained in the Alchemy dataset. We ran EMBLEM on this LPAD with no depth bound, a single variable for each instantiation of the rules and a number of restarts chosen to match the execution time of EMBLEM with that of the fastest other algorithm.

The second LPAD adds to the previous one four transitive rules of the form

samebib(A,B):p:- samebib(A,C),samebib(C,B).

one for every target predicate, for a total of 563 rules. In this case we had to run EMBLEM with a depth bound equal to two and a single variable for each non-ground rule; the number of restarts was one. As for LeProbLog, we learned the four predicates separately because learning the whole theory at once would give a lack of memory error. We annotated with 1.0 each positive example for samebib/2, sameauthor/2, sametitle/2 and samevenue/2, and with 0.0 the negative examples for the same predicates.
As for Alchemy, we learned weights with the four predicates as the non-evidence predicates. Table 1 shows, in the Cora and CoraT (Cora transitive) rows, the average AUCPR obtained by training on four mega-examples and testing on the remaining one. CEM and Alchemy on CoraT gave a lack of memory error, while RIB was not applicable because it was not possible to split the input examples into smaller independent interpretations as required by RIB.

The UW-CSE dataset contains information about the Computer Science department of the University of Washington through 22 different predicates, such as yearsInProgram/2, advisedBy/2 and taughtBy/3, and is split into five mega-examples. The goal here is to predict the advisedBy/2 predicate, namely the fact that a person is advised by another person: this was our target predicate. The negative examples have been generated by considering all couples of persons (a,b) where a and b appear in an advisedby/2 fact in the data and by adding a negative example advisedby(a,b) if it is not in the data. The theory used was obtained from the MLN of [23] (http://alchemy.cs.washington.edu/mlns/uw-cse). It contains 86 rules, such as for instance:

advisedby(S,P):p:- courselevel(C,level_500),taughtby(C,P,Q),ta(C,S,Q).

We ran EMBLEM on it with a single variable for each instantiation of a rule, a depth bound of two and one random restart. The annotated queries that LeProbLog takes as input have been created by annotating with 1.0 each positive example for advisedby/2 and with 0.0 the negative examples. As for Alchemy, we learned weights with advisedby/2 as the only non-evidence predicate. Table 1 shows the AUCPR averaged over the five mega-examples for all the algorithms. Table 3 shows the p-value of a paired two-tailed t-test at the 5% significance level of the difference in AUCPR between EMBLEM and RIB/LeProbLog/CEM/Alchemy (significant differences in bold).

Table 1. Results of the experiments on all datasets. IMDBu refers to the IMDB dataset with the theory containing unseen predicates. CoraT refers to the theory containing transitive rules. Numbers in parentheses followed by r indicate the number of random restarts (when different from one) needed to reach the area specified. “me” means memory error during learning. AUCPR is the area under the precision-recall curve averaged over the five folds. R is RIB, L is LeProbLog, C is CEM, A is Alchemy.

                          AUCPR
Dataset     EMBLEM        R      L      C      A
IMDB-SP     0.202(500r)   0.199  0.096  0.202  0.107
IMDBu-SP    0.175(40r)    0.166  0.134  0.120  0.020
IMDB-SM     1.000         me     0.933  0.537  0.820
IMDBu-SM    1.000         me     0.933  0.515  0.338
Cora        0.995(120r)   0.939  0.905  0.995  0.469
CoraT       0.991         no     0.970  me     me
UW-CSE      0.883         0.588  0.270  0.644  0.294

Table 2. Execution time in hours of the experiments on all datasets. R is RIB, L is LeProbLog, C is CEM and A is Alchemy.

                      Time(h)
Dataset     EMBLEM    R       L       C       A
IMDB-SP     0.01      0.016   0.35    0.01    1.54
IMDBu-SP    0.01      0.0098  0.23    0.012   1.54
IMDB-SM     0.00036   me      0.005   0.0051  0.0026
IMDBu-SM    3.22      me      0.0121  0.0467  0.0108
Cora        2.48      2.49    13.25   11.95   1.30
CoraT       0.38      no      4.61    me      me
UW-CSE      2.81      0.56    1.49    0.53    1.95

Table 3. Results of the t-test on all datasets. p is the p-value of a paired two-tailed t-test (significant differences at the 5% level in bold) between EMBLEM and all the others. R is RIB, L is LeProbLog, C is CEM, A is Alchemy.
                              p
Dataset     EMBLEM-R  EMBLEM-L     EMBLEM-C  EMBLEM-A
IMDB-SP     0.2167    0.0126       0.3739    0.0134
IMDBu-SP    0.1276    0.1995       0.001     4.5234e-005
IMDB-SM     me        0.3739       0.0241    0.1790
IMDBu-SM    me        0.3739       0.2780    2.2270e-004
Cora        0.011     0.0729       1         0.0068
CoraT       no        0.0464       me        me
UW-CSE      0.0054    1.5017e-004  0.0088    4.9921e-004

From the results we can observe that over IMDB EMBLEM has performance comparable to CEM on IMDB-SP, with similar execution time. On IMDBu-SP it performs better than all the other systems, with a learning time equal to that of the fastest other algorithm. On IMDB-SM it reaches the highest area value in less time (only one restart is needed). On IMDBu-SM it still reaches the highest area with one restart, but with a longer execution time. Over Cora it has performance comparable to the best other system, CEM, but in a significantly lower time, and over CoraT it is one of the few systems able to complete learning, with better performance in terms of area and time. Over UW-CSE it performs significantly better than all the other algorithms. The memory errors that we encountered with some systems over certain datasets have to be ascribed to the memory needs of those systems; for instance, some of them are not able to manage the LPAD for CoraT because its transitive rules generate large BDDs.

6 Conclusions

We have proposed a technique that applies an EM algorithm to learning the parameters of Logic Programs with Annotated Disjunctions. It can be applied to all languages that are based on the distribution semantics and exploits the BDDs that are built during inference to efficiently compute the expectations for the hidden variables.

We executed the algorithm over the real datasets IMDB, UW-CSE and Cora, and evaluated its performance - together with that of four other probabilistic systems - through the AUCPR and AUCROC. These results show that EMBLEM uses less memory than RIB, CEM and Alchemy, allowing it to solve larger problems, as one can see from Table ?? where, for some datasets, not all the mentioned algorithms are able to terminate. Moreover its speed allows it to perform a high number of restarts, making it escape local maxima and achieve higher AUCPR.

EMBLEM is available in the cplint package in the source tree of Yap Prolog and information on its use can be found at http://sites.google.com/a/unife.it/ml/emblem.

In the future we plan to extend EMBLEM to learning the structure of LPADs by combining the standard Expectation Maximization algorithm, which optimizes parameters, with structure search for model selection.

References

1. Bellodi, E., Riguzzi, F.: EM over binary decision diagrams for probabilistic logic programs. Tech. Rep. CS-2011-01, ENDIF, Università di Ferrara, Italy (2011)
2. Davis, J., Goadrich, M.: The relationship between Precision-Recall and ROC curves. In: Cohen, W.W., Moore, A. (eds.) Proceedings of the 23rd International Conference on Machine Learning. ACM International Conference Proceeding Series, vol. 148, pp. 233–240. ACM (2006)
3. De Raedt, L., Demoen, B., Fierens, D., Gutmann, B., Janssens, G., Kimmig, A., Landwehr, N., Mantadelis, T., Meert, W., Rocha, R., Santos Costa, V., Thon, I., Vennekens, J.: Towards digesting the alphabet-soup of statistical relational learning. In: Roy, D., Winn, J., McAllester, D., Mansinghka, V., Tenenbaum, J. (eds.) Proceedings of the 1st Workshop on Probabilistic Programming: Universal Languages, Systems and Applications, in NIPS (2008)
4. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its application in link discovery. In: Veloso, M.M. (ed.) Proceedings of the 20th International Joint Conference on Artificial Intelligence. pp. 2462–2467. AAAI Press (2007)
5. Gutmann, B., Kimmig, A., Kersting, K., De Raedt, L.: Parameter learning in probabilistic databases: A least squares approach. In: Daelemans, W., Goethals, B., Morik, K. (eds.) Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. LNCS, vol. 5211, pp. 473–488. Springer (2008)
6. Gutmann, B., Kimmig, A., Kersting, K., De Raedt, L.: Parameter estimation in ProbLog from annotated queries. Tech. Rep. CW 583, Department of Computer Science, Katholieke Universiteit Leuven, Belgium (2010)
7. Gutmann, B., Thon, I., De Raedt, L.: Learning the parameters of probabilistic logic programs from interpretations. Tech. Rep. CW 584, Department of Computer Science, Katholieke Universiteit Leuven, Belgium (June 2010)
8. Inoue, K., Sato, T., Ishihata, M., Kameya, Y., Nabeshima, H.: Evaluating abductive hypotheses using an EM algorithm on BDDs. In: Boutilier, C. (ed.) Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI). pp. 810–815. Morgan Kaufmann Publishers Inc. (2009)
9. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algorithm by BDDs. In: Železný, F., Lavrač, N. (eds.) Late Breaking Papers of the 18th International Conference on Inductive Logic Programming. pp. 44–49 (2008)
10. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algorithm by BDDs. Tech. Rep. TR08-0004, Dept. of Computer Science, Tokyo Institute of Technology (2008)
11. Lowd, D., Domingos, P.: Efficient weight learning for Markov logic networks. In: Kok, J.N., Koronacki, J., de Mántaras, R.L., Matwin, S., Mladenic, D., Skowron, A. (eds.) Proceedings of the 18th European Conference on Machine Learning. LNCS, vol. 4702, pp. 200–211. Springer (2007)
12. Meert, W., Struyf, J., Blockeel, H.: Learning ground CP-Logic theories by leveraging Bayesian network learning techniques. Fundamenta Informaticae 89(1), 131–160 (2008)
13. Neapolitan, R.: Learning Bayesian Networks. Prentice Hall, Upper Saddle River, NJ (2003)
14. Poole, D.: The Independent Choice Logic for modelling multiple agents under uncertainty. Artificial Intelligence 94(1-2), 7–56 (1997)
15. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1-2), 107–136 (2006)
16. Riguzzi, F.: ALLPAD: Approximate learning of logic programs with annotated disjunctions. In: Muggleton, S., Otero, R.P., Tamaddoni-Nezhad, A. (eds.) Proceedings of the 16th International Conference on Inductive Logic Programming. LNCS, vol. 4455, pp. 43–45. Springer (2007)
17. Riguzzi, F.: A top-down interpreter for LPAD and CP-Logic. In: Basili, R., Pazienza, M.T. (eds.) Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence. LNCS, vol. 4733, pp. 109–120. Springer (2007)
18. Riguzzi, F.: ALLPAD: Approximate learning of logic programs with annotated disjunctions. Machine Learning 70(2-3), 207–223 (2008)
19. Riguzzi, F.: Extended semantics and inference for the Independent Choice Logic. Logic Journal of the IGPL 17(6), 589–629 (2009)
20. Riguzzi, F., Di Mauro, N.: Applying the information bottleneck to statistical relational learning. Machine Learning (2011), to appear
21. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: Sterling, L. (ed.) Proceedings of the 12th International Conference on Logic Programming. pp. 715–729. MIT Press (1995)
22. Sato, T., Kameya, Y.: Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research 15, 391–454 (2001)
23. Singla, P., Domingos, P.: Discriminative training of Markov logic networks. In: Veloso, M.M., Kambhampati, S. (eds.) Proceedings of the 20th National Conference on Artificial Intelligence and the 17th Innovative Applications of Artificial Intelligence Conference. pp. 868–873. AAAI Press/The MIT Press (2005)
24. Singla, P., Domingos, P.: Entity resolution with Markov logic. In: Proceedings of the 6th IEEE International Conference on Data Mining. pp. 572–582. IEEE Computer Society (2006)
25. Thayse, A., Davio, M., Deschamps, J.P.: Optimization of multivalued decision algorithms. In: International Symposium on Multiple-Valued Logic. pp. 171–178. IEEE Computer Society Press (1978)
26. Thon, I., Landwehr, N., De Raedt, L.: A simple model for sequences of relational state descriptions. In: Daelemans, W., Goethals, B., Morik, K. (eds.) Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD 2008) - Part II. LNCS, vol. 5212, pp. 506–521. Springer-Verlag (2008)
27. Vennekens, J., Denecker, M., Bruynooghe, M.: CP-logic: A language of causal probabilistic events and its relation to logic programming. Theory and Practice of Logic Programming 9(3), 245–308 (2009)
28. Vennekens, J., Verbaeten, S.: Logic programs with annotated disjunctions. Tech. Rep. CW 386, Department of Computer Science, Katholieke Universiteit Leuven, Belgium (2003)
29. Vennekens, J., Verbaeten, S., Bruynooghe, M.: Logic programs with annotated disjunctions. In: Demoen, B., Lifschitz, V. (eds.) Proceedings of the 20th International Conference on Logic Programming. LNCS, vol. 3131, pp. 195–209. Springer (2004)