<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning the Parameters of Deep Probabilistic Logic Programs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arnaud Nguembang Fadja</string-name>
          <email>arnaud.nguembafadja@unife.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Riguzzi</string-name>
          <email>fabrizio.riguzzi@unife.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evelina Lamma</string-name>
          <email>evelina.lamma@unife.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Matematica e Informatica</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Ferrara Via Saragat</institution>
          <addr-line>1, I-44122, Ferrara</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Probabilistic logic programming (PLP) is a powerful tool for reasoning in relational domains with uncertainty. However, both inference and learning from examples are computationally expensive tasks. We consider a restriction of PLP called hierarchical PLP whose clauses and predicates are hierarchically organized forming a deep neural network or arithmetic circuit. Inference in this language is much cheaper than for general PLP languages. In this work we present an algorithm called Deep Parameter learning for HIerarchical probabilistic Logic programs (DPHIL) that learns hierarchical PLP parameters using gradient descent and back-propagation.</p>
      </abstract>
      <kwd-group>
        <kwd>Probabilistic Logic Programming</kwd>
        <kwd>Distribution Semantics</kwd>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Arithmetic Circuits</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Probabilistic logic programming (PLP) under the distribution semantics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] has
been very useful in machine learning. However, inference is expensive so machine
learning algorithms may turn out to be slow. We consider a restriction of the
language of Logic Programs with Annotated Disjunctions (LPADs) called
hierarchical PLP (HPLP), see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in which clauses and predicates are hierarchically
organized. Any HPLP can be translated into an arithmetic circuit (AC) or a
deep neural network. Inference is performed by evaluating the AC and
parameter learning by applying gradient descent and back-propagation algorithms.
      </p>
      <p>The paper is organized as follows: we describe HPLP in Section 2 and present
parameter learning in Section 3. Related work is discussed in Section 4 and
Section 5 concludes the paper.
hi1; : : : ; hini are logical atoms, bi1; : : : ; bimi are logical literals and
wih1e;r:e:: P;ninii are real numbers in the interval [0; 1] that sum up to 1. Clauses
k=1 ik &lt; 1 implicitly contain an extra atom null that does not appear
in the body of any clause and whose annotation is 1 Pkn=i1 ik.</p>
      <p>
        An HPLP is a restricted LPAD where clauses are of the form Cpi = ap :
pi : pi; bpi1; : : : ; bpimp . ap, an atom possibly about target predicate (the
atom we aim at predicting), is the single head atom annotated with probability
pi. The body of the clause has 2 parts: a conjunction of input predicates, pi, in
the sense that their de nition is given as input and is certain, and a conjunction
of literals for hidden predicates, bpik, that are de ned by clauses of the program.
Hidden predicates are disjoint from input and target predicates. Moreover, the
program is hierarchically organized so that it can be divided into layers forming
an Arithmetic circuit or a deep neural netwoks, see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Let us consider the following example in the UW-CSE domain where the
objective is to predict the \advised by" relation (advisedby/2 is the target
predicate)</p>
      <p>C1 = advisedby(A; B) : 0:3 :
student(A); prof essor(B); project(C; A); project(C; B);
r11(A; B; C):
C2 = advisedby(A; B) : 0:6 :</p>
      <p>student(A); prof essor(B); ta(C; A); taughtby(C; B):
C111 = r11(A; B; C) : 0:2 :</p>
      <p>publication(D; A; C); publication(D; B; C):
where project(C; A) means that C is a project with participant A, ta(C; A)
means that A is a teaching assistant (TA) for course C and taughtby(C; B)
means that course C is taught by B. publication(A; B; C) means that A is
a publication with author B produced in project C. student=1, prof essor=1,
project=2, ta=2, taughtby=2 and publication=3 are input predicates and r11=3 is
a hidden predicate.</p>
      <p>The probability p = advisedby(harry; ben) depends on the number of joint
courses and projects and on the number of joint publications from projects. The
clause for the hidden predicate r11=3 computes an aggregation over publications
of the same project and the clause of the level above aggregates over projects.
The corresponding arithmetic circuit is shown in Fig 1 .</p>
      <p>In order to perform inference with such a program, we can generate its
grounding. Each ground probabilistic clause is associated with a random
variable whose probability of being true is given by the parameter of the clause and
that is independent of all the other clause random variables. Given a ground
instance of Cpi, the probability that the body is true is computed by multiplying
the probability of being true of each individual atom in positive literals and one
minus the probability of being true of each individual atom in negative literals.
Therefore the probability of the body of Cpi is P (bpi1; : : : ; bpimp ) = Qim=pk P (bpik)
and P (bpik) = 1 P (apik) if bpik = :apik. If bpik = a is a literal for an input
predicate, P (a) = 1 if it belongs to the example interpretation and P (a) = 0
otherwise. To compute the probability, P (ap), of an atom ap of a given hidden
0:3
0:2</p>
      <p>L
L</p>
      <p>L</p>
      <p>
        0:6
literal bp, we need to take into account the contribution of every ground clause
for the atom ap. For a single clause, P (ap) = p1 P (body(Cp1)). For two clauses,
we can compute the contribution of each clause as above and then combine them
with the formula P (api) = 1 (1 p1 P (body(Cp1)) (1 p2 P (body(Cp2))).
We de ned the operator that combines two probabilities as follows p q =
1 (1 p) (1 q). This operator is commutative and associative. For many
clauses, we can compute sequences of applications as Li pi = 1 Qi(1 pi). AC
are built by performing inference using tabling and answer subsumption using
PITA(IND,IND) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, writing HPLP can be
unintuitive because of the constraints imposed on it and because of uncleared meaning
of hidden predicates. We plan in our future work to design an algorithm for
learning both the parameters and the structure of HPLP.
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>Parameter learning</title>
      <p>
        We propose a back-propagation algorithm for learning HPLP parameters. Given
an HPLP T with parameters , an interpretation I de ning input predicates and
a set of positive and negative examples E = fe1; : : : ; eM ; not eM+1; : : : ; not eN g
where each ei is a ground atom for the target predicate r, we want to nd the
values of that minimize the sum of cross entropy errors, erri = yi log(pi)
(1 yi) log(1 pi), for all the examples. The algorithm, called DPHIL 1, starts by
building a set of ground ACs sharing parameters , line 2. Then weights (W [i]),
gradients (G[i]) and moments (adam's parameters, see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) are initialized, lines
2{6 . At each iteration, three actions are performed on each AC: the Forward
pass computes the output v(n) of each node n in the AC, the Backward pass
computes the derivative of the error d(n) 1, that is the gradient, with respect to
each parameter
8&gt;d(pan) vv(p(ann) )
&gt;&gt; 1 v(pan)
d(n) = &gt;&lt;d(pan) 1 v(n)
&gt;&gt;Ppan d(pan):v(pan):(1
&gt;
&gt;
: d(pan)
if n is a L node,
if n is a
      </p>
      <p>
        node
i) if n= (Wi)
pan = not(n)
(1)
where pan is the parent of node n and i = (Wi) = 1+e1 Wi , lines 9{16.
Parameters are updated using Adam the optimizer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], line 17. These actions are
repeatedly performed until a maximum number of steps is reached or until the
di erence between the log likelihood of the current and the previous iteration
drops below a threshold or the di erence is below a fraction of the current log
likelihood. Finally the theory is updated with the learned parameters, line 19.
Algorithm 1 Function DPHIL.
1: function DHPIL(T heory; ; ; M axIter; 1; 2; ; ^; Strategy)
2: Examples BuildACs(T heory) . Build the set of ACs
3: for i 1 ! jT heoryj do . Initialize weights,gradient and moments vector
4: W [i] random(M in; M ax) . initially W [i] 2 [M in; M ax].
5: G[i] 0, M oment0[i] 0, M oment1[i] 0
6: end for
7: Iter 1, LL 0
8: repeat
9: LL0 LL
10: LL 0
11: Batch NextBatch(Examples) . Select the batch according to the
strategy
12: for all node 2 Batch do
13: P Forward(node)
14: Backward(G; P1 ; node)
15: LL LL + log P
16: end for
17: UpdateWeightsAdam(W; G; M oment0; M oment1; 1; 2; ; ^; Iter)
18: until LL LL0 &lt; _ LL LL0 &lt; LL: _ Iter &gt; M axIter
19: F inalT heory UpdateTheory(T heory; W )
20: return F inalT heory
21: end function
4
      </p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        HPLP is related to [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">4,2,5</xref>
        ] where the probability of a query is computed by
combining the contribution of di erent rules and grounding of rules with
noisyOr or Mean combining rules. In First-Order Probabilistic Logic (FOPL), [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and Bayesian Logic Programs (BLP), [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], each ground atoms is considered as
a random variable and rules have a single atom in the head and only positive
literals in the body. Each rule is associated with a conditional probability table
de ning the dependence of the head variable from the body ones. Similarly to
HPLP, FOPL and BLP allow multiple layer of rules. Di erently from FOPL
and HPLP, BLP allows di erent combining rules. Like FOLP, BLP and HPLP,
First-Order Conditional In uence Language (FOCIL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], uses probabilistic rules
for compactly encoding probabilistic dependencies. The probability of a query is
computed using two combining rules: the contributions of di erent groundings
of the same rule with the same random variable in the head are combined by
taking the Mean, and the contributions of di erent rules are combined either
with a weighted mean or with a noisy-OR combining rule. FOPL, BLP and
FOCIL implement parameter learning using gradient descent, as DPHIL, or
using the Expectation Maximization algorithm. By suitably restricting the form
of programs, we were able to provide simpler formulas for the computation of
the gradient with respect to [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">4,2,5</xref>
        ].
5
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        We have presented the algorithm DPHIL that performs parameter learning of
HPLP. DPHIL converts an HPLP into an AC and performs gradient descent
and back-propagation. In the future we plan to experiments with DPHIL and
investigate the Expectation Maximization algorithm as an alternative parameter
learning method. We also plan to design an algorithm for learning both the
structure and the parameters of HPLPs, taking inspiration from SLIPCOVER
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellodi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riguzzi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Structure learning of probabilistic logic programs by searching the clause space</article-title>
          .
          <source>Theor. Pract. Log. Prog</source>
          .
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <volume>169</volume>
          {
          <fpage>212</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kersting</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Raedt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Basic principles of learning bayesian logic programs</article-title>
          . In: Institute for Computer Science, University of Freiburg.
          <source>Citeseer</source>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
          </string-name>
          , J.:
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfe</surname>
            <given-names>er</given-names>
          </string-name>
          , A.:
          <article-title>Learning probabilities for noisy rst-order rules</article-title>
          .
          <source>In: IJCAI</source>
          . pp.
          <volume>1316</volume>
          {
          <issue>1323</issue>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadepalli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kunapuli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shavlik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Learning parameters for relational probabilistic models with noisy-or combining rule</article-title>
          .
          <source>In: Machine Learning and Applications</source>
          ,
          <year>2009</year>
          . ICMLA'09. International Conference on. pp.
          <volume>141</volume>
          {
          <fpage>146</fpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Nguembang</given-names>
            <surname>Fadja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Lamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Riguzzi</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Deep probabilistic logic programming</article-title>
          .
          <source>CEUR-WS</source>
          , vol.
          <year>1916</year>
          , pp.
          <volume>3</volume>
          {
          <fpage>14</fpage>
          .
          <string-name>
            <surname>Sun SITE Central Europe</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Riguzzi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Speeding up inference for probabilistic logic programs</article-title>
          .
          <source>Comput. J</source>
          .
          <volume>57</volume>
          (
          <issue>3</issue>
          ),
          <volume>347</volume>
          {
          <fpage>363</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sato</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A statistical learning method for logic programs with distribution semantics</article-title>
          . In: Sterling,
          <string-name>
            <surname>L. (ed.) ICLP</surname>
          </string-name>
          <year>1995</year>
          . pp.
          <volume>715</volume>
          {
          <fpage>729</fpage>
          . MIT Press (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>