                               Lifted Relational Neural Networks


                   Gustav Šourek                  Vojtěch Aschenbrenner              Filip Železný
              Czech Technical University             Charles University          Czech Technical University
               Prague, Czech Republic              Prague, Czech Republic         Prague, Czech Republic
             souregus@fel.cvut.cz                        v@asch.cz               zelezny@fel.cvut.cz

                                                    Ondřej Kuželka∗
                                                    Cardiff University
                                                 Cardiff, United Kingdom
                                              KuzelkaO@cardiff.ac.uk



                                                         Abstract
                      We propose a method combining relational-logic representations with neural net-
                      work learning. A general lifted architecture, possibly reflecting some background
                      domain knowledge, is described through relational rules which may be hand-
                      crafted or learned. The relational rule-set serves as a template for unfolding pos-
                      sibly deep neural networks whose structures also reflect the structures of given
                      training or testing relational examples. Different networks corresponding to dif-
                      ferent examples share their weights, which co-evolve during training by stochastic
                      gradient descend algorithm. Discovery of notable latent relational concepts and
                      experiments on 78 relational learning benchmarks demonstrate favorable perfor-
                      mance of the method.


         1       Introduction
          Lifted models, also known as templated models, have attracted significant attention recently [10] in ar-
          eas such as statistical relational learning. Lifted models define patterns from which specific (ground)
         models can be unfolded. For example, a lifted Markov network model [18] may express that friends
         of smokers tend to be smokers and such a pattern then constrains the probabilistic relationships in all
         sets of vertices corresponding to particular friends-smokers in the derived ground Markov network.
         The lifted patterns are typically encoded in relational logic-based languages.
         Here we contribute a method for (deep) lifted feed-forward neural network learning, in which the
         ground network structure is unfolded from a set of weighted rules in relational logic. The relational
         rules are instantly interpretable and can be handcrafted by a domain expert or learned, e.g. through
         techniques of Inductive Logic Programming (ILP) [5]. Weights of the ground neural networks are
          determined by the weighted relational rules and can be learned by a stochastic gradient descent al-
          gorithm. This means that weights between different ground neurons constructed from the same
         relational rule are tied in our framework, similarly to how weights are shared in lifted graphical
         models in statistical relational learning or how weights are tied together by application of filters in
         convolutional neural networks in deep learning. A salient property of our approach distinguishing it
         from previous studies on adapting neural networks for relational learning is that the ground network
         structure depends not only on the relational rule set but also on a particular example, i.e., differ-
         ent networks are constructed for different examples to exploit their particular relational properties.
         However, the different networks share their weights as these are all bound to the relational rules, and
         so weight-updates performed for one training example are reflected in networks produced for other
         examples, which allows the model to learn directly from relational data.
             ∗
                 Corresponding author.



The main advantage of the presented approach is that it can effectively learn weights of latent non-
ground relational structures, which is close in principle to predicate invention in ILP [5]. This is a
difficult task for existing lifted systems based on probabilistic inference because there one typically
needs to run expensive expectation maximization algorithms in order to learn parameters when la-
tent structures are present. On the other hand, deep neural networks, which we exploit in our work,
have been shown to effectively learn latent structures, although obviously only in the ground non-
relational settings. By combining relational logic with deep neural networks, we obtain a framework
flexible enough to learn weights of latent relational structures, which we also verify experimentally.
While there have been several works combining propositional or relational logic with neural net-
works [21, 3, 6], none of the existing methods is able to learn weights of latent non-ground relational
structures.¹


2    Lifted Relational Neural Networks

A lifted relational neural network (LRNN) N is a set of weighted definite clauses, i.e. pairs (Ri , wi )
where Ri is a function-free definite clause and wi is a real number. When N is a set of weighted
definite clauses, N∗ will denote the corresponding set of the definite clauses without weights, i.e.
N∗ = {C : (C, w) ∈ N}. The set N must satisfy the following non-recursiveness² requirement:
there must exist a strict ordering ≺ of predicates such that if there is a rule with a predicate p1 in the
head and a predicate p2 in the body then p1 ≺ p2 .
Given a LRNN N, let H be the least Herbrand model of N∗. We define the grounding of the LRNN N
as N̄ = {(hθ ← b1θ ∧ · · · ∧ bkθ, w) : (h ← b1 ∧ · · · ∧ bk, w) ∈ N and {hθ, b1θ, . . . , bkθ} ⊆ H}.
That is, N̄ is defined as the set of ground definite clauses which can be obtained by grounding rules
from the LRNN and which are active in the least Herbrand model of N∗ (a rule is active in H if its
body is true in H). As already outlined in the Introduction, LRNNs are templates for creating ground
neural networks. The requirement that ground rules be active in H is beneficial in practice because
it provides us with flexibility in controlling the complexity and tractability of the constructed
ground neural networks.
Example 1. Let
               N ={(mother(C, M ) ← parent(C, M ) ∧ female(M ), 1),
                  (father(C, F ) ← parent(C, F ) ∧ male(F ), 2),
                  (female(alice), 1), (parent(bob, alice), 1), (parent(eve, alice), 1)}.
Then for its grounding we have
               N̄ = {(mother(bob, alice) ← parent(bob, alice) ∧ female(alice), 1),
                  (mother(eve, alice) ← parent(eve, alice) ∧ female(alice), 1),
                  (female(alice), 1), (parent(bob, alice), 1), (parent(eve, alice), 1)}.

Notice that N̄ does not contain the predicates male/1 or father/2 as there are no ground atoms based
on them in the least Herbrand model of N∗.
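To make the grounding step concrete, the following minimal Python sketch (our own illustration, not the
authors' implementation) computes the least Herbrand model of N∗ from Example 1 by naive forward
chaining and then collects the active ground rules; atoms are represented as plain tuples and the weighted
fact clauses of N̄ simply carry over unchanged.

    from itertools import product

    # Weighted definite clauses from Example 1; upper-case arguments denote logic variables.
    rules = [
        (("mother", "C", "M"), (("parent", "C", "M"), ("female", "M")), 1.0),
        (("father", "C", "F"), (("parent", "C", "F"), ("male", "F")), 2.0),
    ]
    facts = [(("female", "alice"), 1.0), (("parent", "bob", "alice"), 1.0), (("parent", "eve", "alice"), 1.0)]
    constants = sorted({arg for atom, _ in facts for arg in atom[1:]})

    def is_var(term):
        return term[0].isupper()

    def substitute(atom, theta):
        return (atom[0],) + tuple(theta.get(t, t) for t in atom[1:])

    def substitutions(atoms):
        """Enumerate all groundings theta of the variables occurring in the given atoms."""
        variables = sorted({t for atom in atoms for t in atom[1:] if is_var(t)})
        for values in product(constants, repeat=len(variables)):
            yield dict(zip(variables, values))

    # Least Herbrand model H of N*: start from the facts and forward-chain to a fixpoint.
    H = {atom for atom, _ in facts}
    changed = True
    while changed:
        changed = False
        for head, body, _ in rules:
            for theta in substitutions((head,) + body):
                if all(substitute(b, theta) in H for b in body):   # the rule body is true in H
                    h = substitute(head, theta)
                    if h not in H:
                        H.add(h)
                        changed = True

    # Grounding: all ground rule instances that are active in H, each keeping its template weight.
    grounding = [
        (substitute(head, theta), tuple(substitute(b, theta) for b in body), w)
        for head, body, w in rules
        for theta in substitutions((head,) + body)
        if all(substitute(b, theta) in H for b in body)
    ]
    for head, body, w in grounding:
        print(head, "<-", body, "  weight:", w)
    # Only the two mother/2 instances appear; father/2 never becomes active since there are no male/1 facts.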
Definition 1. Let N be a LRNN, and let N̄ be its grounding. Let g∨, g∧ and g∧∗ be families of
multivariate functions with exactly one function for each number of arguments. The ground neural
network of N is a feedforward neural network constructed as follows.

        • For every ground atom h occurring in N̄, there is a neuron Ah, called atom neuron. The
         activation functions of atom neurons are from the family g∨ .
        • For every ground fact (h, w) ∈ N̄, there is a neuron F(h,w), called fact neuron, which has
         no input and always outputs a constant value.
    ¹ Exemplification of latent non-ground relational concept learning and a more detailed description of Lifted
Relational Neural Networks can be found in [20].
    ² The reason why we do not allow recursion will become clearer when we explain weight learning in the next
section. Here, we just note that rule sets without recursion allow us to directly exploit gradient descent
training of feed-forward neural networks.


            • For every ground rule hθ ← b1θ ∧ · · · ∧ bkθ ∈ N̄∗, there is a neuron Rhθ←b1θ∧···∧bkθ,
              called rule neuron. It has the atom neurons Ab1θ, . . . , Abkθ as inputs, all with weight 1.
              The activation functions of rule neurons are from the family g∧.
            • For every rule (h ← b1 ∧ · · · ∧ bk, w) ∈ N and every hθ ∈ H, there is a neu-
              ron Agg^hθ_(h←b1∧···∧bk,w), called aggregation neuron. Its inputs are all rule neurons
              Rhθ′←b1θ′∧···∧bkθ′ where hθ = hθ′, with all weights equal to 1. The activation functions of
              the aggregation neurons are from the family g∧∗.
            • Inputs of an atom neuron Ahθ are the aggregation neurons Agg^hθ_(h←b1∧···∧bk,w) and fact
              neurons F(hθ,w). The weights of the input neurons are the respective w's.
Example 2. Let us consider the following LRNN
  N ={(foal(A) ← parent(A, P ) ∧ horse(P ), wm ), (foal(A) ← sibling(A, S) ∧ horse(S), wn ),
     (horse(dakotta), w1 ), (horse(cheyenne), w2 ), (horse(aida), w3 ),
     (parent(star, aida), w6 ), (parent(star, cheyenne), w5 ), (sibling(star, dakotta), w4 )}.
The LRNN N and its ground neural network are shown in Fig. 1.
[Figure 1 graphic omitted in this text version; only the caption is reproduced below.]

Figure 1: Depiction of the rule-based template (left) of the LRNN N from Ex. 2, and its corresponding
ground neural network N̄ (right), with colors denoting the predicate signatures, rectangular nodes
corresponding to ground literals and circular nodes to lifted literals, respectively.
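To make the construction of Definition 1 concrete, the following Python sketch (an illustration using our
own data representation, not the authors' code) assembles the structure of a ground network as a mapping
from each neuron to its weighted inputs; the rule identifiers, the helper name build_ground_network and
the numeric values standing in for the weights wm, wn and w1–w6 are placeholders of ours.

    from collections import defaultdict

    def build_ground_network(ground_rules, facts):
        """Ground-network structure of Definition 1.

        ground_rules: iterable of (lifted_rule_id, ground_head, ground_body, template_weight),
        facts:        iterable of (ground_atom, weight).
        Returns a dict mapping each neuron to a list of (input neuron, weight) pairs.
        """
        inputs = defaultdict(list)
        for atom, w in facts:                     # fact neurons: constant outputs feeding the atom neurons
            inputs[("atom", atom)].append((("fact", atom), w))
        linked = set()
        for rule_id, head, body, w in ground_rules:
            rule_neuron = ("rule", rule_id, head, body)
            for b in body:                        # rule-neuron inputs: the body atom neurons, all with weight 1
                inputs[rule_neuron].append((("atom", b), 1.0))
            agg_neuron = ("agg", rule_id, head)   # one aggregation neuron per (lifted rule, ground head)
            inputs[agg_neuron].append((rule_neuron, 1.0))
            if (rule_id, head) not in linked:     # connect it to the atom neuron once, with the rule weight
                linked.add((rule_id, head))
                inputs[("atom", head)].append((agg_neuron, w))
        return dict(inputs)

    # Grounding of Example 2 (placeholder numeric weights).
    ground_rules = [
        ("foal<-parent", ("foal", "star"), (("parent", "star", "aida"), ("horse", "aida")), 0.5),
        ("foal<-parent", ("foal", "star"), (("parent", "star", "cheyenne"), ("horse", "cheyenne")), 0.5),
        ("foal<-sibling", ("foal", "star"), (("sibling", "star", "dakotta"), ("horse", "dakotta")), 0.3),
    ]
    facts = [(("horse", "aida"), 1.0), (("horse", "cheyenne"), 1.0), (("horse", "dakotta"), 1.0),
             (("parent", "star", "aida"), 1.0), (("parent", "star", "cheyenne"), 1.0),
             (("sibling", "star", "dakotta"), 1.0)]
    net = build_ground_network(ground_rules, facts)
    print(net[("atom", ("foal", "star"))])        # two aggregation-neuron inputs, one per lifted rule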

What distinguishes LRNNs from ordinary neural networks the most is the following property. Hav-
ing a pre-trained LRNN N described by some general rules, we can extend it with description of
a particular case to obtain a ground neural network and then use the latter for prediction. This is
similar in spirit to lifted graphical models.
Example 3. For instance, N may describe general rules for explosiveness of molecules (e.g., rep-
resented by a predicate explosive) and M1 and M2 may be sets of (weighted) facts describing two
particular molecules. Then to use the LRNN N for predicting whether M1 and M2 are explosive,
we can simply construct ground NNs of N ∪ M1 and N ∪ M2 , and compute the output of the
respective atom neurons explosive1 ∈ N ∪ M1 and explosive2 ∈ N ∪ M2 . As a distinctive feature
of lifted models, the two ground LRNNs for the two example molecules may have very different size
and structure because the least Herbrand models of N∗ ∪ M1∗ and of N∗ ∪ M2∗, which determine
the structures of the ground LRNNs, may be very different (because the structure and the size of the
molecules described by M1 and M2 are different). An illustration of this effect, for two example
molecules and a template N from Fig. 2, is displayed in Fig. 3.
Depending on the used families of activation functions g∨, g∧ and g∧∗, we can obtain neural networks
with different behavior. For intuitiveness, in order for rules (h ← b1 ∧· · ·∧bk , w) to behave similarly
to “if-then” rules, we should prefer the outputs of rule neurons to be high (e.g. close to 1) if and
only if all the inputs from the atom neurons corresponding to the literals from the body of the
rule have high outputs. Similarly, we should prefer the output of the atom neurons, which should
intuitively behave similarly to disjunction, to be high if and only if at least one of the rule neurons or
fact neurons, which are inputs for the given atom neuron, has high output. Logical operators from
various fuzzy logics [11] may serve as an inspiration for selecting suitable activation functions.


[Figure 2 graphic omitted in this text version; only the caption is reproduced below.]
Figure 2: Two example molecules (left), described by surrounding sets of ground facts M1 and M2 ,
are being merged with the lifted LRNN N , composed of general weighted rules loosely pointing to
explosiveness of molecules (right), to form two ground networks displayed in Fig. 3. The rules
in N provide adaptive means to create latent groups (gri ) of atom types (O . . . H) that, through a
bond predicate (b(A, B)) connecting couples of atoms, form relational features (e.g., f1 (A, B) ←
gr1 (A) ∧ bond(A, B) ∧ gr2 (B)), which set the basis for the final explosiveness output. For the sake
of space we assume a single relational (graphlet) feature f1 only.


[Figure 3 graphic omitted in this text version; only the caption is reproduced below.]
Figure 3: Two groundings N ∪ M1 and N ∪ M2 formed by merging the two example molecules
with the LRNN N from Fig. 2. The shared predicate signatures and weights tied by the template
are denoted by colors. For the sake of space we display only the ground rule sets instead of the complete
ground networks (i.e., fact and aggregation neurons are omitted); Fig. 1 illustrates the (direct)
correspondence of such a set to a full ground neural network.


Example 4. In Gödel fuzzy logic, the conjunction b1 ∧ · · · ∧ bk, where the bi are fuzzy logic literals, is given
as min_i b_i and the disjunction b1 ∨ · · · ∨ bk is given as max_i b_i. To emulate reasoning in Gödel logic,
we could simply set g∧(b1, . . . , bk) = min_i b_i, g∧∗(b1, . . . , bm) = max_i b_i, and g∨(b1, . . . , bm) =
max_i b_i. Here, the output of any rule neuron Rh←b1∧···∧bk is the minimum value which makes the
fuzzy truth value of the implication h ← b1 ∧ · · · ∧ bk equal to 1 in Gödel fuzzy logic. Likewise,
the output of any aggregation neuron is the minimum value which makes the fuzzy truth value of all
the respective ground implications equal to 1 simultaneously. This way, LRNNs can emulate fuzzy
logic programming.

Next, we introduce two particular collections of activation functions inspired by fuzzy logic which
will be used in the experiments (note that the activation functions shown in the above example would
not be very suitable for gradient-based learning).


Definition 2 (Max-Sigmoid Activation Functions). The Max-Sigmoid (MS) collection of activation
functions is composed of the following three families of functions: g∧(b1, . . . , bk) = sigm(Σ_{i=1}^{k} b_i − k + b_0),
g∧∗(b1, . . . , bm) = max_i b_i, and g∨(b1, . . . , bk) = sigm(Σ_{i=1}^{k} b_i + b_0).


The rationale for this family of activation functions is as follows. As already mentioned, the activa-
tion function g∧ should have high output if and only if all its inputs are high. To achieve this, we can
crudely approximate the Łukasiewicz fuzzy conjunction, which is given as max{0, b1 + · · · + bk − k + 1},
by the function sigm(b1 + · · · + bk − k + b0). The activation function g∧∗ outputs the value equal
to the highest of its inputs. Example 5 illustrates that this can be seen as finding the best “match”
of a pattern (rule). The activation function g∨ should have high output if at least one of the inputs
is high or if all inputs are somewhat high. To satisfy this, we can crudely approximate the Łukasiewicz
fuzzy disjunction, which is given as min{1, b1 + · · · + bk}, by the function sigm(b1 + · · · + bk + b0).
Example 6 illustrates the intuition for the activation function g∨ .
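The Max-Sigmoid family can be transcribed into code directly. The following NumPy sketch is our own
rendering of Definition 2; exposing the offset b0 as an argument is an implementation choice of ours, and
how b0 is parameterized per neuron is left open here.

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Max-Sigmoid (MS) family of Definition 2.
    def g_and(inputs, b0=0.0):     # rule neurons: soft, Lukasiewicz-style conjunction of the body atoms
        b = np.asarray(inputs, dtype=float)
        return float(sigm(b.sum() - len(b) + b0))

    def g_and_star(inputs):        # aggregation neurons: the best-matching grounding of the rule
        return float(max(inputs))

    def g_or(inputs, b0=0.0):      # atom neurons: soft disjunction over the rules and facts with the same head
        return float(sigm(np.sum(inputs) + b0))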
Example 5. Let us consider the LRNN
N ={(hasBrightEdge ← isBright(E), 1), (isBright(E) ← edge(E, U, V ) ∧ bright(U ) ∧ bright(V ),
   1), (bright(U ) ← yellow(U ), 2), (bright(U ) ← red(U ), 1), (bright(U ) ← blue(U ), 0.5)}.
Let us also have a set G describing a graph with colored vertices.
     G ={(edge(e1 , v1 , v2 ), 1), (edge(e2 , v2 , v3 ), 1), (edge(e3 , v3 , v4 ), 1), (edge(e4 , v4 , v1 ), 1),
        (red(v1 ), 1), (blue(v2 ), 1), (yellow(v3 ), 1), (yellow(v4 ), 1)}
The output of the atom neuron AhasBrightEdge will only depend on the “brightest edge”, i.e. in this
case on the edge e3. The output would be the same for any other colored graph G′, which would
also contain an edge connecting two yellow vertices. Thus, for instance, if we considered some
physicochemical property of atoms (e.g. their partial charge) instead of brightness of colors, and
molecules instead of colored graphs, the corresponding networks could detect presence of a molec-
ular substructure similar to a prescribed pattern.
Example 6. Let us have the LRNN
          N ={(highPressure(X) ← stressed(X), 1), (highPressure(X) ← obese(X), 1),
             (highPressure(X) ← exercises(X), −1)}
and the set of weighted facts P = {(stressed(alice), 1), (obese(alice), 1), (stressed(bob), 1),
(exercises(bob), 1)}. Outputs of aggregation neurons corresponding to rules from N with the same
predicate in the head are combined using the activation functions g∨ . Intuitively, rules and facts with
the same predicate in the head can be seen as forming a logistic regression on the values given by the
aggregation neurons from the lower layers. When the LRNN has just one layer, as in this example,
one can achieve the same effect using techniques from propositionalization [12] – treating the bodies
of the rules as features and feeding them as attributes to a logistic regression classifier. However, as
soon as the LRNN has more layers, this effect cannot be emulated using propositionalization. In this
particular example, if we construct the ground LRNN of N ∪ P then the output of the atom neuron
AhighPressure(alice) will be higher than the output of the atom neuron AhighPressure(bob) (because alice is
stressed and obese whereas bob is just stressed and exercises).

The Max-Sigmoid activation function is obviously not the only one possible. It is useful when we
are interested in detecting one or more patterns (such as the existence of an edge as bright as possible
in Example 5) but less useful in situations similar to the one depicted in the next example.
Example 7. Let us consider the following simple LRNN for predicting individuals infected by flu
                   N ={(hasFlu(A) ← friends(A, B) ∧ hasFluDiagnosed(B), 1)}
and a set of weighted ground facts P about a group of people and their friendships. If we constructed
the ground neural networks of N ∪ P using the activation functions from the Max-Sigmoid family
then the prediction of whether an individual has flu would be entirely based on the existence of at
least one person who already had flu diagnosed. It would be obviously more meaningful to base the
predictions on the fraction of one’s friends who had flu diagnosed.


A family of activation functions which is more appropriate in situations similar to the one described
in the above example is given by the next definition.
Definition 3 (Avg-Sigmoid Activation Functions). The Avg-Sigmoid (AS) collection of activation
functions is composed of the following three families of functions: g∧(b1, . . . , bk) = sigm(Σ_{i=1}^{k} b_i − k + b_0),
g∧∗(b1, . . . , bm) = (1/m) Σ_{i=1}^{m} b_i, and g∨(b1, . . . , bk) = Σ_{i=1}^{k} b_i + b_0.

Another advantage of the Avg-Sigmoid family of activation functions over the Max-Sigmoid family
is that the functions from the Avg-Sigmoid family are everywhere differentiable (which sim-
plifies learning). We note that other activation function families based on combinations of different
aggregation functions might also be exploited for LRNN learning. Further learning scenarios and
LRNN modeling constructs can be found in [20].
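For comparison, the Avg-Sigmoid family of Definition 3 differs from the sketch above only in the
aggregation and output parts; again this is a sketch of ours rather than reference code.

    import numpy as np

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Avg-Sigmoid (AS) family of Definition 3; g_and is the same as in the MS family.
    def g_and(inputs, b0=0.0):
        b = np.asarray(inputs, dtype=float)
        return float(sigm(b.sum() - len(b) + b0))

    def g_and_star(inputs):        # aggregation neurons: average over all groundings of the rule
        return float(np.mean(inputs))

    def g_or(inputs, b0=0.0):      # atom neurons: plain weighted sum plus offset, everywhere differentiable
        return float(np.sum(inputs) + b0)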

3       Weight Learning
Let us have a LRNN N and a set of training examples E = {E1, . . . , Em} where each Ej is some
structure represented by a set of weighted propositions (e.g., the left part of Fig. 2), i.e. a LRNN
containing only facts.³ Let us also have a set Q = {{(q_1^1, t_1^1), . . . , (q_{k1}^1, t_{k1}^1)}, . . . ,
{(q_1^m, t_1^m), . . . , (q_{km}^m, t_{km}^m)}} where the q_i^j are ground atoms, which we call training query
atoms, and the t_i^j are their target values. For any query atom q_i^j, let y_i^j denote the output of the
atom neuron A_{q_i^j} in the ground neural network of N ∪ Ej. The goal of the learning process is to find
weights wh of the rules (and possibly facts) in N minimizing cost J on the training query atoms,
J(Q) = Σ_{j=1}^{m} Σ_{i=1}^{kj} cost(y_i^j, t_i^j), where cost is some predefined cost function which measures
the discrepancy between the output of the atom neurons of the training query atoms and their desired
target values. Similarly to conventional NNs, weight adaptation is performed by gradient descent steps
wh ← wh − γ ∂J(Q)/∂wh, where γ is some given learning rate. The main difference is that in the case
of LRNNs, the ground neural networks may be very different for different learning examples Ej.
However, this is not a fundamental problem because the weights for all the ground neural networks
N ∪ Ej are fully specified in the LRNN N. Moreover, the weights from N can be repeated multiple
times within a single N ∪ Ej, but since
recursion is not allowed, the same weight can appear at most once on any simple path from a fact
neuron to an atom neuron. Therefore it is possible to learn the weights using a conventional online
stochastic gradient descent algorithm⁴, except that the increments for the shared weights must be
accumulated, which is a simple consequence of linearity of partial differentiation.
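In other words, if a template weight wh is instantiated as copies wh(1), . . . , wh(c) at different places of a
single ground network, then by the linearity of partial differentiation

                             ∂J/∂wh = Σ_{l=1}^{c} ∂J/∂wh(l),

so the gradient increments computed for the individual copies are simply summed before the shared
template weight wh is updated.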
Specifically, our weight-learning algorithm works as follows. First, it grounds the given LRNN N
w.r.t. every example Ej from the dataset, which yields a set of ground neural networks N ∪ Ej with
shared weights (it keeps track of the origin of each weight so that it can update the respective weights
in the template in each step of the iteration). It then iterates over the ground networks in a random
order, computes the gradient of the error function for the current example given the current weights in
the template, updates the weights accordingly, and continues iterating these steps (i.e., the standard
stochastic gradient descent procedure). In order to reduce the risk of
getting stuck in poor quality local optima, we also employ a restart strategy for this algorithm. A
more detailed version of LRNN weight learning can be found in [20].
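In Python, the loop just described might look as follows. This is a schematic sketch of ours rather than
the authors' implementation: the squared-error cost, the numerical gradient, the restriction to one query
atom per example and the forward function (which evaluates the query atom neuron of one ground network
under the current template weights) are illustrative assumptions.

    import random

    def train(template_weights, ground_examples, targets, forward, epochs=100, lr=0.1, eps=1e-5):
        """Stochastic gradient descent with weights tied to the LRNN template.

        template_weights: dict rule/fact name -> shared weight wh,
        ground_examples:  per-example ground-network descriptions (one per training example Ej),
        targets:          target values of the query atoms, aligned with ground_examples,
        forward:          function (ground_example, template_weights) -> output of the query atom neuron.
        """
        for _ in range(epochs):
            order = list(range(len(ground_examples)))
            random.shuffle(order)                      # iterate over the ground networks in random order
            for j in order:
                example, target = ground_examples[j], targets[j]
                base = (forward(example, template_weights) - target) ** 2   # squared-error cost (illustrative)
                grad = {}
                for name in template_weights:
                    bumped = dict(template_weights)    # numerical gradient w.r.t. the shared template weight;
                    bumped[name] += eps                # increments over all copies of wh accumulate automatically
                    grad[name] = ((forward(example, bumped) - target) ** 2 - base) / eps
                for name in template_weights:
                    template_weights[name] -= lr * grad[name]   # update the template; all ground networks see it
        return template_weights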

4       Related Work
The main inspiration for the work presented in this paper comes from lifted graphical models such as Markov
logic networks [18] or Bayesian logic programs [9]. However, none of these existing lifted graph-
ical models is particularly well suited for learning parameters of latent relational structures. Our
approach is also generally related to prior art in combining logical rules with neural networks, also
known as neural-symbolic integration [4], such as in the KBANN system. While KBANN [21]
also constructs the network structure from given rules, these rules are propositional rather than re-
lational and do not serve as a lifted template. Therefore it is impossible to learn relational latent
    ³ The restriction of learning from facts only is actually not necessary but it simplifies this presentation.
    ⁴ Learning is slightly more complicated for LRNNs with the Max-Sigmoid family of activation functions
because the max operator introduces non-differentiable points to the optimization problem.


structures such as soft clustering of first-order-logic constants. A more recent system, CILP++ [6],
utilizes a relational representation, which is however converted into a propositional form through a
propositionalization technique [12]. This again means that latent non-ground relational structures
cannot be learned by CILP++ either. The somewhat more closely related FONN method [3] also
forms a network from a relational rule set; however, this rule set is flat, producing only one-layer
(shallow) networks in which relational patterns are not hierarchically aggregated. While there are
many other neural-symbolic integration approaches aiming at relational
(and first-order) representations [1], e.g. based on the CORE method [8], they typically search for
a uniform model of the logic program in scope and thus principally differ from the presented lifted
modeling approach.
While standard feed-forward neural networks can be seen as a special case of LRNNs, since any
such fixed neural architecture can be encoded in a corresponding ground rule set with respective
activation functions, a salient aspect of our method is that it allows for learning from structured
(relational) examples, rather than just attribute vectors. There has been previous work on adapting
neural networks to cope with certain facets of relational representations. For example, extension to
multi-instance learning was presented in [17]. A similarly directed work [2] facilitated aggregative
reasoning to process sets of related tuples from a relational database as a sequence through a recur-
rent neural network structure, which was also presented for more general structures in [19]. These
approaches are principally different from the presented method as they do not follow the lifted mod-
eling strategy to cope with variations in structure of relational samples.


5    Experiments

In this section we describe experiments performed on 78 datasets of organic molecules: the Mutagenesis
dataset [15], four datasets from the predictive toxicology challenge and 73 NCI-GI datasets [16].
The Mutagenesis dataset contains 188 molecules with labels denoting their mutagenicity. A number
of the results published on the mutagenesis dataset use an extended set of features, providing additional
expert knowledge on properties of molecules, degrading the role of learning capabilities in relational
models (i.e. the expert features are discriminative enough by themselves). We do not use any of the
extra features as we utilize only atom-bond information. The predictive toxicology challenge dataset
(PTC) [7] is composed of four datasets of molecules labeled by their toxicity for female rats (fr),
female mice (fm), male rats (mr) and male mice (mm). Each of the NCI-GI datasets contains several
thousands of molecules labeled by their ability to inhibit growth of different types of tumors. We
compare the performance of LRNNs to the state-of-the-art relational learners kFOIL [13] and nFOIL [14],
where kFOIL combines relational rule learning with support vector machines and nFOIL combines
relational rule learning with naive Bayes learning.
For LRNNs we use a simple hand-crafted template which is principally identical to the template
discussed in Figure 2. Using such a generic template for all the datasets, we make sure that there is
no additional expert knowledge involved.⁵ The idea is that, in the process of learning, useful latent
relational concepts are created within the neural network by the means of weight adaptation rather
than by explicit enumeration, in contrast to propositional approaches and ILP [5]. Indeed, none of
the rules used in this template is useful on its own for prediction as a hard logic rule without weight
adaptation.
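For concreteness, a template of roughly the shape described above (cf. Fig. 2) can be written as the
following weighted rule set; this is an illustrative reconstruction only, as the exact number of latent groups
gri and relational features fj, the name of the target predicate and the initial weights used in the
experiments may differ:

    N = {(gr1(A) ← c(A), wc1), (gr1(A) ← o(A), wo1), . . . , (gr1(A) ← h(A), wh1),
         (gr2(A) ← c(A), wc2), (gr2(A) ← o(A), wo2), . . . , (gr2(A) ← h(A), wh2),
         (f1(A, B) ← gr1(A) ∧ bond(A, B) ∧ gr2(B), wb1), . . . ,
         (class ← f1(A, B), wf1), . . .}

Here the gri rules softly cluster the atom types occurring in the data, the fj rules combine two such
clusters connected by a bond into latent relational features, and the final rules aggregate these features
into the class prediction.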
To set the parameters of LRNNs we use the empirical risk minimization principle on the training
cross-validation folds to select the parameters such as step size, restarts, number of iterations, etc.
This way we obtain unbiased estimates of performance of our methods since test data is never
involved in parameter selection. The time for training a LRNN was on the order of a few hours for
the larger NCI-GI datasets. The results of the experiments are summarized in Figure 4. LRNNs
perform clearly the best of the three algorithms in terms of accuracy, as they have lower prediction error
than kFOIL and nFOIL on a significant majority of the datasets. We also tried to compare LRNNs with
another recent algorithm combining logic and neural networks, CILP++ [6], but we were not
able to set up a proper relational representation for CILP++ and thus a direct comparison remains
part of our future work.

   ⁵ I.e., the template does not relate to toxicity or any other specific property of molecules and might just as
well be used for other classification tasks.


[Figure 4: bar chart "LRNN test error comparison" showing the cross-validated test error (y-axis, 0 to 0.5)
of nFOIL, kFOIL and LRNN on each of the 78 molecular datasets (x-axis); graphic omitted, caption below.]
Figure 4: Prediction errors of LRNNs, kFOIL and nFOIL measured by cross-validation on 78
datasets of organic molecules.



In order to test the discovery of latent relational concepts, we performed an additional experiment
with the Mutagenesis dataset. The relational concepts we were interested in were chains of varying
lengths (up to 5 atoms). We trained the resulting LRNN to optimize the template's weights; however,
here we were more interested in extracting the learned patterns. We determined the chains of atoms
which gave the highest output for the learned latent predicates. We obtained the following atom
chain structures: C-C-F, N-O, C-Cl, C-Br, C-C-O, O-N-C. At least some of these structures appear
to be directly relevant to mutagenicity, as they are organic structures containing halogen
atoms (Br, F and Cl). The other structures may be relevant to mutagenicity in combination with
other structures.
Further, we have also investigated and confirmed the capability of LRNNs to learn proper weights of
the latent non-ground relational concepts. This can be demonstrated, e.g., on soft clustering of FOL
predicates; an example can be found in [20], together with an evaluation and a closer
description of latent relational concept learning with Lifted Relational Neural Networks.



6               Conclusions

In this paper, we have introduced a method combining relational-logic representations with feedfor-
ward neural networks. The introduced method is close in spirit to lifted graphical models as it can
be viewed as providing a lifted template for construction of ground neural networks. The performed
experiments indicate that it is possible to achieve state-of-the-art predictive accuracies by weight
learning with very generic templates and that the method is able to induce notable latent relational concepts.
There are many directions for future work, including structure learning, transfer learning and studying
different collections of activation functions. An important future direction is also the question of
extending LRNNs to support recursion.


Acknowledgments

GS and FŽ are supported by the Cisco-sponsored research project “Modelling network traffic with re-
lational features”. While with CTU, OK was supported by the Czech Science Foundation through
project P202/12/2032 and now by a grant from the Leverhulme Trust (RPG-2014-164).


References
 [1] Sebastian Bader and Pascal Hitzler. Dimensions of Neural-symbolic Integration - A Structured
     Survey. arXiv preprint, page 28, 2005.
 [2] H. Blockeel and W. Uwents. Using neural networks for relational learning. In ICML-2004
     Workshop on Statistical Relational Learning and its Connection to Other Fields, 2004.
 [3] M. Botta, A. Giordana, and R. Piola. Combining first order logic with connectionist learning. In
     Proceedings of the 14th International Conference on Machine Learning, 1997.
 [4] Artur S. d’Avila Garcez, Krysia Broda, and Dov M. Gabbay. Neural-Symbolic Learning Sys-
     tems: Foundations and Applications. 2012.
 [5] Luc De Raedt. Logical and Relational Learning. Springer, 2008.
 [6] Manoel V. M. França, Gerson Zaverucha, and Artur S. d'Avila Garcez. Fast relational learning
     using bottom clause propositionalization with artificial neural networks. Machine learning,
     94(1):81–104, 2014.
 [7] Christoph Helma, Ross D. King, Stefan Kramer, and Ashwin Srinivasan. The predictive toxi-
     cology challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.
 [8] Steffen Hölldobler, Yvonne Kalinke, and Hans Peter Störr. Approximating the semantics of
     logic programs by recurrent neural networks. Applied Intelligence, 11(1):45–58, 1999.
 [9] Kristian Kersting and Luc De Raedt. Towards combining inductive logic programming with
     bayesian networks. In Inductive Logic Programming, 11th International Conference, ILP
     2001, Strasbourg, France, September 9-11, 2001, Proceedings, pages 118–131, 2001.
[10] A. Kimmig, L. Mihalkova, and L. Getoor. Lifted graphical models: a survey. Machine Learning,
     99(1):1–45, 2015.
[11] George Klir and Bo Yuan. Fuzzy sets and fuzzy logic, volume 4. Prentice Hall New Jersey,
     1995.
[12] Mark-A Krogel, Simon Rawles, Filip Železný, Peter A Flach, Nada Lavrač, and Stefan Wrobel.
     Comparative evaluation of approaches to propositionalization. Springer, 2003.
[13] N. Landwehr, A. Passerini, L. De Raedt, and P. Frasconi. kFOIL: learning simple relational
     kernels. In AAAI’06: Proceedings of the 21st national conference on Artificial intelligence,
     pages 389–394. AAAI Press, 2006.
[14] Niels Landwehr, Kristian Kersting, and Luc De Raedt. Integrating naive bayes and foil. The
     Journal of Machine Learning Research, 8:481–507, 2007.
[15] Huma Lodhi and Stephen Muggleton. Is mutagenesis still challenging? ILP-Late-Breaking
     Papers, 35, 2005.
[16] Liva Ralaivola, Sanjay J. Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chem-
     ical informatics. Neural Netw., 18(8):1093–1110, 2005.
[17] J. Ramon and L. De Raedt. Multi instance neural networks. In Proceedings of the ICML
     Workshop on Attribute-Value and Relational Learning, 2000.
[18] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-
     2):107–136, 2006.
[19] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfar-
     dini. The graph neural network model. IEEE transactions on neural networks / a publication
     of the IEEE Neural Networks Council, 20(1):61–80, January 2009.
[20] Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, and Ondrej Kuzelka. Lifted Relational
     Neural Networks. arXiv preprint, 2015.
[21] Geoffrey G. Towell, Jude W. Shavlik, and Michiel O. Noordewier. Refinement of approximate
     domain theories by knowledge-based neural networks. In Proceedings of the eighth National
     conference on Artificial intelligence, pages 861–866. Boston, MA, 1990.



