<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lifted Relational Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gustav Sˇ ourek</string-name>
          <email>souregus@fel.cvut.cz</email>
          <email>v@asch.cz</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojteˇch Aschenbrenner</string-name>
          <email>v@asch.cz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filip Zˇ elezny´</string-name>
          <email>zelezny@fel.cvut.cz</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ondrˇej Kuzˇelka</string-name>
          <email>KuzelkaO@cardiff.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cardiff University</institution>
          ,
          <addr-line>Cardiff</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Charles University</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Czech Technical University</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a method combining relational-logic representations with neural network learning. A general lifted architecture, possibly reflecting some background domain knowledge, is described through relational rules which may be handcrafted or learned. The relational rule-set serves as a template for unfolding possibly deep neural networks whose structures also reflect the structures of given training or testing relational examples. Different networks corresponding to different examples share their weights, which co-evolve during training by stochastic gradient descend algorithm. Discovery of notable latent relational concepts and experiments on 78 relational learning benchmarks demonstrate favorable performance of the method.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>Lifted models also known as templated models have attracted significant attention recently [10] in ar</title>
        <p>
          eas such as statistical relational learning. Lifted models define patterns from which specific (ground)
models can be unfolded. For example, a lifted Markov network model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] may express that friends
of smokers tend to be smokers and such a pattern then constrains the probabilistic relationships in all
sets of vertices corresponding to particular friends-smokers in the derived ground Markov network.
        </p>
        <sec id="sec-1-1-1">
          <title>The lifted patterns are typically encoded in relational logic-based languages.</title>
          <p>
            Here we contribute a method for (deep) lifted feed-forward neural network learning, in which the
ground network structure is unfolded from a set of weighted rules in relational logic. The relational
rules are instantly interpretable and can be handcrafted by a domain expert or learned, e.g. through
techniques of Inductive Logic Programming (ILP) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Weights of the ground neural networks are
determined by the weighted relational rules and can be learned by stochastic gradient descend
algorithm. This means that weights between different ground neurons constructed from the same
relational rule are tied in our framework, similarly to how weights are shared in lifted graphical
models in statistical relational learning or how weights are tied together by application of filters in
convolutional neural networks in deep learning. A salient property of our approach distinguishing it
from previous studies on adapting neural networks for relational learning is that the ground network
structure depends not only on the relational rule set but also on a particular example, i.e.,
different networks are constructed for different examples to exploit their particular relational properties.
          </p>
        </sec>
        <sec id="sec-1-1-2">
          <title>However, the different networks share their weights as these are all bound to the relational rules, and so weight-updates performed for one training example are reflected in networks produced for other examples, which allows the model to learn directly from relational data.</title>
        </sec>
        <sec id="sec-1-1-3">
          <title>Corresponding author.</title>
          <p>
            The main advantage of the presented approach is that it can effectively learn weights of latent
nonground relational structures, which is close in principle to predicate invention in ILP [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. This is a
difficult task for existing lifted systems based on probabilistic inference because there one typically
needs to run expensive expectation maximization algorithms in order to learn parameters when
latent structures are present. On the other hand, deep neural networks, which we exploit in our work,
have been shown to effectively learn latent structures, although obviously only in the ground
nonrelational settings. By combining relational logic with deep neural networks, we obtain a framework
flexible enough to learn weights of latent relational structures, which we also verify experimentally.
          </p>
        </sec>
        <sec id="sec-1-1-4">
          <title>While there have been several works combining propositional or relational logic with neural net</title>
          <p>
            works [
            <xref ref-type="bibr" rid="ref21 ref3 ref6">21, 3, 6</xref>
            ], none of the existing methods is able to learn weights of latent non-ground relational
structures 1.
2
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Lifted Relational Neural Networks</title>
      <p>A lifted relational neural network (LRNN) N is a set of weighted definite clauses, i.e. pairs (Ri; wi)
where Ri is a function-free definite clause and wi is a real number. When N is a set of weighted
definite clauses, N will denote the corresponding set of the definite clauses without weights, i.e.
N = fC : (C; w) 2 N g. The set N must satisfy the following non-recursiveness2 requirement:
there must exist a strict ordering of predicates such that if there is a rule with a predicate p1 in the
head and a predicate p2 in the body then p1 p2.</p>
      <p>Given a LRNN N , let H be the least Herbrand model of N . We define grounding of the LRNN N
as N = f(h b1 ^ ^ bk ; w) : (h b1 ^ ^ bk; w) 2 N and fh ; b1 ; : : : ; bk g Hg.
That is, N is defined as the set of ground definite clauses which can be obtained by grounding rules
from the LRNN and which are active in the least Herbrand model of N (a rule is active in H if its
body is true in H). As already outlined in Introduction, LRNNs are templates for creating ground
neural networks. The requirement that ground rules should be active in H is beneficial for practice
because it provides us with flexibility in controlling complexity and tractability of the constructed
ground neural networks.</p>
      <p>Example 1. Let</p>
      <p>N =f(mother(C; M )</p>
      <p>parent(C; M ) ^ female(M ); 1);
(father(C; F )</p>
      <p>parent(C; F ) ^ male(F ); 2);
(female(alice); 1); (parent(bob; alice); 1); (parent(eve; alice); 1)g:</p>
      <sec id="sec-2-1">
        <title>Then for its grounding we have</title>
        <p>N =f(mother(bob; alice)
(mother(eve; alice)</p>
        <p>parent(bob; alice) ^ female(alice); 1);
parent(eve; alice) ^ female(alice); 1);
(female(alice); 1); (parent(bob; alice); 1); (parent(eve; alice); 1)g:
Notice that N does not contain the predicates male=1 or father=2 as there are no ground atoms based
on them in the least Herbrand model of N .</p>
        <p>Definition 1. Let N be a LRNN, and let N be its grounding. Let g_, g^ and g^ be families of
multivariate functions with exactly one function for each number of arguments. The ground neural
network of N is a feedforward neural network constructed as follows.</p>
        <p>For every ground atom h occurring in N , there is a neuron Ah, called atom neuron. The
activation functions of atom neurons are from the family g_.</p>
        <p>For every ground fact (h; w) 2 N , there is a neuron F(h;w), called fact neuron, which has
no input and always outputs a constant value.</p>
        <p>1Exemplification of latent non-ground relational concept learning and a more detailed description of Lifted</p>
      </sec>
      <sec id="sec-2-2">
        <title>Relational Neural Networks can be found in [20].</title>
        <p>2The reason why we do not allow recursion will be clearer when we explain weight learning in the next
section. Here, we just note that rule sets without recursion will allow us to directly exploit gradient descent
training of feed-forward neural networks.
For every ground rule h b1 ^ ^ bk 2 N , there is a neuron Rh b1 ^ ^bk ,
called rule neuron. It has the atom neurons Ab1 ; : : : ; Abk as inputs, all with weight 1.
The activation functions of rule neurons are from the family g^.</p>
        <p>For every rule (h b1 ^ ^ bk; w) 2 N and every h 2 H, there is a
neuron Agg(hh b1^ ^bk;w), called aggregation neuron. Its inputs are all rule neurons
Rh 0 b1 0^ ^bk 0 where h = h 0 with all weights equal to 1. The activation functions of
the aggregation neurons are from the family g .</p>
        <p>^
Inputs of an atom neuron Ah are the aggregation neurons Agg(hh b1^ ^bk;w) and fact
neurons F(h ;w). The weights of the input neurons are the respective w’s.</p>
        <sec id="sec-2-2-1">
          <title>Example 2. Let us consider the following LRNN</title>
          <p>N =f(foal(A) parent(A; P ) ^ horse(P ); wm); (foal(A) sibling(A; S) ^ horse(S); wn);
(horse(dakotta); w1); (horse(cheyenne); w2); (horse(aida); w3);
(parent(star; aida); w6); (parent(star; cheyenne); w5); (sibling(star; dakotta); w4)g:
The LRNN N and its ground neural network are shown in Fig. 1.</p>
          <p>Facts
horse(dakota)
horse(cheyenne)</p>
          <p>horse(aida)
parent(star,aida)
parent(star,cheyenne)
sibling(star,dakotta)</p>
          <p>Rule-bodies
parent(A,P)
1
1
horse(X) 1
sibling(B,S)</p>
          <p>A=HRule-heads</p>
          <p>foal(H)
B=H</p>
          <p>Fact neurons
parent(star,aida)</p>
          <p>horse(aida)
horse(cheyenne)
horse(dakotta)
w6
w3
w2
w1
parent(star,cheyenne) w5 parent(star,cheyenne) 1</p>
          <p>_
sibling(star,dakotta) w4 sibling(star,dakotta)</p>
          <p>_
Atoms neurons
parent(star,aida)</p>
          <p>_
horse(aida)</p>
          <p>_
horse(cheyenne)</p>
          <p>_
horse(dakotta)
_</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>What distinguishes LRNNs from ordinary neural networks the most is the following property. Hav</title>
        <p>ing a pre-trained LRNN N described by some general rules, we can extend it with description of
a particular case to obtain a ground neural network and then use the latter for prediction. This is
similar in spirit to lifted graphical models.</p>
        <p>Example 3. For instance, N may describe general rules for explosiveness of molecules (e.g.,
represented by a predicate explosive) and M1 and M2 may be sets of (weighted) facts describing two
particular molecules. Then to use the LRNN N for predicting whether M1 and M2 are explosive,
we can simply construct ground NNs of N [ M1 and N [ M2, and compute the output of the
respective atom neurons explosive1 2 N [ M1 and explosive2 2 N [ M2. As a distinctive feature
of lifted models, the two ground LRNNs for the two example molecules may have very different size
and structure because the least Herbrand models of N [ M1 and of N [ M2, which determine
the structures of the ground LRNNs, may be very different (because the structure and the size of the
molecules described by M1 and M2 are different). An illustration of this effect, for two example
molecules and a template N from Fig. 2, is displayed in Fig. 3.</p>
        <p>
          Dwietphednidffinergeonntbtheheauvsieodr.fFamoriliinetsuoitfivaecntievsast,ioinnofrudnecrtifoonrsrugl_e,sg(^h andbg1^^, we^cbakn;owb)tationbneehuarvael nsiemtwiloarrklys
to “if-then” rules, we should prefer the outputs of rule neurons to be high (e.g. close to 1) if and
only if all the inputs from the atom neurons corresponding to the literals from the body of the
rule have high outputs. Similarly, we should prefer the output of the atom neurons, which should
intuitively behave similarly to disjunction, to be high if and only if at least one of the rule neurons or
fact neurons, which are inputs for the given atom neuron, has high output. Logical operators from
various fuzzy logics [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] may serve as an inspiration for selecting suitable activation functions.
        </p>
        <p>H(h1) H1
bond(h1; h2)
O(X1)
N(X2)
.
.</p>
        <p>.</p>
        <p>H(Xn)</p>
        <p>wo1
w
o2 wn1
wn2
wh1
wh2</p>
        <p>group-types
X1 = A gr1(A) 1
X2 = A
Xn=A atom-atom-bond
X1=
X B
2 =B
Xn = B gr2(B)
1</p>
        <p>A = A
wh2
gr1(h1)</p>
        <p>gr2(h1)
gr1(o1)
wo2
H(h2)
b(o1; h2)
b(h2; o1)</p>
        <p>Example 4. In Goedel fuzzy logic, conjunction b1^ ^bk, where bi are fuzzy logic literals, is given
as mini bi and disjunction b1 _ _ bk is given as maxi bi. To emulate reasoning in Goedel logic,
we could simply set g^(b1; : : : ; bk) = mini bi; g^(b1; : : : ; bm) = maxi bi; and g_(b1; : : : ; bm) =
maxi bi. Here, the output of any rule neuron Rh b1^ ^bk is the minimum value which makes the
fuzzy truth value of the implication h b1 ^ ^ bk equal to 1 in the Goedel fuzzy logic. Likewise,
the output of any aggregation neuron is the minimum value which makes the fuzzy truth value of all
the respective ground implications equal to 1 simultaneously. This way, LRNNs can emulate fuzzy
logic programming.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Next, we introduce two particular collections of activation functions inspired by fuzzy logic which will be used in the experiments (note that the activation functions shown in the above example would not be very suitable for gradient-based learning). 4</title>
        <p>Definition 2 (Max-Sigmoid Activation Functions). The Max-Sigmoid (MS)
collection of activation functions is composed of the following three families of functions:
g^(b1; : : : ; bk) = sigm Pik=1 bi k + b0 ; g^(b1; : : : ; bm) = maxi bi, and g_(b1; : : : ; bk) =
sigm</p>
        <p>Pk</p>
        <p>i=1 bi + b0 .</p>
        <p>The rationale for this family of activation functions is as follows. As already mentioned, the
activation function g^ should have high output if and only if all its inputs are high. To achieve this, we can
crudely approximate Lukasiewicz fuzzy conjunction, which is given as maxf0; b1 + +bk k+1g,
by the function sigm (b1 + + bk k + b0). The activation function g^ outputs the value equal
to the highest of its inputs. Example 5 illustrates that this can be seen as finding the best “match”
of a pattern (rule). The activation function g_ should have high output if at least one of the inputs
is high or if all inputs are somewhat high. To satisfy this, we can crudely approximate Lukasiewicz
fuzzy disjunction, which is given as minf1; b1 + +bkg by the function sigm (b1 + + bk + b0).
Example 6 illustrates the intuition for the activation function g_.</p>
        <sec id="sec-2-4-1">
          <title>Example 5. Let us consider the LRNN</title>
          <p>N =f(hasBrightEdge
isBright(E); 1); (isBright(E)
edge(E; U; V ) ^ bright(U ) ^ bright(V );
1); (bright(U )
yellow(U ); 2); (bright(U )
red(U ); 1); (bright(U )
blue(U ); 0:5)g:
Let us also have a set G describing a graph with colored vertices.</p>
          <p>G =f(edge(e1; v1; v2); 1); (edge(e2; v2; v3); 1); (edge(e3; v3; v4); 1); (edge(e4; v4; v1); 1);
(red(v1); 1); (blue(v2); 1); (yellow(v3); 1); (yellow(v4); 1)g
The output of the atom neuron AhasBrightEdge will only depend on the “brightest edge”, i.e. in this
case on the edge e3. The output would be the same for any other colored graph G0, which would
also contain an edge connecting two yellow vertices. Thus, for instance, if we considered some
physicochemical property of atoms (e.g. their partial charge) instead of brightness of colors, and
molecules instead of colored graphs, the corresponding networks could detect presence of a
molecular substructure similar to a prescribed pattern.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>Example 6. Let us have the LRNN</title>
          <p>N =f(highPressure(X)
(highPressure(X)
stressed(X); 1); (highPressure(X)</p>
          <p>
            obese(X); 1);
exercises(X); 1)g
and the set of weighted facts P = f(stressed(alice); 1); (obese(alice); 1); (stressed(bob); 1);
(exercises(bob); 1)g: Outputs of aggregation neurons corresponding to rules from N with the same
predicate in the head are combined using the activation functions g_. Intuitively, rules and facts with
the same predicate in the head can be seen as forming a logistic regression on the values given by the
aggregation neurons from the lower layers. When the LRNN has just one layer, as in this example,
one can achieve the same effect using techniques from propositionalization [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] – treating the bodies
of the rules as features and feeding them as attributes to a logistic regression classifier. However, as
soon as the LRNN has more layers, this effect cannot be emulated using propositionalization. In this
particular example, if we construct the ground LRNN of N [ P then the output of the atom neuron
AhighPressure(alice) will be higher than the output of the atom neuron AhighPressure(bob) (because alice is
stressed and obese whereas bob is just stressed and exercises).
          </p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>The Max-Sigmoid activation function is obviously not the only one possible. It is useful when we</title>
        <p>are interested in detecting one or more patterns (such as the existence of an edge as bright as possible
in Example 5) but less useful in situations similar to the one depicted in the next example.</p>
        <sec id="sec-2-5-1">
          <title>Example 7. Let us consider the following simple LRNN for predicting individuals infected by flu</title>
          <p>N =f(hasFlu(A)</p>
          <p>friends(A; B) ^ hasFluDiagnosed(B); 1)g
and a set of weighted ground facts P about a group of people and their friendships. If we constructed
the ground neural networks of N [ P using the activation functions from the Max-Sigmoid family
then the prediction of whether an individual has flu would be entirely based on the existence of at
least one person who already had flu diagnosed. It would be obviously more meaningful to base the
predictions on the fraction of one’s friends who had flu diagnosed.</p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>A family of activation functions which are more appropriate in situations similar to to the one described in the above example is given by the next definition.</title>
        <p>Definition 3 (Avg-Sigmoid Activation Functions). The Avg-Sigmoid (AS) collection of
activation functions is composed of the following three families of functions: g^(b1; : : : ; bk) =
sigm Pik=1 bi k + b0 ; g^(b1; : : : ; bm) = m1 Pim=1 bi, and g_(b1; : : : ; bk) = Pik=1 bi + b0.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Another advantage of the Avg-Sigmoid family of activation functions over the Max-Sigmoid family</title>
        <p>is also that the functions from the Avg-Sigmoid family are everywhere differentiable (which
simplifies learning). We note that other activation function families based on combinations of different
aggregation functions might also be exploited for LRNN learning. Further learning scenarios and</p>
      </sec>
      <sec id="sec-2-8">
        <title>LRNN modeling constructs can be found in [20].</title>
        <p>3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Weight Learning</title>
      <p>
        Let us have a LRNN N and a set of training examples E = fE 1; : : : ; E mg where each E j is some
structure represented by a set of weighted propositions (e.g., left part of Fig. 2), i.e. a LRNN
containing only facts3. Let us also have a set Q = ff(q11; t11); : : : ; (qk11 ; t1k1 )g ; : : : ; f(q1m; t1m); : : : ;
(qkmm ; tkmm )gg where qij are ground atoms, which we call training query atoms, and tij are their target
values. For any query atom qij , let yij denote the output of the atom neuron Aqj in the ground neural
i
network of N [ E j . The goal of the learning process is to find weights wh of the rules (and possibly
facts) in N minimizing cost J on the training query atoms J (Q) = Pjm=1 Pik=j1 cost(yij ; tij ) where
cost is some predefined cost function which measures the discrepancy between the output of the
atom neurons of the training query atoms and their desired target values. Similarly to conventional
NNs, weight adaptation is performed by gradient descent steps wh wh @@Jw(Qh ) where is some
given learning rate. The main difference is that in the case of LRNNs, the ground neural networks
may be very different for different learning examples E j . However, this is not a fundamental problem
because the weights for all the ground neural networks N [ E j are fully specified in the LRNN N .
Moreover, the weights from N can be repeated multiple times within a single N [ E j , but since
recursion is not allowed, the same weight can appear at most once on any simple path from a fact
neuron to an atom neuron. Therefore it is possible to learn the weights using conventional online
stochastic gradient descent algorithm4, except that the increments for the shared weights must be
accumulated, which is a simple consequence of linearity of partial differentiation.
Specifically, our weight-learning algorithm works as follows. First, it grounds the given LRNN N
w.r.t. every example E j from the dataset which gives it a set of ground neural networks N [ E j with
shared weights (it keeps the information about the origin of each weight so that it could update the
respective weights in the template in each step of the iteration). It then iterates over the ground
networks in a random order, computes gradient of the error function for the current particular example
given the current weights in the template, updates the weights accordingly and continues iterating
these steps (i.e., the standard stochastic gradient descent procedure). In order to reduce the risk of
getting stuck in poor quality local optima, we also employ a restart strategy for this algorithm. A
more detailed version of LRNN weight learning can be found in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        The main inspiration for the work presented in this paper are lifted graphical models such as Markov
logic networks [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] or Bayesian logic programs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, none of these existing lifted
graphical models is particularly well suited for learning parameters of latent relational structures. Our
approach is also generally related to prior art in combining logical rules with neural networks, also
known as neural-symbolic integration [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such as in the KBANN system. While the KBANN [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
also constructs the network structure from given rules, these rules are propositional rather than
relational and do not serve as a lifted template. Therefore it is impossible to learn relational latent
3The restriction of learning from facts only is actually not necessary but it will simplify this presentation.
4Learning is slightly more complicated for LRNNs with the Max-Sigmoid family of activation functions
because the max operator introduces non-differentiable points to the optimization problem.
structures such as soft clustering of first-order-logic constants. A more recent system CILP++[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
utilizes a relational representation, which is however converted into a propositional form through a
propositionalization technique [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This again means that latent non-ground relational structures
cannot be learned by CILP++ either. A somewhat more closely related paper of FONN method [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
also designs a technique forming a network from relational rule set however this rule set is flat,
producing only 1-layer (shallow) networks in which relational patterns are not hierarchically
aggregated. While there are many other approaches of neural-symbolic integration aiming at relational
(and first-order) representations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], e.g. based on the CORE method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], they typically search for
a uniform model of the logic program in scope and thus principally differ from the presented lifted
modeling approach.
      </p>
      <p>
        While standard feed-forward neural networks can be seen as a special case of LRNNs, since any
such a fixed neural architecture can be encoded in a corresponding ground rule set with respective
activation functions, a salient aspect of our method is that it allows for learning from structured
(relational) examples, rather than just attribute vectors. There has been previous work on adapting
neural networks to cope with certain facets of relational representations. For example, extension to
multi-instance learning was presented in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. A similarly directed work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] facilitated aggregative
reasoning to process sets of related tuples from relational database as a sequence through
recurrent neural network structure, which was also presented for more general structures in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. These
approaches are principally different from the presented method as they do not follow the lifted
modeling strategy to cope with variations in structure of relational samples.
5
      </p>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <sec id="sec-5-1">
        <title>In this section we describe experiments performed on 78 datasets of organic molecules: Mutagenesis</title>
        <p>
          dataset [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], four datasets from the predictive toxicollogy challenge and 73 NCI-GI datasets [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
The Mutagenesis dataset contains 188 molecules with labels denoting their mutagenicity. A number
of the results published on the mutagenesis dataset use extended set of features, providing additional
expert knowledge on properties of molecules, degrading the role of learning capabilities in relational
models (i.e. the expert features are discriminative enough by themselves). We do not use any of the
extra features as we utilize only atom-bond information. The predictive toxicology challenge dataset
(PTC) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is composed of four datasets of molecules labeled by their toxicity for female rats (fr),
mouse (fm) and male rat (mr) and mouse (mm). Each of the NCI-GI datasets contains several
thousands of molecules labeled by their ability to inhibit growth of different types of tumors. We
compare performance of LRNNs to state-of-the-art relational learners kFOIL [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and nFOIL [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ],
where kFOIL combines relational rule learninng with support vector machines and nFOIL combines
relational rule learning with naive Bayes learning.
        </p>
        <p>
          For LRNNs we use a simple hand-crafted template which is principally identical to the template
discussed in Figure 2. Using such a generic template for all the datasets, we make sure that there is
no additional expert knowledge involved 5. The idea is that in the process of learning, useful latent
relational concepts are created within the neural network by the means of weight adaptation rather
than by explicit enumeration, in contrast to propositional approaches and ILP [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Indeed, none of
the rules used in this template is useful on itself for prediction as a hard logic rule without weight
adaptation.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>To set the parameters of LRNNs we use the empirical risk minimization principle on the training</title>
        <p>
          cross-validation folds to select the parameters such as step size, restarts, number of iterations, etc.
This way we obtain unbiased estimates of performance of our methods since test data is never
involved in parameter selection. The time for training a LRNN was in the order of few hours for
the larger NCI-GI datasets. The results of the experiments are summarized in Figure 4. LRNNs
perform clearly the best of the algorithms in terms of accuracy as they have lower prediction error
than kFOIL and nFOIL on significant majority of datasets. We also tried to compare LRNNs with
another recent algorithm combining logic and neural networks called CILP++ [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], but we were not
able to set up a proper relational representation for CILP++ and thus direct comparison remains as
a part of our future work.
        </p>
        <p>5I.e., the template does not relate to toxicity or any other specific property of molecules and might be as
well used for other classification tasks, too.</p>
        <p>LRNN test error comparison
0.5
0.45
0.4
0.35
0.3</p>
        <p>In order to test the discovery of latent relational concepts, we performed an additional experiment
with the Mutagenesis dataset. The relational concepts we were interested in were chains of varying
lengths (up to 5 atoms). We trained the resulting LRNN to optimize the template’s weights, however
here we were more interested in extracting the learned patterns. We determined the chains of atoms
which gave the highest output for the learned latent predicates. We obtained the following atom
chain structures: C-C-F, N-O, C-Cl, C-Br, C-C-O, O-N-C. At least some of these structures appear
to be directly relevant for the mutagenicity as they contain organic structures containing halogen
atoms (Br, F and Cl). The other structures may be relevant to mutagenicity in combination with
other structures.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Further, we have also investigated and confirmed the capability of LRNNs to learn proper weights of the latent non-ground relational concepts. This can be demonstrated e.g. on soft clustering of FOL predicates, a demonstration of which can be found in [20], together with evaluation and a closer description of the latent relational concept learning with Lifted Relational Neural Networks.</title>
        <p>6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <sec id="sec-6-1">
        <title>In this paper, we have introduced a method combining relational-logic representations with feedfor</title>
        <p>ward neural networks. The introduced method is close in spirit to lifted graphical models as it can
be viewed as providing a lifted template for construction of ground neural networks. The performed
experiments indicate that it is possible to achieve state-of-the-art predictive accuracies by weight
learning with very generic templates and that it is able to induce notable latent relational concepts.</p>
      </sec>
      <sec id="sec-6-2">
        <title>There are many directions for future work including structure learning, transfer learning or studying different collections of activation functions. An important future direction is also the question of extending LRNNs to support recursion.</title>
        <p>Acknowledgments
GS and F Zˇ are supported by Cisco sponsored research project “Modelling network traffic with
relational features”. While with CTU, OK was supported by the Czech Science Foundation through
project P202/12/2032 and now by a grant from the Leverhulme Trust (RPG-2014-164).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Bader</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pascal</given-names>
            <surname>Hitzler</surname>
          </string-name>
          .
          <article-title>Dimensions of Neural-symbolic Integration - A Structured Survey</article-title>
          .
          <source>arXiv preprint, page 28</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H</given-names>
            <surname>Blockeel</surname>
          </string-name>
          and
          <string-name>
            <given-names>W</given-names>
            <surname>Uwents</surname>
          </string-name>
          .
          <article-title>Using neural networks for relational learning</article-title>
          .
          <source>In ICML-2004 Workshop on Statistical Relational Learning and its Connection to Other Fields</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M</given-names>
            <surname>Botta</surname>
          </string-name>
          ,
          <string-name>
            <surname>Giordana</surname>
            <given-names>A</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>R</given-names>
            <surname>Piola.</surname>
          </string-name>
          <article-title>Combining first order logic with connectionist learning</article-title>
          .
          <source>In Proceedings of the 14th International Conference on Machine Learning</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Artur S. d'Avila Garcez</surname>
          </string-name>
          , Krysia Broda, and
          <string-name>
            <surname>Dov</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Gabbay</surname>
          </string-name>
          .
          <source>Neural-Symbolic Learning Systems: Foundations and Applications</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Luc</given-names>
            <surname>De Raedt</surname>
          </string-name>
          .
          <source>Logical and Relational Learning</source>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Manoel</surname>
            <given-names>VM</given-names>
          </string-name>
          <article-title>Franc¸a, Gerson Zaverucha, and Artur S dAvila Garcez</article-title>
          .
          <article-title>Fast relational learning using bottom clause propositionalization with artificial neural networks</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>94</volume>
          (
          <issue>1</issue>
          ):
          <fpage>81</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Helma</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ross D. King</surname>
            ,
            <given-names>Stefan</given-names>
          </string-name>
          <string-name>
            <surname>Kramer</surname>
            , and
            <given-names>Ashwin</given-names>
          </string-name>
          <string-name>
            <surname>Srinivasan</surname>
          </string-name>
          .
          <article-title>The predictive toxicology challenge 2000-2001</article-title>
          . Bioinformatics,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Ho</surname>
          </string-name>
          <article-title>¨lldobler, Yvonne Kalinke, and Hans Peter Sto¨rr. Approximating the semantics of logic programs by recurrent neural networks</article-title>
          .
          <source>Applied Intelligence</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>45</fpage>
          -
          <lpage>58</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Kristian</given-names>
            <surname>Kersting and Luc De Raedt</surname>
          </string-name>
          .
          <article-title>Towards combining inductive logic programming with bayesian networks</article-title>
          .
          <source>In Inductive Logic Programming</source>
          , 11th International Conference, ILP 2001, Strasbourg, France, September 9-
          <issue>11</issue>
          ,
          <year>2001</year>
          , Proceedings, pages
          <fpage>118</fpage>
          -
          <lpage>131</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A</given-names>
            <surname>Kimmig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Mihalkova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L</given-names>
            <surname>Getoor</surname>
          </string-name>
          .
          <article-title>Lifted graphical models: a survey</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>99</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>45</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>George</given-names>
            <surname>Klir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bo</given-names>
            <surname>Yuan</surname>
          </string-name>
          .
          <article-title>Fuzzy sets and fuzzy logic</article-title>
          , volume
          <volume>4</volume>
          . Prentice Hall New Jersey,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Mark-A Krogel</surname>
          </string-name>
          , Simon Rawles, Filip Zˇ elezny´,
          <source>Peter A Flach</source>
          ,
          <string-name>
            <surname>Nada</surname>
            <given-names>Lavracˇ</given-names>
          </string-name>
          , and Stefan Wrobel.
          <article-title>Comparative evaluation of approaches to propositionalization</article-title>
          . Springer,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Landwehr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passerini</surname>
          </string-name>
          , L. De Raedt, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <article-title>kFOIL: learning simple relational kernels</article-title>
          .
          <source>In AAAI'06: Proceedings of the 21st national conference on Artificial intelligence</source>
          , pages
          <fpage>389</fpage>
          -
          <lpage>394</lpage>
          . AAAI Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Niels</surname>
            <given-names>Landwehr</given-names>
          </string-name>
          , Kristian Kersting, and Luc De Raedt.
          <article-title>Integrating naive bayes and foil</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>8</volume>
          :
          <fpage>481</fpage>
          -
          <lpage>507</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Huma</given-names>
            <surname>Lodhi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Muggleton</surname>
          </string-name>
          .
          <article-title>Is mutagenesis still challenging</article-title>
          .
          <source>ILP-Late-Breaking Papers</source>
          ,
          <volume>35</volume>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Liva</surname>
            <given-names>Ralaivola</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Sanjay J.</given-names>
            <surname>Swamidass</surname>
          </string-name>
          , Hiroto Saigo, and
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Baldi</surname>
          </string-name>
          .
          <article-title>Graph kernels for chemical informatics</article-title>
          .
          <source>Neural Netw</source>
          .,
          <volume>18</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1093</fpage>
          -
          <lpage>1110</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J</given-names>
            <surname>Ramon and L De Raedt</surname>
          </string-name>
          <article-title>. Multi instance neural networks</article-title>
          .
          <source>In Proceedings of the ICML Workshop on Attribute-Value and Relational Learning</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Richardson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Domingos</surname>
          </string-name>
          .
          <article-title>Markov logic networks</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>62</volume>
          (
          <issue>1- 2</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Franco</surname>
            <given-names>Scarselli</given-names>
          </string-name>
          , Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and
          <string-name>
            <given-names>Gabriele</given-names>
            <surname>Monfardini</surname>
          </string-name>
          .
          <article-title>The graph neural network model</article-title>
          .
          <source>IEEE transactions on neural networks / a publication of the IEEE Neural Networks Council</source>
          ,
          <volume>20</volume>
          (
          <issue>1</issue>
          ):
          <fpage>61</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>January 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Gustav</surname>
            <given-names>Sourek</given-names>
          </string-name>
          , Vojtech Aschenbrenner, Filip Zelezny, and
          <string-name>
            <given-names>Ondrej</given-names>
            <surname>Kuzelka</surname>
          </string-name>
          .
          <source>Lifted Relational Neural Networks. arXiv preprint</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Geofrey</surname>
            <given-names>G Towell</given-names>
          </string-name>
          , Jude W Shavlik, and Michiel O Noordewier.
          <article-title>Refinement of approximate domain theories by knowledge-based neural networks</article-title>
          .
          <source>In Proceedings of the eighth National conference on Artificial intelligence</source>
          , pages
          <fpage>861</fpage>
          -
          <lpage>866</lpage>
          . Boston, MA,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>