<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Relational Knowledge Extraction from Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manoel Vitor Macedo França</string-name>
          <email>manoel.franca@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur S. d'Avila Garcez</string-name>
          <email>a.garcez@city.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerson Zaverucha</string-name>
          <email>gerson@cos.ufrj.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, City University London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
          <addr-line>EC1V 0HB</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Prog. de Eng. de Sistemas e Computação, Universidade Federal do Rio de Janeiro</institution>
          ,
          <addr-line>Rio de Janeiro, Brazil 21941-972</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The effective integration of learning and reasoning is a well-known and challenging area of research within artificial intelligence. Neural-symbolic systems seek to integrate learning and reasoning by combining neural networks and symbolic knowledge representation. In this paper, a novel methodology is proposed for the extraction of relational knowledge from neural networks which are trainable by the efficient application of the backpropagation learning algorithm. First-order logic rules are extracted from the neural networks, offering interpretable symbolic relational models on which logical reasoning can be performed. The well-known knowledge extraction algorithm TREPAN was adapted and incorporated into the first-order version of the neural-symbolic system CILP++. Empirical results obtained in comparison with a probabilistic model for relational learning, Markov Logic Networks, and a state-of-the-art Inductive Logic Programming system, Aleph, indicate that the proposed methodology achieves competitive accuracy results consistently in all datasets investigated, while either Markov Logic Networks or Aleph show considerably worse results in at least one dataset. It is expected that effective knowledge extraction from neural networks can contribute to the integration of heterogeneous knowledge representations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Integrating learning and reasoning efficiently and accurately has a long track record of research and
publications in artificial intelligence [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. This integration can be done at different stages of learning,
from data pre-processing, feature extraction, the learning algorithm, up to reasoning about learning.
Neural-symbolic systems seek to integrate learning and reasoning by combining neural networks
and symbolic knowledge representations using, e.g., propositional logic or first-order logic.
Relational learning can be described as the process of learning a first-order logic theory from
examples and domain knowledge [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Unlike propositional learning, relational learning
does not use a set of attributes and values. Instead, it is based on objects and relations among
objects, which are represented by constants and predicates, respectively. Relational learning has had
applications in bioinformatics, graph mining and link analysis [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ].
      </p>
      <p>
        Inductive Logic Programming (ILP) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] performs relational learning either directly by
manipulating first-order rules or through a process called propositionalization [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ], which brings the
relational task down to the propositional level by representing subsets of relations as features that
can be used as attributes. In comparison with direct ILP, propositionalization normally exchanges
accuracy for efficiency [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], as it enables the use of fast attribute-value learners [
        <xref ref-type="bibr" rid="ref10 ref13 ref9">13, 10, 9</xref>
        ], although
the translation of first-order rules into features can cause information loss [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Much work has been done combining relational learning tasks with propositional learners,
including decision trees or neural networks [
        <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18 ref19">15, 16, 17, 18, 19</xref>
        ]. In this paper, we are interested in the
less investigated inverse problem: how can first-order logic descriptions be extracted from propositional
learners, in particular neural networks, trained to solve relational learning tasks?
We extend the well-known CILP neural-symbolic system [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to allow the extraction of meaningful
first-order logic rules from trained neural networks. Propositionalization and subsequent
attribute-value learning can destroy the original relational structure of the task at hand, so much so that the
provision of interpretable relational knowledge following learning is made impossible [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In this
paper, we show that by adapting the first-order version of the CILP system, called CILP++ [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], so as
to enable the application of a variation of the well-known TREPAN knowledge extraction algorithm
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a revised set of first-order rules can be extracted from trained neural networks efficiently and
accurately, enabling first-order logical reasoning about what has been learned by the network. The
result is a neural network, trained using an efficient backpropagation learning algorithm, and capable
of receiving “first-order” examples as input and producing first-order rules as output. The ability
to perform reasoning directly opens a number of research and application possibilities integrating
reasoning and learning [
        <xref ref-type="bibr" rid="ref21 ref22 ref7 ref8">21, 7, 8, 22</xref>
        ].
      </p>
      <p>
        We have compared relational knowledge extraction in CILP++ with the state-of-the-art ILP system
Aleph [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and Markov Logic Networks (MLN’s) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] on the Mutagenesis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], UW-CSE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
Alzheimer-amine [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and Cora [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] datasets. Results indicate that the relational theories extracted
from CILP++ have high fidelity to the trained neural network models, and that the use of neural
networks can provide considerable speed-ups while achieving comparable accuracy and area under
ROC curve results.
      </p>
      <p>
        The choice of MLN’s and Aleph for the empirical comparisons is due to the nature of their
methodologies for tackling relational learning, which are distinctly different: MLN’s take a probabilistic
approach to the relational learning problem, attempting to find a distribution model that fits the
ground atoms of a hypothesis interpretation as well as possible, while Aleph performs relational
learning by searching the Herbrand space [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of possible literals for a given dataset.
The remainder of the paper is organized as follows: Section 2 introduces CILP++, the neural-symbolic system
that implements the approach proposed in this paper for relational knowledge extraction from trained neural
networks. Section 3 presents the experimental results obtained with CILP++ on the Mutagenesis,
UW-CSE, Alzheimer-amine and Cora datasets, comparing against MLN’s and Aleph. Lastly, Section 4
discusses the outcomes of the experiments and gives an overview of other systems
that are closely related to the work presented in this paper.
      </p>
      <p>CILP++ is available from Sourceforge (https://sourceforge.net/projects/cilppp/) and experimental settings will be made available for download.</p>
    </sec>
    <sec id="sec-2">
      <title>Relational Learning with Neural Networks</title>
      <p>This section introduces the proposed system for relational knowledge extraction from trained neural
networks. It presents each module of the CILP++ system and how each can be adapted
to allow direct first-order knowledge extraction and inference from the trained models.</p>
      <sec id="sec-2-1">
        <title>Bottom Clause Propositionalization</title>
        <p>Relational learning with CILP++ starts by applying bottom clause propositionalization (BCP) to
the set of first-order examples. Each first-order example, in the form of an instantiated target clause, e.g.
target(a<sub>1</sub>, …, a<sub>n</sub>), is converted into a numerical vector that a neural network can use as input. To
achieve this, each example is transformed into a bottom clause and mapped onto features
in an attribute-value table, and a numerical vector is generated for each example. Thus, BCP has
three steps: bottom clause generation, feature generation and attribute-value mapping.
Before describing each BCP step, we present three first-order concepts which are used in this
work: clause, relational domain and bottom clause.
– A clause is a definition of relations between facts, with structure</p>
        <p>
          <disp-formula><tex-math>p_t(V_1^t, \ldots, V_n^t) \;\text{:-}\; p_1(V_1^1, \ldots, V_n^1),\; p_2(V_1^2, \ldots, V_n^2),\; \ldots,\; p_m(V_1^m, \ldots, V_n^m),</tex-math></disp-formula>
where {p<sub>i</sub>, 1 ≤ i ≤ m} ∪ {p<sub>t</sub>} is a set of predicates (relations), V<sub>j</sub> is a set of variables,
<inline-formula><tex-math>p_1(V_1^1, \ldots, V_n^1),\; p_2(V_1^2, \ldots, V_n^2),\; \ldots,\; p_m(V_1^m, \ldots, V_n^m)</tex-math></inline-formula> are literals and :- represents implication.
Literals on the left-hand side of the consequence operator are known as head literals and literals on the
right-hand side are known as body literals.
– A relational domain is a tuple ⟨E, BK⟩, where: E is a set of ground atoms of target concept(s)
(i.e. first-order logic examples), in which labels are truth-values; and BK is a set of clauses known
as background knowledge, which can be facts (grounded single-literal clauses that define what is
known about a given task) or clauses, as defined above.
– A bottom clause is a boundary in the hypothesis search space during ILP learning [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and is built
from one random positive example, background knowledge and language bias (a set of clauses that
define how clauses can be built in an ILP model). A bottom clause is the most specific clause (with
most literals) that can be considered a candidate hypothesis.
        </p>
        <p>
          Having introduced clauses and relational domains, we are in a position to describe BCP. In the first step
of BCP, bottom clause generation, each example e<sub>i</sub> from a first-order example set E is given to the
bottom clause generation algorithm [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] to create a corresponding bottom clause set E<sub>⊥</sub>, containing
one bottom clause ⊥<sub>i</sub> for each example e<sub>i</sub>. To do so, a slight modification is needed to allow the
same hash function to be shared among all examples, in order to keep variable
associations consistent, and to allow negative examples to have bottom clauses as well; the original algorithm
deals with positive examples only. An extensive description of the algorithm is provided in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
In order to illustrate each BCP step, we introduce a small family relationship relational domain [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ],
with background knowledge
        </p>
        <p>BK = {mother(mom1, daughter1), wife(daughter1, husband1), wife(daughter2, husband2)},
with one positive example, motherInLaw(mom1, husband1), and one negative example,
motherInLaw(daughter1, husband2). Notice that the relation between mom1 and
husband1, which the positive example establishes, can alternatively be described by the sequence of
facts mother(mom1, daughter1) and wife(daughter1, husband1) in the background knowledge. This
states semantically that mom1 is a mother-in-law because mom1 has a married daughter, namely
daughter1. Applied to this example, the bottom clause generation algorithm would create the clause
⊥<sub>i</sub> = motherInLaw(A, B) :- mother(A, C), wife(C, B). Comparing ⊥<sub>i</sub> with the sequence of facts
above, we notice that ⊥<sub>i</sub> describes one possible meaning of mother-in-law: “A is a mother-in-law of
B if A is a mother of C and C is wife of B”, i.e. the mother of a married daughter is a mother-in-law.
To generate features from bottom clauses, BCP generates one bottom clause for each (positive or
negative) example e, which we denote as ⊥<sub>e</sub>. At the end of the first step of BCP, we end up with a
bottom clause set containing the bottom clauses of both positive and negative examples:</p>
        <p>E<sub>⊥</sub> = {motherInLaw(A, B) :- mother(A, C), wife(C, B);
motherInLaw(A, B) :- wife(A, C)}.</p>
        <p>
In the second step, feature generation, a feature table F is generated from E<sub>⊥</sub>. Earlier versions of
CILP++ used bottom clause literals directly as features, but this approach can lead to inconsistencies
if knowledge is to be extracted from models which used such features [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. In order to tackle this,
an adapted version of the first-order feature generation algorithm presented in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] has been used to
generate independent propositional features which represent first-order descriptions.
To illustrate the second step of BCP, consider the following bottom clause R<sub>⊥</sub>:
        </p>
        <p>
          motherInLaw(A, B) :- parent(A, C), wife(C, B), wife(A, D), brother(B, D).
Semi-propositionalization [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is used to generate a set of first-order features for R<sub>⊥</sub>. First-order
features are sets of literals that share variables that do not appear in any head literal. Those variables
are known as local variables. From the family relationship example, the following features are
obtained:
        </p>
        <p>F<sub>1</sub> = {parent(A, C), wife(C, B)}</p>
        <p>F<sub>2</sub> = {wife(A, D), brother(B, D)}</p>
        <p>BCP treats each decomposition as a feature and, in the example above, two clauses would be
generated from the decomposition of R<sub>⊥</sub>:</p>
        <p>L<sub>1</sub>(A, B) :- parent(A, C), wife(C, B)</p>
        <p>L<sub>2</sub>(A, B) :- wife(A, D), brother(B, D)</p>
        <p>Therefore, R<sub>⊥</sub> can be rewritten as the following semi-propositional rule R′<sub>⊥</sub>:</p>
        <p>motherInLaw(A, B) :- L<sub>1</sub>(A, B), L<sub>2</sub>(A, B)</p>
        <p>If the only example to be propositionalized by BCP is r, the feature table F would, at the end, contain
only two elements: L<sub>1</sub>(A, B) and L<sub>2</sub>(A, B).</p>
        <p>Lastly, in the third step of BCP, the feature table F is applied to E in order to generate binary
vectors that a neural network can process. The algorithm, implemented in CILP++, is as follows:
1. Let |F| be the number of elements in F;
2. Let E<sub>v</sub> be the set of binary vectors, converted from E, initially empty;
3. For each example e<sub>i</sub> ∈ E do
(a) For each feature f<sub>j</sub> ∈ F do
i. Query e<sub>i</sub> against the corresponding first-order description L<sub>j</sub> of f<sub>j</sub>, using the
relational domain background knowledge BK;
ii. If the query succeeds, assign 1 to position j of the binary vector v<sub>i</sub>; if not, assign 0
instead;
(b) Associate the label 1 with v<sub>i</sub> if e<sub>i</sub> is a positive example, and −1 otherwise;
(c) Add v<sub>i</sub> to E<sub>v</sub>;
4. Return E<sub>v</sub>.</p>
        <p>
          Continuing the family relationship example: |F| is equal to 2, since there are only two features in the
table: L<sub>1</sub>(A, B) and L<sub>2</sub>(A, B). Since r contains both features in its bottom clause, v<sub>r</sub> = (1, 1). See [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
for more details.
        </p>
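        <p>The attribute-value mapping step of BCP can be illustrated with a minimal Python sketch. It is not the CILP++ implementation: the function names (is_variable, satisfiable, bcp_vectors), the tuple-based encoding of literals and the two example features are assumptions made for illustration only, and a naive backtracking matcher over ground facts stands in for a Prolog query. The background knowledge and the two examples are those of the family relationship domain.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def is_variable(term):
    # Prolog convention: variables start with an uppercase letter.
    return term[0].isupper()


def satisfiable(body, facts, binding):
    """Backtracking check: can every body literal be matched to a ground fact?"""
    if not body:
        return True
    (predicate, args), rest = body[0], body[1:]
    for fact_predicate, fact_args in facts:
        if fact_predicate != predicate or len(fact_args) != len(args):
            continue
        candidate = dict(binding)
        consistent = True
        for arg, value in zip(args, fact_args):
            if is_variable(arg):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(arg, value) != value:
                    consistent = False
                    break
            elif arg != value:
                consistent = False
                break
        if consistent and satisfiable(rest, facts, candidate):
            return True
    return False


def bcp_vectors(examples, features, facts):
    """Map each (arguments, is_positive) example to a (binary vector, label) pair."""
    vectors = []
    for args, is_positive in examples:
        vector = []
        for head_vars, body in features:
            binding = dict(zip(head_vars, args))  # instantiate the head variables
            vector.append(1 if satisfiable(body, facts, binding) else 0)
        vectors.append((vector, 1 if is_positive else -1))
    return vectors


# Background knowledge of the family relationship example, as ground facts.
BK = [
    ("mother", ("mom1", "daughter1")),
    ("wife", ("daughter1", "husband1")),
    ("wife", ("daughter2", "husband2")),
]
# Two illustrative features with head variables (A, B); hypothetical, not from CILP++.
FEATURES = [
    (("A", "B"), [("mother", ("A", "C")), ("wife", ("C", "B"))]),
    (("A", "B"), [("wife", ("A", "C"))]),
]
# motherInLaw(mom1, husband1) is positive, motherInLaw(daughter1, husband2) is negative.
EXAMPLES = [(("mom1", "husband1"), True), (("daughter1", "husband2"), False)]

print(bcp_vectors(EXAMPLES, FEATURES, BK))
```
]]></preformat>
        <p>In this sketch the positive example satisfies only the first feature and the negative example only the second, so the two examples map to different binary vectors, with labels 1 and −1 respectively.</p>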
      </sec>
      <sec id="sec-2-2">
        <title>Neural Network Learning and Relational Knowledge Extraction</title>
        <p>
          CILP++ uses resilient backpropagation [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], with early stopping [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] for learning. Resilient
backpropagation takes into account only the sign of the partial derivative over all training examples (not
its magnitude), and acts independently on each weight. For each weight, if the sign
of the partial derivative of the total error function has changed compared to the last iteration, the update value for
that weight is multiplied by a factor η<sup>−</sup>, where η<sup>−</sup> &lt; 1. If the last iteration produced the same sign,
the update value is multiplied by a factor η<sup>+</sup>, where η<sup>+</sup> &gt; 1. The update values are calculated
for each weight in the above manner, and finally each weight is changed by its own update value, in
the opposite direction of that weight’s partial derivative, so as to minimize the total error function.
We set η<sup>+</sup> and η<sup>−</sup> through validation.
        </p>
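        <p>The update rule described above can be sketched in a few lines of Python. This is a simplified illustration, assuming the variant of resilient backpropagation without weight backtracking; the function names and the default values (eta_plus = 1.2, eta_minus = 0.5, and the bounds on the update value) are assumptions taken from common practice, not the values selected through validation in this work.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One Rprop update; returns the new weights and the new per-weight update values."""
    new_weights, new_steps = [], []
    for w, g, g_prev, step in zip(weights, grads, prev_grads, steps):
        if g * g_prev > 0:      # same sign as in the last iteration: grow the update value
            step = min(step * eta_plus, step_max)
        elif g * g_prev < 0:    # sign change: shrink the update value
            step = max(step * eta_minus, step_min)
        # Move against the sign of the derivative; its magnitude is ignored.
        new_weights.append(w - sign(g) * step)
        new_steps.append(step)
    return new_weights, new_steps


def minimise_quadratic(iterations=100):
    """Toy usage: minimise E(w) = (w - 3)^2, whose derivative is 2 * (w - 3)."""
    weights, steps, prev_grads = [0.0], [0.1], [0.0]
    for _ in range(iterations):
        grads = [2.0 * (w - 3.0) for w in weights]
        weights, steps = rprop_step(weights, grads, prev_grads, steps)
        prev_grads = grads
    return weights[0]


print(minimise_quadratic())
```
]]></preformat>
        <p>Because each weight keeps its own update value, a very large or very small derivative on one weight does not affect how far the other weights move.</p>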
        <p>
          With early stopping, when the validation error measure starts to increase, training is stopped. We
have used a more permissive version of early stopping [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which does not halt training immediately
after the validation error increases. It stops when a combined measure of both the number of consecutive
epochs with increasing validation set error and the absolute value of the current validation set error reaches
a certain threshold.
        </p>
        <p>
          Following network training, in order to perform relational knowledge extraction, an adapted version
of the TREPAN rule extractor [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] is applied to the trained neural network. TREPAN is originally an
m-of-n propositional tree inducer which uses a learned neural network as an oracle: from a set of
examples S, possibly distinct from the example set used for training the neural network, a decision
tree is recursively built, based on an information gain-based heuristic. We adapted TREPAN in order
to allow the generation and querying of first-order rules in Prolog [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], a well-known general-purpose
logic programming language. Several simplifications have also been made in order to improve efficiency and
readability. The adapted pseudo-algorithm for TREPAN can be seen in Algorithm 1, based on the
original TREPAN algorithm [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Changes from the original are highlighted with an underline.
The adapted version of TREPAN presented in Algorithm 1 has the following differences when
compared to the original TREPAN:
Line 7: Tree generation has been simplified; only the maximum size criterion is used for
stopping the process.
        </p>
        <p>Line 14: The search heuristic for the best m-of-n split is now weighted by the size of m: m/n is
subtracted from the original heuristic value of a given split.</p>
        <p>
          Lines 26-32: The m-of-n tree is transformed into a set of (possibly) disjunctive rules, in
order to allow first-order inference with logic programming languages such as Prolog.
After extracting rules from the trained network (after obtaining TH), the definitions of the
semi-propositional first-order features (L<sub>i</sub> clauses) obtained during BCP are added to TH, resulting in a
first-order theory that can be used to perform first-order inference. In the following, the well-known
east-west trains ILP dataset [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] is used in order to demonstrate how CILP++ performs relational
learning, relational knowledge extraction and reasoning.
        </p>
        <p>In the first step of CILP++ (propositionalization with BCP), 20 bottom clauses were generated from
the 10 positive and 10 negative examples of eastbound and westbound trains. From those bottom
clauses, 41 features were obtained by using semi-propositionalization. Therefore, 41 input neurons
will be created in CILP++’s initial neural network structure, each one representing one feature. A
small sample of bottom clauses generated, the features generated with BCP and the resulting initial
neural network structure are presented in Figure 1.</p>
        <p>Bottom clauses (sample):</p>
        <p>eastbound(A) :- has_car(A,B), long(B), wheels(B,2).</p>
        <p>eastbound(A) :- has_car(A,B), has_car(A,C), long(C), double(B).</p>
        <p>eastbound(A) :- has_car(A,B), has_car(A,C), long(B), shape(C,u_shaped).</p>
        <p>Features:</p>
        <p>F1(A) :- has_car(A,B), long(B), wheels(B,2).</p>
        <p>F2(A) :- has_car(A,B), double(B).</p>
        <p>F3(A) :- has_car(A,B), long(B).</p>
        <p>F4(A) :- has_car(A,C), shape(C, u_shaped).</p>
        <p>Neural network: input neurons F1, F2, F3, F4, ...; output neuron eastbound.</p>
        <p>
After neural network training, the adapted TREPAN rule extractor algorithm (Algorithm 1) is used
to generate first-order rules from the network. Leave-one-out cross-validation was used, i.e., 20
folds have been generated from the 20 first-order examples. Figure 2 shows the resulting first-order
theory. The first part of the generated theory is the extracted TREPAN rules, whilst the second part
is the added semi-propositional clauses generated by BCP.</p>
        <p>Generated theory:</p>
        <p>eastbound(A) :- F1(A).</p>
        <p>eastbound(A) :- F2(A).</p>
        <p>Semi-propositionalization clauses:</p>
        <p>F1(A) :- has_car(A,B), short(B), wheels(B,1).</p>
        <p>F2(A) :- has_car(A,B), short(B), closed(B), wheels(B, 2), jagged(B).</p>
        <p>
Fidelity: 95% between the trained neural network and the extracted rules.
Fidelity is defined as the percentage of examples classified in the same way by both models.
It does not matter whether the result is a hit or a miss for a given example e: as long as both the
neural network and the rules classify e identically, it counts as a hit towards fidelity.
        </p>
        <p><sup>1</sup> Compare with the rules extracted by LINUS in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]; our method seems to produce more readable rules.
        </p>
        <p>
CILP++ has been tested empirically, and results over ten-fold cross-validation for the trained neural
network can be seen in Table 1. CILP++ was compared against a well-known ILP system, Aleph
[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and Markov Logic Networks (MLN’s) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Four relational domains have been used:
Mutagenesis [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], UW-CSE [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Alzheimer-amine [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and Cora [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The same parameters as [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
have been used for training CILP++. For Aleph, the settings suggested in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] have been used.
For MLN’s, the reported results on three publications [
          <xref ref-type="bibr" rid="ref15 ref32 ref33">15, 32, 33</xref>
          ] have been collected. Lastly, for
TREPAN, tree_size_limit has been set to 5. All experiments were run on a 3.2 GHz Intel Core i3-2100
with 4 GB RAM.
        </p>
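        <p>The fidelity measure defined earlier in this section is simple to compute once both models have classified the same examples. The following Python sketch is illustrative only: the function name fidelity and the list-based representation of predictions are assumptions, and the toy labels are made up to show the arithmetic rather than taken from the experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def fidelity(network_predictions, rule_predictions):
    """Percentage of examples that both models classify identically.

    Whether a shared prediction is correct is irrelevant: only agreement counts.
    """
    if len(network_predictions) != len(rule_predictions):
        raise ValueError("both models must classify the same examples")
    if not network_predictions:
        raise ValueError("at least one example is required")
    agreements = sum(
        1 for net, rule in zip(network_predictions, rule_predictions) if net == rule
    )
    return 100.0 * agreements / len(network_predictions)


# Toy data: 20 examples on which the two models disagree exactly once.
network = [1] * 10 + [-1] * 10
rules = [1] * 10 + [-1] * 9 + [1]
print(fidelity(network, rules))
```
]]></preformat>
        <p>With 19 agreements out of 20 examples, the sketch yields a fidelity of 95%.</p>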
        <p>
          Results show that CILP++ has comparable accuracy and AUC measurements with both Aleph and
MLN’s, while having considerably better runtimes. While CILP++ was able to run and generate
competitive results on all tested datasets, Aleph ran out of memory while running Cora. Standard
MLN’s performed very poorly on Alzheimer-amine [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] and had higher training times.
Results show that accuracy competitive with Aleph has been maintained after extraction, and
good fidelity measures have been obtained in comparison with the trained neural network. This
indicates that CILP++ neural networks are capable of efficiently solving relational tasks with BCP.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Concluding Remarks</title>
      <p>
        In this paper, we have presented an integrated and efficient method and system for the extraction
of first-order logic rules from neural networks. Experimental results show that the first-order rules
extracted from trained neural networks are comparable, in terms of accuracy and AUC, with a
well-known probabilistic system for relational learning, MLN’s, and a search-based ILP system, Aleph,
while being obtained considerably faster. These results indicate the promise of CILP++ as a relational learner.
Further comparisons with related work include the analysis of propositionalization-based systems
such as LINUS/DINUS/SINUS [
        <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
        ] and RelF [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], which rely on the quality of their feature
generation to reduce the information loss of the propositionalization approach and, consequently,
of the rules extracted from the learner. Both the LINUS/DINUS/SINUS family of ILP systems
and RelF generate a number of constrained first-order features f from the Herbrand base H (H is the
set of possible clauses for a given domain knowledge). From the collection of features f , a final set
of features F is obtained for representing the training examples, according to a given score function.
CILP++, on the other hand, uses the concept of bottom clause, which is a clause that uniquely
describes a single relational example. CILP++ uses bottom clauses to train a neural network, and an
algorithm based on the concept of semi-propositionalization [
        <xref ref-type="bibr" rid="ref16 ref27">16, 27</xref>
        ] to generate F.
Approaches based on Bayesian networks [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] also perform relational learning, but represent the
learned knowledge without the use of explicit relational rules. Statistical Relational Models [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]
contain a rich representation language which combines a frame-based logical representation with
probabilistic semantics based on Bayesian networks. BLOG [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] is a first-order probabilistic
modeling language that specifies probability distributions over possible worlds with varying sets of objects.
A BLOG model contains statements that define conditional probability distributions for a certain set
of random variables; the model also specifies certain context-specific independence properties.
Inference is done on BLOG using Markov Chain Monte Carlo [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] algorithms. In CILP++, inference
is deterministic: the first-order rules learned from the neural (statistical) model can be loaded
directly into a Prolog theorem prover for reasoning.
      </p>
      <p>As future work, a study of how CILP++ deals with noisy datasets (noise in the background
knowledge and/or examples) could provide interesting results, given how backpropagation naturally deals
with incomplete data and noisy inputs. Also, an investigation of how CILP++ can be adapted to deal
directly with numeric data could overcome a well-known limitation of ILP systems, namely their inability
to deal directly with numbers. ILP systems use auxiliary predicates to indicate relations between
numeric variables such as greater-than, less-than and so on.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hölldobler</surname>
          </string-name>
          and Y. Kalinke, “
          <article-title>Towards a massively parallel computational model for logic programming</article-title>
          ,
          <source>” in Proceedings of the ECAI94 Workshop on Combining Symbolic and Connectionist Processing</source>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. G.</given-names>
            <surname>Towell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          , “
          <source>Knowledge-Based Artificial Neural Networks,” Artif. Intell.</source>
          , vol.
          <volume>70</volume>
          , no.
          <issue>1-2</issue>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>165</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. S. D.</given-names>
            <surname>Garcez</surname>
          </string-name>
          and G. Zaverucha, “
          <source>The Connectionist Inductive Learning and Logic Programming System,” Applied Intelligence</source>
          , vol.
          <volume>11</volume>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Basilio</surname>
          </string-name>
          , G. Zaverucha, and
          <string-name>
            <given-names>V.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          , “
          <article-title>Learning Logic Programs with Neural Networks,” in Inductive Logic Programming</article-title>
          , vol.
          <volume>2157</volume>
          of LNAI, pp.
          <fpage>15</fpage>
          -
          <lpage>26</lpage>
          , Springer Berlin / Heidelberg,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          and N. Lavrač,
          <source>Relational Data Mining</source>
          , Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <source>Logical and Relational Learning</source>
          . Cognitive Technologies, Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          , “
          <article-title>Mutagenesis: ILP experiments in a non-determinate biological domain</article-title>
          ,” in
          <source>Proceedings of the 4th International Workshop on Inductive Logic Programming</source>
          , vol.
          <volume>237</volume>
          of GMD-Studien
          , pp.
          <fpage>217</fpage>
          -
          <lpage>232</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Burnside</surname>
          </string-name>
          , I. de Castro Dutra,
          <string-name>
            <given-names>D.</given-names>
            <surname>Page</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Costa</surname>
          </string-name>
          , “
          <article-title>An integrated approach to learning Bayesian networks of rules</article-title>
          ,” in
          <source>ECML</source>
          (J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, eds.), vol.
          <volume>3720</volume>
          of Lecture Notes in Computer Science, pp.
          <fpage>84</fpage>
          -
          <lpage>95</lpage>
          , Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrač</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Džeroski</surname>
          </string-name>
          ,
          <source>Inductive logic programming: techniques and applications</source>
          . Ellis Horwood series in artificial intelligence, E. Horwood,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Železný</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrač</surname>
          </string-name>
          , “
          <article-title>Propositionalization-based Relational Subgroup Discovery With RSD</article-title>
          ,”
          <source>Machine Learning</source>
          , vol.
          <volume>62</volume>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>63</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kramer</surname>
          </string-name>
          , N. Lavrač, and P. Flach, “
          <source>Relational Data Mining</source>
          ,” ch.
          <article-title>Propositionalization approaches to relational data mining</article-title>
          , pp.
          <fpage>262</fpage>
          -
          <lpage>286</lpage>
          , New York, NY, USA: Springer-Verlag New York, Inc.,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Krogel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rawles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Železný</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Flach</surname>
          </string-name>
          , N. Lavrač, and S. Wrobel, “
          <article-title>Comparative Evaluation Of Approaches To Propositionalization</article-title>
          ,” in
          <source>ILP</source>
          , vol.
          <volume>2835</volume>
          of LNAI, pp.
          <fpage>194</fpage>
          -
          <lpage>217</lpage>
          , Springer-Verlag,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. V. M.</given-names>
            <surname>França</surname>
          </string-name>
          , G. Zaverucha, and
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Avila Garcez</surname>
          </string-name>
          , “
          <article-title>Fast relational learning using bottom clause propositionalization with artificial neural networks</article-title>
          ,”
          <source>Machine Learning</source>
          , vol.
          <volume>94</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>104</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. V. M.</given-names>
            <surname>França</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S. D.</given-names>
            <surname>Garcez</surname>
          </string-name>
          , and G. Zaverucha, “
          <article-title>Relational Knowledge Extraction from Attribute-Value Learners</article-title>
          ,” in
          <source>2013 Imperial College Computing Student Workshop</source>
          , vol.
          <volume>35</volume>
          of OpenAccess Series in Informatics (OASIcs), (Dagstuhl, Germany), pp.
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
          , Schloss Dagstuhl,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Richardson</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          , “
          <article-title>Markov logic networks</article-title>
          ,”
          <source>Machine Learning</source>
          , vol.
          <volume>62</volume>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lavrač</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Flach</surname>
          </string-name>
          , “
          <article-title>An extended transformation approach to inductive logic programming</article-title>
          ,”
          <source>ACM Trans. Comput. Logic</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>458</fpage>
          -
          <lpage>494</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kijsirikul</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Lerdlamnaochai</surname>
          </string-name>
          , “
          <article-title>First-Order Logical Neural Networks</article-title>
          ,”
          <source>Int. J. Hybrid Intell. Syst.</source>
          , vol.
          <volume>2</volume>
          , pp.
          <fpage>253</fpage>
          -
          <lpage>267</lpage>
          , Dec.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paes</surname>
          </string-name>
          , F. Železný, G. Zaverucha,
          <string-name>
            <given-names>D.</given-names>
            <surname>Page</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          , “
          <article-title>ILP Through Propositionalization and Stochastic k-Term DNF Learning</article-title>
          ,” in
          <source>ILP</source>
          , vol.
          <volume>4455</volume>
          of LNAI, (Berlin, Heidelberg), pp.
          <fpage>379</fpage>
          -
          <lpage>393</lpage>
          , Springer-Verlag,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Basilio</surname>
          </string-name>
          , G. Zaverucha, and
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Garcez</surname>
          </string-name>
          , “
          <article-title>Inducing Relational Concepts with Neural Networks via the LINUS System</article-title>
          ,” in
          <source>ICONIP</source>
          , pp.
          <fpage>1507</fpage>
          -
          <lpage>1510</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Craven</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          , “
          <article-title>Extracting Tree-Structured Representations of Trained Networks</article-title>
          ,” in NIPS, pp.
          <fpage>24</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>King</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          , “
          <article-title>Relating chemical activity to structure: An examination of ILP successes</article-title>
          ,”
          <source>New Generation Computing</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>3-4</issue>
          , pp.
          <fpage>411</fpage>
          -
          <lpage>434</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          , “
          <article-title>Adaptive duplicate detection using learnable string similarity measures</article-title>
          ,” in
          <source>Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , KDD '03
          , (New York, NY, USA), pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          , ACM,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          , “
          <article-title>The Aleph System, version 5</article-title>
          .” http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html,
          <year>2007</year>
          . Last accessed May 2013.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          , “
          <article-title>Inverse Entailment and Progol</article-title>
          ,”
          <source>New Generation Computing, Special issue on Inductive Logic Programming</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>3-4</issue>
          , pp.
          <fpage>245</fpage>
          -
          <lpage>286</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamaddoni-Nezhad</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          , “
          <article-title>The lattice structure and refinement operators for the hypothesis space bounded by a bottom clause</article-title>
          ,”
          <source>Machine Learning</source>
          , vol.
          <volume>76</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Muggleton</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          , “
          <article-title>Inductive Logic Programming: Theory and Methods</article-title>
          ,”
          <source>Journal of Logic Programming</source>
          , vol.
          <volume>19</volume>
          , no.
          <issue>20</issue>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>679</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M. V. M.</given-names>
            <surname>França</surname>
          </string-name>
          , G. Zaverucha, and
          <string-name>
            <given-names>A. S. D.</given-names>
            <surname>Garcez</surname>
          </string-name>
          , “
          <article-title>Neural relational learning through semipropositionalization of bottom clauses</article-title>
          ,” in AAAI Spring Symposium Series,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          , “
          <article-title>Increased rates of convergence through learning rate adaptation</article-title>
          ,”
          <source>Neural Networks</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>295</fpage>
          -
          <lpage>307</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Prechelt</surname>
          </string-name>
          , “
          <article-title>Early stopping - but when?</article-title>
          ,” in
          <source>Neural Networks: Tricks of the Trade</source>
          , vol.
          <volume>1524</volume>
          of LNCS, chapter 2
          , pp.
          <fpage>55</fpage>
          -
          <lpage>69</lpage>
          , Springer-Verlag,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Kowalski</surname>
          </string-name>
          , “
          <article-title>The early years of logic programming</article-title>
          ,”
          <source>Commun. ACM</source>
          , vol.
          <volume>31</volume>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>43</lpage>
          , Jan.
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          and R. S. Michalski, “
          <article-title>Inductive inference of VL decision rules</article-title>
          ,”
          <source>SIGART Bull.</source>
          , no.
          <issue>63</issue>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>44</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Huynh</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          , “
          <article-title>Discriminative structure and parameter learning for Markov logic networks</article-title>
          ,” in
          <source>Proceedings of the 25th International Conference on Machine Learning</source>
          , ICML '08, (New York, NY, USA), pp.
          <fpage>416</fpage>
          -
          <lpage>423</lpage>
          , ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kok</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          , “
          <article-title>Learning the structure of Markov logic networks</article-title>
          ,” in
          <source>Proceedings of the 22nd International Conference on Machine Learning</source>
          , (New York, NY, USA), pp.
          <fpage>441</fpage>
          -
          <lpage>448</lpage>
          , ACM,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuželka</surname>
          </string-name>
          and F. Železný, “
          <article-title>Block-wise construction of tree-like relational features with monotone reducibility and redundancy</article-title>
          ,”
          <source>Machine Learning</source>
          , vol.
          <volume>83</volume>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pearl</surname>
          </string-name>
          ,
          <source>Causality: Models, Reasoning, and Inference</source>
          . Cambridge University Press,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>N.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pfeffer</surname>
          </string-name>
          , “
          <article-title>Learning probabilistic relational models</article-title>
          ,” in
          <source>IJCAI</source>
          , pp.
          <fpage>1300</fpage>
          -
          <lpage>1309</lpage>
          , Springer-Verlag,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>B.</given-names>
            <surname>Milch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Marthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Ong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolobov</surname>
          </string-name>
          , “
          <article-title>BLOG: probabilistic models with unknown objects</article-title>
          ,” in
          <source>IJCAI</source>
          , (San Francisco, CA, USA), pp.
          <fpage>1352</fpage>
          -
          <lpage>1359</lpage>
          , Morgan Kaufmann Publishers Inc.,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Gelfand</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. F. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          , “
          <article-title>Sampling-based approaches to calculating marginal densities</article-title>
          ,”
          <source>Journal of the American Statistical Association</source>
          , vol.
          <volume>85</volume>
          , no.
          <issue>410</issue>
          , pp.
          <fpage>398</fpage>
          -
          <lpage>409</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>