Relational Knowledge Extraction from Neural Networks

Manoel Vitor Macedo França, Department of Computer Science, City University London, London, United Kingdom EC1V 0HB, manoel.franca@city.ac.uk
Artur S. d'Avila Garcez, Department of Computer Science, City University London, London, United Kingdom EC1V 0HB, a.garcez@city.ac.uk
Gerson Zaverucha, Prog. de Eng. de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil 21941-972, gerson@cos.ufrj.br

Abstract

The effective integration of learning and reasoning is a well-known and challenging area of research within artificial intelligence. Neural-symbolic systems seek to integrate learning and reasoning by combining neural networks and symbolic knowledge representation. In this paper, a novel methodology is proposed for the extraction of relational knowledge from neural networks which are trainable by the efficient application of the backpropagation learning algorithm. First-order logic rules are extracted from the neural networks, offering interpretable symbolic relational models on which logical reasoning can be performed. The well-known knowledge extraction algorithm TREPAN was adapted and incorporated into the first-order version of the neural-symbolic system CILP++. Empirical results obtained in comparison with a probabilistic model for relational learning, Markov Logic Networks, and a state-of-the-art Inductive Logic Programming system, Aleph, indicate that the proposed methodology achieves competitive accuracy results consistently across all datasets investigated, while either Markov Logic Networks or Aleph show considerably worse results in at least one dataset. It is expected that effective knowledge extraction from neural networks can contribute to the integration of heterogeneous knowledge representations.

1 Introduction

Integrating learning and reasoning efficiently and accurately has a vast track record of research and publications in artificial intelligence [1, 2, 3, 4]. This integration can be done at different stages of learning, from data pre-processing and feature extraction, through the learning algorithm itself, up to reasoning about what has been learned. Neural-symbolic systems seek to integrate learning and reasoning by combining neural networks and symbolic knowledge representations using, e.g., propositional logic or first-order logic.

Relational learning can be described as the process of learning a first-order logic theory from examples and domain knowledge [5, 6]. Differently from propositional learning, relational learning does not use a set of attributes and values. Instead, it is based on objects and relations among objects, which are represented by constants and predicates, respectively. Relational learning has had applications in bioinformatics, graph mining and link analysis [7, 8].

Inductive Logic Programming (ILP) [5] performs relational learning either directly, by manipulating first-order rules, or through a process called propositionalization [9, 10, 11], which brings the relational task down to the propositional level by representing subsets of relations as features that can be used as attributes. In comparison with direct ILP, propositionalization normally exchanges accuracy for efficiency [12], as it enables the use of fast attribute-value learners [13, 10, 9], although the translation of first-order rules into features can cause information loss [14].
Much work has been done combining relational learning tasks with propositional learners, including decision trees and neural networks [15, 16, 17, 18, 19]. In this paper, we are interested in the less investigated inverse problem: how can first-order logic descriptions be extracted from propositional learners, in particular neural networks, trained to solve relational learning tasks?

We extend the well-known CILP neural-symbolic system [3] to allow the extraction of meaningful first-order logic rules from trained neural networks. Propositionalization and subsequent attribute-value learning can destroy the original relational structure of the task at hand, so much so that the provision of interpretable relational knowledge following learning is made impossible [14]. In this paper, we show that by adapting the first-order version of the CILP system, called CILP++ [13], so as to enable the application of a variation of the well-known TREPAN knowledge extraction algorithm [20], a revised set of first-order rules can be extracted from trained neural networks efficiently and accurately, enabling first-order logical reasoning about what has been learned by the network. The result is a neural network, trained using an efficient backpropagation learning algorithm, capable of receiving "first-order" examples as input and producing first-order rules as output. The ability to perform reasoning directly opens a number of research and application possibilities integrating reasoning and learning [21, 7, 8, 22].

We have compared relational knowledge extraction in CILP++ with the state-of-the-art ILP system Aleph [23] and with Markov Logic Networks (MLNs) [15] on the Mutagenesis [7], UW-CSE [8], Alzheimer-amine [21] and Cora [22] datasets. Results indicate that the relational theories extracted from CILP++ have high fidelity to the trained neural network models, and that the use of neural networks can provide considerable speed-ups while achieving comparable accuracy and area under the ROC curve results.

The choice of MLNs and Aleph for the empirical comparisons is due to the nature of their methodologies for tackling relational learning, which are distinctly different: MLNs take a probabilistic approach to the relational learning problem, attempting to find a distribution model that fits the ground atoms of a hypothesis interpretation as well as possible, while Aleph performs relational learning by searching the Herbrand space [5] of possible literals for a given dataset.

The remainder of the paper is organized as follows: Section 2 introduces CILP++, the neural-symbolic system that uses the approach proposed in this paper for relational knowledge extraction from trained neural networks. Section 3 presents experimental results obtained with CILP++ on the Mutagenesis, UW-CSE, Alzheimer-amine and Cora datasets, comparing against MLNs and Aleph. Lastly, Section 4 discusses the outcomes of the experiments and gives an overview of other systems closely related to the work presented in this paper. CILP++ is available from Sourceforge (https://sourceforge.net/projects/cilppp/) and the experimental settings will be made available for download.

2 Relational Learning with Neural Networks

This section introduces the proposed system for relational knowledge extraction from trained neural networks. It starts by presenting each module of the CILP++ system and how it can be adapted to allow direct first-order knowledge extraction and inference from the trained models.
2.1 Bottom Clause Propositionalization

Relational learning with CILP++ starts by applying bottom clause propositionalization (BCP) to the set of first-order examples. Each first-order example, in the form of an instantiated target clause, e.g. target(a1, ..., an), is converted into a numerical vector that a neural network can use as input. To achieve this, each example is transformed into a bottom clause and mapped onto features of an attribute-value table, and numerical vectors are then generated for each example. Thus, BCP has three steps: bottom clause generation, feature generation and attribute-value mapping.

Before describing each BCP step, we present three first-order concepts which are used in this work: clause, relational domain and bottom clause.

– A clause is a definition of relations between facts, with structure pt(V1t, ..., Vnt) :- p1(V11, ..., Vn1), p2(V12, ..., Vn2), ..., pm(V1m, ..., Vnm), where {pi, 1 ≤ i ≤ m} ∪ {pt} is a set of predicates (relations), each Vi is a tuple of variables, p1(V11, ..., Vn1), p2(V12, ..., Vn2), ..., pm(V1m, ..., Vnm) are literals, and :- represents implication. Literals on the left-hand side of the implication operator are known as head literals and literals on the right-hand side are known as body literals.

– A relational domain is a tuple <E, BK>, where E is a set of ground atoms of the target concept(s) (i.e. first-order logic examples), whose labels are truth-values, and BK is a set of clauses known as background knowledge, which can be facts (ground single-literal clauses that define what is known about a given task) or clauses, as defined above.

– A bottom clause is a boundary in the hypothesis search space during ILP learning [24], and is built from one random positive example, the background knowledge and the language bias (a set of clauses that define how clauses can be built in an ILP model). A bottom clause is the most specific clause (the one with the most literals) that can be considered a candidate hypothesis.

Having introduced these concepts, we are in a position to describe BCP. In the first step of BCP, bottom clause generation, each example ei from a first-order example set E is given to the bottom clause generation algorithm [25] to create a corresponding bottom clause set E⊥, containing one bottom clause ⊥i for each example ei. To do so, a slight modification of the algorithm is needed to allow the same hash function to be shared among all examples, in order to keep consistency between variable associations, and to allow negative examples to have bottom clauses as well; the original algorithm deals with positive examples only. An extensive description of the algorithm is provided in [13].

In order to illustrate each BCP step, we introduce a small family relationship relational domain [26], with background knowledge BK = {mother(mom1, daughter1), wife(daughter1, husband1), wife(daughter2, husband2)}, one positive example, motherInLaw(mom1, husband1), and one negative example, motherInLaw(daughter1, husband2). It can be noticed that the relation between mom1 and husband1, which the positive example establishes, can be alternatively described by the sequence of facts mother(mom1, daughter1) and wife(daughter1, husband1) in the background knowledge. This states semantically that mom1 is a mother-in-law because mom1 has a married daughter, namely, daughter1.
Applied to this example, the bottom clause generation algorithm would create the clause ⊥i = motherInLaw(A, B) :- mother(A, C), wife(C, B). Comparing ⊥i with the sequence of facts above, we notice that ⊥i describes one possible meaning of mother-in-law: "A is a mother-in-law of B if A is a mother of C and C is wife of B", i.e. the mother of a married daughter is a mother-in-law. BCP generates one such bottom clause, denoted ⊥e, for each (positive or negative) example e. At the end of the first step of BCP, we obtain a bottom clause set containing both positive and negative examples:

E⊥ = {motherInLaw(A, B) :- mother(A, C), wife(C, B); ∼motherInLaw(A, B) :- wife(A, C)}.

In the second step, feature generation, a feature table F is generated from E⊥. Earlier versions of CILP++ used bottom clause literals directly as features, but this approach can lead to inconsistencies if knowledge is to be extracted from models which used such features [27]. In order to tackle this, an adapted version of the first-order feature generation algorithm presented in [16] has been used to generate independent propositional features which represent first-order descriptions. To illustrate the second step of BCP, consider the following bottom clause R⊥:

motherInLaw(A, B) :- parent(A, C), wife(C, B), wife(A, D), brother(B, D)

Semi-propositionalization [16] is used to generate a set of first-order features for R⊥. First-order features are sets of literals that share variables which do not appear in any head literal. Those variables are known as local variables. From the family relationship example, the following features are obtained:

F1 = {parent(A, C), wife(C, B)}
F2 = {wife(A, D), brother(B, D)}

BCP treats each decomposition as a feature, and in the example above two clauses would be generated from the decomposition of R⊥:

L1(A, B) :- parent(A, C), wife(C, B)
L2(A, B) :- wife(A, D), brother(B, D)

Therefore, R⊥ can be rewritten as the following semi-propositional rule R′⊥:

motherInLaw(A, B) :- L1(A, B), L2(A, B)

If the only example to be propositionalized by BCP is the example r corresponding to R⊥, the feature table F would, at the end, contain only two elements: L1(A, B) and L2(A, B).

Lastly, in the third step of BCP, the feature table F is applied to E in order to generate binary vectors that a neural network can process. The algorithm, as implemented in CILP++, is as follows:

1. Let |F| be the number of elements in F;
2. Let Ev be the set of binary vectors, converted from E, initially empty;
3. For each example ei ∈ E do
   (a) For each feature fj ∈ F do
       i. Query the first-order description Lj corresponding to fj, for example ei, against the background knowledge BK of the relational domain;
       ii. If the query succeeds, assign 1 to position j of the binary vector vi; otherwise, assign 0;
   (b) Associate the label 1 with vi if ei is a positive example, and −1 otherwise;
   (c) Add vi to Ev;
4. Return Ev.

Continuing the family relationship example: |F| is equal to 2, since there are only two features in the table, L1(A, B) and L2(A, B). Since r contains both features in its bottom clause, vr = (1, 1). See [13] for more details.
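The attribute-value mapping above can be summarized by a short sketch. This is a minimal illustration only, not the CILP++ implementation: the Prolog query of each feature description Lj against BK is replaced here by a precomputed coverage table, and the negative example r2 is hypothetical, added only to show the −1 label.

# Minimal sketch of BCP's attribute-value mapping (step 3 above).
features = ["L1", "L2"]   # L1(A,B) :- parent(A,C), wife(C,B).   L2(A,B) :- wife(A,D), brother(B,D).

# Stand-in for querying each feature's first-order description against BK.
covers = {
    "r":  {"L1": True,  "L2": True},    # the example r discussed above
    "r2": {"L1": False, "L2": True},    # hypothetical negative example
}

def bcp_vectors(examples, features, covers):
    """Map first-order examples to labelled binary vectors (the set Ev)."""
    Ev = []
    for example, is_positive in examples:
        v = [1 if covers[example][f] else 0 for f in features]
        Ev.append((v, 1 if is_positive else -1))
    return Ev

print(bcp_vectors([("r", True), ("r2", False)], features, covers))
# [([1, 1], 1), ([0, 1], -1)]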
2.2 Neural Network Learning and Relational Knowledge Extraction

CILP++ uses resilient backpropagation [28] with early stopping [29] for learning. Resilient backpropagation takes into account only the sign of the partial derivative over all training examples (not its magnitude), and acts independently on each weight. For each weight, if the partial derivative of the total error function changed sign with respect to the last iteration, the update value for that weight is multiplied by a factor η−, where η− < 1. If the last iteration produced the same sign, the update value is multiplied by a factor η+, where η+ > 1. The update values are calculated for each weight in this manner, and finally each weight is changed by its own update value, in the opposite direction of that weight's partial derivative, so as to minimize the total error function. We set η+ and η− through validation.

With early stopping, training is stopped when the validation error measure starts to increase. We have used a more permissive version of early stopping [29], which does not halt training immediately after the validation error increases: it stops when a combined measure of the number of consecutive epochs with increasing validation set error and the absolute value of the current validation set error reaches a certain threshold.
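For concreteness, the per-weight update rule described above can be written as the following sketch. It is illustrative rather than the CILP++ code: the values of η+ and η− and the step bounds are placeholders (the paper sets η+ and η− by validation), and the common refinement of skipping the weight update immediately after a sign change is omitted.

def rprop_step(weight, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Update one weight using only the sign of its partial derivative."""
    if grad * prev_grad > 0:       # same sign as the last iteration: increase the step
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:     # sign change: decrease the step
        step = max(step * eta_minus, step_min)
    if grad > 0:                   # move against the sign of the derivative
        weight -= step
    elif grad < 0:
        weight += step
    return weight, step

# One weight, two consecutive gradients of the same sign: the step grows to about
# 0.12 and the weight moves from 0.3 to about 0.18.
print(rprop_step(0.3, grad=0.8, prev_grad=0.5, step=0.1))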
Following network training, in order to perform relational knowledge extraction, an adapted version of the TREPAN rule extractor [20] is applied to the trained neural network. TREPAN is originally an m-of-n propositional tree inducer which uses a learned neural network as an oracle and, through a set of examples S, possibly distinct from the example set used for training the neural network, recursively builds a decision tree based on an information gain heuristic. We adapted TREPAN to allow the generation of first-order rules that can be queried in Prolog [30], a well-known general-purpose logic programming language. Several simplifications have also been made in order to improve efficiency and readability. The adapted TREPAN pseudo-algorithm is shown in Algorithm 1, based on the original TREPAN algorithm [20]; the changes from the original are summarized after the listing.

Algorithm 1 Adapted TREPAN algorithm
1: Inputs: training set E, feature set F
2: for all examples e ∈ E do
3:    class label for e = Oracle(e)
4: end for
5: initialize the root of the tree, T, as a leaf node
6: initialize queue with tuple <T, E, {}>
7: while queue not empty and size(T) < tree size limit do
8:    remove node N from head of queue
9:    examplesn = example set stored with N
10:   constraintsn = constraint set stored with N
11:   use F to build set of candidate splits for node N
12:   use examplesn and calls to Oracle(constraintsn) to evaluate splits
13:   S = best binary split
14:   search for best m-of-n split, S′, using S as seed
15:   make N an internal node with split S′
16:   for all possible outcomes s ∈ S′ do
17:      make C, a new child node of N
18:      constraintsc = constraintsn ∪ {S′ = s}
19:      use calls to Oracle(constraintsc) to determine if C should remain a leaf
20:      if not then
21:         examplesc = members of examplesn with outcome s on split S′
22:         put <C, examplesc, constraintsc> into queue
23:      end if
24:   end for
25: end while
26: TH = empty
27: for all paths p in T do
28:    create a rule r containing a conjunction of the m-of-n split conditions s along p
29:    add all possible m-combinations of the n conditions inside each s as disjunctive body literals in r
30:    add r to TH
31: end for
32: order TH decreasingly by the number of examples covered by each path p
33: return a first-order theory TH

The adapted version of TREPAN presented in Algorithm 1 has the following differences with respect to the original TREPAN:

• Line 7: Tree generation has been simplified; only the maximum-size criterion is used to stop the process.
• Line 14: The search heuristic for the best m-of-n split is now weighted by the size of m: the original heuristic value for a given split is reduced by m/n.
• Lines 26-32: The m-of-n tree is transformed into a set of (possibly) disjunctive rules, in order to allow first-order inference with logic programming languages such as Prolog.

After extracting rules from the trained network (i.e. after obtaining TH), the definitions of the semi-propositional first-order features (the Li clauses) obtained during BCP are added to TH, resulting in a first-order theory that can be used to perform first-order inference.
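To make the last step concrete, the sketch below illustrates how one path of the tree is turned into rules (lines 27-31 of Algorithm 1). It is a minimal illustration, not the CILP++ code: the head, the 2-of-3 split and the feature names are made up, and each m-of-n split is expanded into one Prolog-style clause per disjunct, which is equivalent to a single rule with a disjunctive body.

from itertools import combinations, product

def path_to_rules(head, splits):
    """Turn the m-of-n splits along one tree path into Prolog-style clauses.
    Each split is a pair (m, conditions): it is satisfied by any m of its n
    conditions, so it expands into a disjunction over all C(n, m) conjunctions."""
    expansions = [list(combinations(conds, m)) for m, conds in splits]
    rules = []
    for choice in product(*expansions):   # one pick per split along the path
        body = ", ".join(lit for group in choice for lit in group)
        rules.append(head + " :- " + body + ".")
    return rules

# Hypothetical 2-of-3 split over BCP features F1, F2, F3.
for rule in path_to_rules("eastbound(A)", [(2, ["F1(A)", "F2(A)", "F3(A)"])]):
    print(rule)
# eastbound(A) :- F1(A), F2(A).
# eastbound(A) :- F1(A), F3(A).
# eastbound(A) :- F2(A), F3(A).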
In the following, the well-known east-west trains ILP dataset [31] is used to demonstrate how CILP++ performs relational learning, relational knowledge extraction and reasoning. In the first step of CILP++ (propositionalization with BCP), 20 bottom clauses were generated from the 10 positive and 10 negative examples of eastbound and westbound trains. From those bottom clauses, 41 features were obtained using semi-propositionalization. Therefore, 41 input neurons are created in CILP++'s initial neural network structure, each one representing one feature. A small sample of the generated bottom clauses, the features generated with BCP and the resulting initial neural network structure are presented in Figure 1.

Bottom clauses (sample):
  eastbound(A) :- has_car(A,B), long(B), wheels(B,2).
  eastbound(A) :- has_car(A,B), has_car(A,C), long(C), double(B).
  eastbound(A) :- has_car(A,B), has_car(A,C), long(B), shape(C,u_shaped).
  ...

Features:
  F1(A) :- has_car(A,B), long(B), wheels(B,2).
  F2(A) :- has_car(A,B), double(B).
  F3(A) :- has_car(A,B), long(B).
  F4(A) :- has_car(A,C), shape(C,u_shaped).

[Neural network diagram: input neurons F1, F2, F3, F4 connected to an output neuron eastbound.]

Figure 1: Bottom clauses and BCP features example for the east-west trains dataset.

After neural network training, the adapted TREPAN rule extractor (Algorithm 1) is used to generate first-order rules from the network. Leave-one-out cross-validation was used, i.e., 20 folds were generated from the 20 first-order examples. Figure 2 shows the resulting first-order theory. The first part of the generated theory consists of the extracted TREPAN rules, whilst the second part contains the semi-propositional clauses added by BCP.

Generated theory:
  eastbound(A) :- F1(A).
  eastbound(A) :- F2(A).

Semi-propositionalization clauses:
  F1(A) :- has_car(A,B), short(B), wheels(B,1).
  F2(A) :- has_car(A,B), short(B), closed(B), wheels(B,2), jagged(B).

Figure 2: Theory obtained with CILP++ for the east-west trains dataset.¹

The results obtained for the east-west trains are:

• Accuracy: 90% for the trained network and 90% for the extracted rules (the extracted rules were evaluated in Prolog, by querying the first-order examples);
• Fidelity: 95% fidelity between the trained neural network and the extracted rules. Fidelity is defined as the percentage of examples classified in the same way by both models. It does not matter whether the result is a hit or a miss for a given example e: as long as both the neural network and the rules classify e identically, it counts as a hit towards fidelity.

¹ Compare with the rules extracted by LINUS in [16]; our method seems to produce more readable rules.

Table 1: Accuracy (ACC) with standard deviations, area under ROC curve (AUC) and runtimes in seconds (RUN) for the Mutagenesis, UW-CSE, Alzheimer-amine and Cora datasets

SYSTEM   METRIC   Mutagenesis       UW-CSE            Alzheimer-amine   Cora
Aleph    ACC      80.85% (±10.51)   85.11% (±7.11)    78.71% (±5.25)    --
         AUC      0.80 (±0.02)      0.81 (±0.07)      0.79 (±0.09)      --
         RUN      721               875               5465              --
MLN      ACC      62.11% (±0.022)   75.01% (±0.028)   50.01% (±0.002)   70.10% (±0.191)
         AUC      0.67 (±0.022)     0.76 (±0.12)      0.51 (±0.02)      0.809 (±0.001)
         RUN      4341              1184              9811              97148
CILP++   ACC      91.70% (±5.84)    77.17% (±9.01)    79.91% (±2.12)    68.91% (±2.12)
         AUC      0.82 (±0.02)      0.77 (±0.07)      0.80 (±0.01)      0.79 (±0.02)
         RUN      851               742               3822              11435

Table 2: Rule accuracy (ACC) with standard deviations and fidelity (FID) for the Mutagenesis, UW-CSE, Alzheimer-amine and Cora datasets

SYSTEM   METRIC   Mutagenesis       UW-CSE            Alzheimer-amine   Cora
Aleph    ACC      80.85% (±10.51)   85.11% (±7.11)    78.71% (±5.25)    --
CILP++   ACC      77.72% (±2.12)    81.98% (±3.11)    78.70% (±6.12)    67.98% (±5.29)
         FID      0.89              0.90              0.88              0.82

3 Experimental Results

CILP++ has been tested empirically, and ten-fold cross-validation results for the trained neural network are reported in Table 1. CILP++ is compared against a well-known ILP system, Aleph [23], and against Markov Logic Networks (MLNs) [15]. Four relational domains have been used: Mutagenesis [7], UW-CSE [8], Alzheimer-amine [21] and Cora [22]. The same parameters as in [13] have been used for training CILP++. For Aleph, the settings suggested in [18] have been used. For MLNs, the results reported in three publications [15, 32, 33] have been collected. Lastly, TREPAN's tree size limit has been set to 5. All experiments were run on a 3.2 GHz Intel Core i3-2100 with 4 GB RAM.

Results show that CILP++ has accuracy and AUC measurements comparable with both Aleph and MLNs, while having considerably better runtimes. While CILP++ was able to run and generate competitive results on all tested datasets, Aleph ran out of memory on Cora. Standard MLNs performed very poorly on Alzheimer-amine [32] and had higher training times.
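The fidelity values reported in Table 2 can be computed as in the following minimal sketch; the two prediction vectors are illustrative only and do not correspond to any of the datasets above.

def fidelity(network_preds, rule_preds):
    """Fraction of examples classified identically by the trained network and
    by the extracted rules, regardless of whether either prediction is correct."""
    agree = sum(1 for n, r in zip(network_preds, rule_preds) if n == r)
    return agree / len(network_preds)

# Illustrative predictions only (1 = positive, -1 = negative).
network_preds = [1, 1, -1, 1, -1]
rule_preds    = [1, -1, -1, 1, -1]
print(fidelity(network_preds, rule_preds))   # 0.8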
Table 2 shows comparative results with Aleph obtained by querying in Prolog the first-order rules extracted from the trained CILP++ neural network; fidelity measurements are also provided. The results show that competitive accuracy with respect to Aleph has been maintained after extraction, and that good fidelity measures have been obtained in comparison with the trained neural network. This indicates that CILP++ neural networks are capable of efficiently solving relational tasks with BCP.

4 Concluding Remarks

In this paper, we have presented an integrated and efficient method and system for the extraction of first-order logic rules from neural networks. Experimental results show that the first-order rules extracted from trained neural networks are comparable, in terms of accuracy and AUC, with a well-known probabilistic system for relational learning, MLNs, and with a search-based ILP system, Aleph, while being considerably faster. These results indicate the promise of CILP++ as a relational learner.

Further comparisons with related work include the analysis of propositionalization-based systems such as LINUS/DINUS/SINUS [9, 11] and RelF [34], which rely on the quality of their feature generation to reduce the information loss of the propositionalization approach and, consequently, in the rules extracted from the learner. Both the LINUS/DINUS/SINUS family of ILP systems and RelF generate a number of constrained first-order features f from the Herbrand base H (the set of possible clauses for a given domain knowledge). From the collection of features f, a final set of features F is obtained for representing the training examples, according to a given score function. CILP++, on the other hand, uses the concept of a bottom clause, which is a clause that uniquely describes a single relational example. CILP++ uses bottom clauses to train a neural network, and an algorithm based on the concept of semi-propositionalization [16, 27] to generate F.

Approaches based on Bayesian networks [35] also perform relational learning, but represent the learned knowledge without the use of explicit relational rules. Statistical Relational Models [36] contain a rich representation language which combines a frame-based logical representation with probabilistic semantics based on Bayesian networks. BLOG [37] is a first-order probabilistic modeling language that specifies probability distributions over possible worlds with varying sets of objects. A BLOG model contains statements that define conditional probability distributions for a certain set of random variables; the model also specifies certain context-specific independence properties. Inference in BLOG is performed using Markov chain Monte Carlo [38] algorithms. In CILP++, inference is deterministic, with the first-order rules learned from the neural (statistical) model being applicable directly in a Prolog theorem prover for reasoning.

As future work, a study on how CILP++ deals with noisy datasets (noise in the background knowledge and/or examples) could provide interesting results, given how naturally backpropagation deals with incomplete data and noisy inputs. Also, an investigation of how CILP++ can be adapted to deal directly with numeric data could overcome a well-known flaw of ILP systems, namely their inability to deal directly with numbers: ILP systems use auxiliary predicates, such as greater-than and less-than, to indicate relations between numeric variables.

References
[1] S. Hölldobler and Y. Kalinke, "Towards a massively parallel computational model for logic programming," in Proceedings of the ECAI'94 Workshop on Combining Symbolic and Connectionist Processing, pp. 68–77, 1994.
[2] G. G. Towell and J. W. Shavlik, "Knowledge-Based Artificial Neural Networks," Artif. Intell., vol. 70, no. 1-2, pp. 119–165, 1994.
[3] A. S. D. Garcez and G. Zaverucha, "The Connectionist Inductive Learning and Logic Programming System," Applied Intelligence, vol. 11, pp. 59–77, 1999.
[4] R. Basilio, G. Zaverucha, and V. Barbosa, "Learning Logic Programs with Neural Networks," in Inductive Logic Programming, vol. 2157 of LNAI, pp. 15–26, Springer Berlin / Heidelberg, 2001.
[5] S. Džeroski and N. Lavrač, Relational Data Mining. Springer, 2001.
[6] L. De Raedt, Logical and Relational Learning. Cognitive Technologies, Springer, 2008.
[7] A. Srinivasan and S. H. Muggleton, "Mutagenesis: ILP experiments in a non-determinate biological domain," in Proceedings of the 4th International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, pp. 217–232, 1994.
[8] J. Davis, E. S. Burnside, I. de Castro Dutra, D. Page, and V. S. Costa, "An integrated approach to learning Bayesian networks of rules," in ECML (J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, eds.), vol. 3720 of Lecture Notes in Computer Science, pp. 84–95, Springer, 2005.
[9] N. Lavrač and S. Džeroski, Inductive logic programming: techniques and applications. Ellis Horwood series in artificial intelligence, E. Horwood, 1994.
[10] F. Železný and N. Lavrač, "Propositionalization-based Relational Subgroup Discovery With RSD," Machine Learning, vol. 62, pp. 33–63, 2006.
[11] S. Kramer, N. Lavrač, and P. Flach, "Propositionalization approaches to relational data mining," in Relational Data Mining, pp. 262–286, New York, NY, USA: Springer-Verlag New York, Inc., 2000.
[12] M.-A. Krogel, S. Rawles, F. Železný, P. Flach, N. Lavrač, and S. Wrobel, "Comparative Evaluation Of Approaches To Propositionalization," in ILP, vol. 2835 of LNAI, pp. 194–217, Springer-Verlag, 2003.
[13] M. V. M. França, G. Zaverucha, and A. S. d'Avila Garcez, "Fast relational learning using bottom clause propositionalization with artificial neural networks," Machine Learning, vol. 94, no. 1, pp. 81–104, 2014.
[14] M. V. M. França, A. S. D. Garcez, and G. Zaverucha, "Relational Knowledge Extraction from Attribute-Value Learners," in 2013 Imperial College Computing Student Workshop, vol. 35 of OpenAccess Series in Informatics (OASIcs), (Dagstuhl, Germany), pp. 35–42, Schloss Dagstuhl, 2013.
[15] M. Richardson and P. Domingos, "Markov logic networks," Machine Learning, vol. 62, pp. 107–136, 2006.
[16] N. Lavrač and P. A. Flach, "An extended transformation approach to inductive logic programming," ACM Trans. Comput. Logic, vol. 2, no. 4, pp. 458–494, 2001.
[17] B. Kijsirikul and B. K. Lerdlamnaochai, "First-Order Logical Neural Networks," Int. J. Hybrid Intell. Syst., vol. 2, pp. 253–267, Dec. 2005.
[18] A. Paes, F. Železný, G. Zaverucha, D. Page, and A. Srinivasan, "ILP Through Propositionalization and Stochastic k-Term DNF Learning," in ILP, vol. 4455 of LNAI, (Berlin, Heidelberg), pp. 379–393, Springer-Verlag, 2007.
[19] R. Basilio, G. Zaverucha, and A. S. Garcez, "Inducing Relational Concepts with Neural Networks via the LINUS System," in ICONIP, pp. 1507–1510, 1998.
[20] M. Craven and J. W. Shavlik, "Extracting Tree-Structured Representations of Trained Networks," in NIPS, pp. 24–30, 1995.
[21] R. King and A. Srinivasan, "Relating chemical activity to structure: An examination of ILP successes," New Generation Computing, vol. 13, no. 3-4, pp. 411–434, 1995.
[22] M. Bilenko and R. J. Mooney, "Adaptive duplicate detection using learnable string similarity measures," in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, (New York, NY, USA), pp. 39–48, ACM, 2003.
[23] A. Srinivasan, "The Aleph System, version 5." http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html, 2007. Last accessed May 2013.
[24] S. Muggleton, "Inverse Entailment and Progol," New Generation Computing, Special issue on Inductive Logic Programming, vol. 13, no. 3-4, pp. 245–286, 1995.
[25] A. Tamaddoni-Nezhad and S. Muggleton, "The lattice structure and refinement operators for the hypothesis space bounded by a bottom clause," Machine Learning, vol. 76, no. 1, pp. 37–72, 2009.
[26] S. Muggleton and L. D. Raedt, "Inductive Logic Programming: Theory and Methods," Journal of Logic Programming, vol. 19, no. 20, pp. 629–679, 1994.
[27] M. V. M. França, G. Zaverucha, and A. S. D. Garcez, "Neural relational learning through semi-propositionalization of bottom clauses," in AAAI Spring Symposium Series, 2015.
[28] R. A. Jacobs, "Increased rates of convergence through learning rate adaptation," Neural Networks, vol. 1, no. 4, pp. 295–307, 1988.
[29] L. Prechelt, "Early stopping - but when?," in Neural Networks: Tricks of the Trade, volume 1524 of LNCS, chapter 2, pp. 55–69, Springer-Verlag, 1997.
[30] R. A. Kowalski, "The early years of logic programming," Commun. ACM, vol. 31, pp. 38–43, Jan. 1988.
[31] J. Larson and R. S. Michalski, "Inductive inference of VL decision rules," SIGART Bull., no. 63, pp. 38–44, 1977.
[32] T. N. Huynh and R. J. Mooney, "Discriminative structure and parameter learning for Markov logic networks," in Proceedings of the 25th International Conference on Machine Learning, ICML '08, (New York, NY, USA), pp. 416–423, ACM, 2008.
[33] S. Kok and P. Domingos, "Learning the structure of Markov logic networks," in Proceedings of the 22nd International Conference on Machine Learning, (New York, NY, USA), pp. 441–448, ACM, 2005.
[34] O. Kuželka and F. Železný, "Block-wise construction of tree-like relational features with monotone reducibility and redundancy," Machine Learning, vol. 83, pp. 163–192, 2011.
[35] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[36] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, "Learning probabilistic relational models," in IJCAI, pp. 1300–1309, Springer-Verlag, 1999.
[37] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov, "BLOG: probabilistic models with unknown objects," in IJCAI, (San Francisco, CA, USA), pp. 1352–1359, Morgan Kaufmann Publishers Inc., 2005.
[38] A. E. Gelfand and A. F. M. Smith, "Sampling-based approaches to calculating marginal densities," Journal of the American Statistical Association, vol. 85, no. 410, pp. 398–409, 1990.