=Paper=
{{Paper
|id=Vol-2350/paper21
|storemode=property
|title=DeepLogic: Towards End-to-End Differentiable Logical Reasoning
|pdfUrl=https://ceur-ws.org/Vol-2350/paper21.pdf
|volume=Vol-2350
|authors=Nuri Cingillioglu,Alessandra Russo
|dblpUrl=https://dblp.org/rec/conf/aaaiss/CingilliogluR19
}}
==DeepLogic: Towards End-to-End Differentiable Logical Reasoning==
Nuri Cingillioglu and Alessandra Russo
{nuric, a.russo}@imperial.ac.uk
Department of Computing, Imperial College London

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Combining machine learning with logic-based expert systems in order to get the best of both worlds is becoming increasingly popular. However, to what extent machine learning can already learn to reason over rule-based knowledge is still an open problem. In this paper, we explore how symbolic logic, defined as logic programs at a character level, can be learned and represented in a high-dimensional vector space using RNN-based iterative neural networks to perform reasoning. We create a new dataset that defines 12 classes of logic programs exemplifying increasing levels of complexity of logical reasoning, and train the networks in an end-to-end fashion to learn whether a logic program entails a given query. We analyse how learning the inference algorithm gives rise to representations of atoms, literals and rules within logic programs, and evaluate against increasing lengths of predicate and constant symbols as well as increasing steps of multi-hop reasoning.

Introduction

There has been increasing interest in combining machine learning with logic-based expert systems in order to get the best of both worlds: an end-to-end trainable system that requires little data to learn how to reason, possibly using existing background knowledge, and is interpretable by humans (Besold et al. 2017). However, it is not clear to what extent machine learning can learn logic-based reasoning without requiring any prior engineering. Thus, in this work we provide a first insight into how symbolic logic, represented as logic programs at a character level, can be learned using RNN-based iterative neural networks to perform reasoning.
The key motivation is that if we can learn symbolic reasoning using machine learning then we might be able to minimise the extent to which the two have to be combined manually. The focus is not to extract the prior symbolic hypotheses but to learn how to utilise a given knowledge base; thus, we do not compare within the domain of inductive logic programming (Muggleton and De Raedt 1994).

A crucial component in existing logic-based expert systems is the ability to iteratively use a given knowledge base. Framed as multi-hop reasoning, these tasks require an agent to process information over several steps to reach a conclusion. This paradigm of reasoning has been performed by neural networks in domains such as story comprehension with Dynamic Memory Networks (Xiong, Merity, and Socher 2016), graph problems with Differentiable Neural Computers (Graves et al. 2016) and visual question answering with Relation Networks (Santoro et al. 2017) and Memory Attention Control Networks (Hudson and Manning 2018).

Traditionally, inference systems over logic programs are manually built on algorithms such as backward chaining (Apt and Van Emden 1982) or forward chaining (Russell and Norvig 2016). There have been attempts to partially replace symbolic components of these systems with neural networks, such as Neural Prolog (Ding 1995), which constructs networks through symbolic applications of logic rules. Another vital component, the unification of variables, has been tackled by Unification Neural Networks (Komendantskaya 2011). However, neither of these networks acts as a complete inference engine that can be trained end-to-end solely using examples and learn corresponding representations of symbolic knowledge.

Another important aspect of machine learning that allows scalability is learning dense distributional representations such as word embeddings (Pennington, Socher, and Manning 2014). This approach has been used to learn embeddings of predicates to perform knowledge base completion (Serafini and Garcez 2016); however, an existing reasoning mechanism such as backward chaining (Rocktäschel and Riedel 2017) guides the machine learning component to learn such embeddings. We instead build on prior work on iterative neural networks that learn algorithmic tasks such as addition of digits (Graves, Wayne, and Danihelka 2014) and sequence processing (Zaremba et al. 2016). As a result, we learn reasoning tasks from scratch, without prior engineering, alongside representations of knowledge bases expressed as logic programs.

This paper provides the following contributions: (i) a new synthetic dataset consisting of 12 classes of normal logic programs, (ii) an iterative neural inference network that learns different reasoning tasks end-to-end and (iii) an analysis of how these networks represent logic programs and handle multi-hop reasoning. Our implementation is publicly available online at https://github.com/nuric/deeplogic.

Background

A normal logic program (Apt and Bol 1994) is a set of rules of the form:

A ← L1, ..., Ln (n ≥ 0)   (1)

where A is an atom and each Li is a literal. A literal is either an atom (a positive literal) or a negated atom of the form not A, where not is negation by failure (Clark 1978). Atom A represents the head and L1, ..., Ln the body of a rule. An atom is a predicate with some arity, e.g. p(X,Y), q(a), where variables are upper case characters and predicates and constants are lower case. If a rule does not contain any variables we refer to it as a ground rule, and ground rules with empty bodies as facts. We follow a similar syntax to Prolog (Clocksin and Mellish 2003) and express normal logic programs as:

p(X) :- q(X), -r(X).   (2)
q(a).   (3)

As in equation 2, the ← is replaced with :- and the negation by failure not with - while maintaining the same semantics. When there are no body literals we omit the implication entirely, as in equation 3.

We define the logical reasoning process as an inference function f, equation 4, that given a normal logic program (without function symbols) as context C and a ground atom as query Q returns a value in {1, 0} = B depending on whether the context entails the query, C ⊢ Q, or not. We can then define negation by failure, not Q, using the outcome of the corresponding positive ground query Q, equation 5.

f(C, Q) = 1 if C ⊢ Q, and 0 otherwise   (4)
f(C, not Q) = 1 − f(C, Q)   (5)

In order to learn the reasoning process, we define an auxiliary objective for the neural network and consider the inference function f to be the conditional probability of the query given the context, equation 6. From a machine learning perspective, this renders the problem as binary classification and allows training with the standard cross-entropy loss.

f(C, Q) := p(Q | C)   (6)

The Tasks

Inspired by the bAbI dataset (Weston et al. 2015), we lay out 12 tasks that cover various aspects of normal logic programs. The tasks are of increasing complexity, building on concepts seen in prior tasks. Every task consists of triples (context, query, target) following the signature of the inference function in equation 4. For compactness, a single context can have multiple queries, expressed in the form ? query target after the context. The contexts contain the simplest possible rules that cover the required reasoning procedure and are not mixed together, akin to unit tests in software engineering.

Table 1: Sample programs from tasks 1 to 7.
1: Facts: e(l). i(u). n(d). v(h,y). p(n). | ? e(l). 1 | ? i(d). 0
2: Unification: o(V,V). i(x,z). y(w,d). p(a,b). t(A,U). | ? o(d,d). 1 | ? o(b,d). 0
3: 1 Step: g(L,S) :- x(S,L). x(a,m). y(X) :- r(X). p(h). s(t,v). | ? g(m,a). 1 | ? g(a,m). 0
4: 2 Steps: x(G,B) :- k(G,B). k(Z,V) :- g(Z,V). g(k,k). e(k,s). p(L,G) :- v(G,L). | ? x(k,k). 1 | ? x(k,s). 0
5: 3 Steps: p(P,R) :- b(R,P). b(A,L) :- a(A,L). a(W,F) :- v(F,W). v(t,i). l(D) :- t(D). | ? p(t,i). 1 | ? p(i,t). 0
6: Logical AND: f(P,U) :- b(P), p(U). b(x). p(x). p(y). e(y,v). | ? f(x,x). 1 | ? f(y,x). 0
7: Logical OR: e(D,X) :- n(D,X). e(D,X) :- w(D,X). e(w,y). n(n,j). w(t). | ? e(n,j). 1 | ? e(i,j). 0
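To make the sample layout concrete, the following is a minimal sketch of a parser for the ? query target format shown in Table 1. The function name and the exact line-based layout are illustrative assumptions rather than the released data format.

```python
def parse_sample(text):
    """Split a sample like those in Table 1 into a context and queries.

    Lines starting with '?' hold a query and its target, e.g. '? e(l). 1';
    all other lines are rules that make up the context. The exact layout
    is an assumption based on the samples shown in Table 1.
    """
    context, queries = [], []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("?"):
            _, literal, target = line.split()  # '?', 'e(l).', '1'
            queries.append((literal, int(target)))
        else:
            context.append(line)
    return context, queries

sample = """e(l).
i(u).
? e(l). 1
? i(d). 0"""
ctx, qs = parse_sample(sample)
assert qs == [("e(l).", 1), ("i(d).", 0)]
```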
Each task is generated using a fixed procedure from which samples are drawn. For constants and predicates we use the lower case English alphabet, and for variables upper case. The character sequences that make up predicates, constants and variables can be of arbitrary length (only examples of length 1 are shown in the tables), but we generate lengths up to 2 for the training dataset and longer lengths for the test data. The arity of the atoms is selected randomly between 1 and 2. For every sample we also generate irrelevant rules as noise; these always have different predicates and random structure while still preserving the semantics of the task.

Facts: The simplest task consists only of facts. There is only one successful case, in which the query appears in the context, and a query can fail in 3 different ways: (i) the constant might not match (shown in Table 1), (ii) the predicate might not match or (iii) the query might not be in the context at all. These failures can cascade, and a query can fail for multiple reasons with equal probability.

Unification: These tasks contain rules with empty bodies and an atom with variables in the head. The intended objective is to emphasise the semantics of unification between different variables p(X,Y) and same variables p(X,X). The query succeeds if the corresponding variables unify and fails otherwise. The failure case in which the same variable does not match different constants is shown in Table 1.

N Step Deduction: One prominent feature of logical reasoning is the application of rules that contain literals in the body. These tasks cover the case when there is a single positive atom in the body. All such rules contain only variables, and chains of arbitrary steps can be generated. For the training dataset we generate up to 3 steps; samples are in Table 1. The query succeeds when the body of the last rule in the chain is grounded with the same constants as in the query. We occasionally swap the variables, p(X,Y) :- q(Y,X)., to emphasise the variable binding aspect of rules; this can happen at any rule in the chain or not at all. The failure cases occur when the swapped constants do not match or when the final body literal in the chain fails for reasons covered in the first task.
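As an illustration of this generation procedure, here is a minimal sketch that draws a simplified task 3 (1 step) sample; symbol lengths are fixed to 1 and noise rules are omitted, so it under-approximates the full procedure described above.

```python
import random
import string

def gen_one_step_sample(swap=False):
    """Generate a simplified task-3 sample: a rule p(X,Y) :- q(X,Y).
    (optionally with swapped body variables) plus a ground fact, with
    one succeeding and one failing query. Symbol lengths are fixed to 1
    and noise rules are omitted for brevity."""
    p, q = random.sample(string.ascii_lowercase, 2)  # distinct predicates
    a, b = random.sample(string.ascii_lowercase, 2)  # distinct constants
    body_vars = "Y,X" if swap else "X,Y"
    context = [f"{p}(X,Y) :- {q}({body_vars}).",
               f"{q}({a},{b})."]
    # The succeeding query grounds the head consistently with the fact;
    # the failing one uses the constants in the wrong order.
    good = f"{p}({b},{a})." if swap else f"{p}({a},{b})."
    bad = f"{p}({a},{b})." if swap else f"{p}({b},{a})."
    return context, [(good, 1), (bad, 0)]

print(gen_one_step_sample(swap=True))
```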
Table 2: Sample programs from tasks 8 to 12.
8: Transitivity: f(A,W) :- q(A,P), d(P,W). q(h,t). d(t,j). q(d,m). d(n,g). s(S,F) :- x(S,A), e(A,F). | ? f(h,j). 1 | ? f(d,g). 0
9: 1 Step NBF: s(X,J) :- -p(J,X). p(e,x). v(V,Q) :- u(V,Q). o(N) :- -q(N). t(x,e). m(y,c). | ? s(x,e). 0 | ? s(e,x). 1
10: 2 Step NBF: r(C) :- -o(C). o(P) :- l(P). l(o). g(u). p(U,L) :- e(U,L). p(X,X). | ? r(u). 1 | ? r(o). 0
11: AND NBF: b(G,B) :- -i(G), u(B). i(w). g(a). u(a). f(t). l(W) :- a(W), d(W). | ? b(a,a). 1 | ? b(w,a). 0
12: OR NBF: y(Z) :- -e(Z). y(Z) :- b(Z). y(r). e(d). s(a). b(m). | ? y(a). 1 | ? y(d). 0

Logical AND: Building upon the previous deduction tasks, we create rules with 2 body literals to capture the logical ∧ semantics in rules; a sample is in Table 1. The reasoning engine now has to keep track of and prove both body literals to succeed. A failure occurs when one randomly selected body literal fails for reasons similar to the first task.

Logical OR: Having multiple matching heads captures the semantics of ∨ in logic programming by creating several possible paths to a successful proof; a sample is in Table 1. In this task we branch the query predicate 3 ways: 2 implications and 1 ground rule. The query succeeds when any of the rules succeeds and fails when all the matching rules fail.

Transitivity: Based on the previous 2 tasks, the transitive case covers existential variable binding. It requires the model to represent the conjunction of the body literals of a rule and match multiple possible facts. The case succeeds when the inner variable unifies with the ground instances and fails otherwise. We expect a reasoning engine to solve the previous 2 tasks in order to solve this one; a sample is in Table 2.

N-Step Deduction with Negation: These tasks introduce the concept of negation by failure. The body literal of the first rule in the chain is negated and a chain of arbitrary length is generated. For the training dataset we only generate proof chains of length 2; samples are in Table 2. The query succeeds when the negated body atom fails: the swapped variables do not match the constants, or the body literal of the final rule in the chain fails for reasons similar to the first task. The query fails whenever the negated body atom succeeds, following the semantics described by equation 5.

Logical AND with Negation: After introducing negation by failure, we consider negation together with logical ∧ and randomly negate one of the body literals; a sample is in Table 2. As such, the query succeeds when the negated body atom fails and the other literal succeeds. The query can fail when either body literal fails, similar to the non-negated case.

Logical OR with Negation: Finally, we consider negation together with the logical ∨ case and negate the body literal of one rule. The query succeeds when any matching rule except the negated one succeeds, and fails if the negated rule succeeds while the other matching rules fail; a sample is in Table 2.

Neural Reasoning Networks

In this section we describe an RNN-based iterative neural network for learning the inference function f, equation 4. Broadly, we call networks that learn logical reasoning Neural Reasoning Networks; the objective is to learn how C ⊢ Q is computed solely from examples. The primary challenge for these networks is to have a fixed network architecture that must process all the tasks, unlike tree based recurrent networks (Tai, Socher, and Manning 2015) or graph networks (Battaglia et al. 2018) which construct a different network dependent on each input. We place this constraint to avoid engineering any prior structural information into the network. We gather inspiration from Memory Networks (Weston, Chopra, and Bordes 2015); in particular we wanted to incorporate the end-to-end approach (Sukhbaatar et al. 2015) and the iterative fashion of Dynamic Memory Networks (DMN) (Kumar et al. 2016) while following the steps of a symbolic reasoning method such as backward chaining (Apt and Van Emden 1982).

Figure 1: Graphical overview of the iterative cell of the Iterative Memory Attention (IMA) model. The context and query are processed at the character level to produce literal embeddings, then an attention is computed over the heads of the rules. A weighted sum of the unifier GRU outputs, using the attention, updates the state for the next iteration.
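Since the objective is to learn how C ⊢ Q is computed solely from examples, it helps to pin down the target semantics explicitly. Below is a minimal reference sketch of equations 4 and 5 by backward chaining over ground programs only (variables and unification are omitted for brevity); it illustrates the semantics the networks must learn and is not part of the model.

```python
def entails(context, query, depth=10):
    """Reference semantics of f(C, Q) from equations 4 and 5 for ground
    programs only. A rule is a pair ('head', ['lit', '-lit', ...]);
    a fact has an empty body. The depth bound guards against cycles."""
    if depth == 0:
        return False
    for head, body in context:
        if head != query:
            continue
        ok = True
        for lit in body:
            if lit.startswith("-"):  # negation by failure, equation 5
                ok = not entails(context, lit[1:], depth - 1)
            else:
                ok = entails(context, lit, depth - 1)
            if not ok:
                break
        if ok:  # any matching rule suffices (logical OR)
            return True
    return False

# y(Z) :- -e(Z). with e(d). gives y(a) = 1 and y(d) = 0, as in task 12,
# once the program is grounded over the constants {a, d}.
ctx = [("y(a)", ["-e(a)"]), ("y(d)", ["-e(d)"]), ("e(d)", [])]
assert entails(ctx, "y(a)") and not entails(ctx, "y(d)")
```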
Design: We can consider the logic program context a read-only memory and the proof state a writable memory component. In a similar fashion to Prolog's backward chaining algorithm (Clocksin and Mellish 2003), we aim to have (i) a state to store information about the proof, such as the query, (ii) a mechanism to select rules via attention and (iii) a component to update the state with respect to the rules. To that end, we introduce the Iterative Memory Attention (IMA) network which, given a normal logic program as context and a positive ground atom as query, embeds the literals in a high dimensional vector space, attends to rules using soft attention and updates the state using a recurrent network. IMA should be considered a variant of Memory Networks, designed to suit the logical reasoning process, in which the recurrent component is iterated over literals to perform reasoning; a graphical overview is shown in Figure 1.

Literal Embedding: The inputs to the network are two sequences of characters, c_0^C, ..., c_m^C and c_0^Q, ..., c_n^Q, for the context and query respectively. The principal motivation behind having character level inputs rather than symbol tokens is to constrain the network to learn sub-symbolic representations that could potentially extend to previously unseen literals. We pre-process the context sequence to separate it into literals and obtain an input tensor I^C of shape R × L × m' of characters encoded as positive integers, where R is the number of rules, L the number of literals and m' the length of the literals. The query is a single ground positive atom encoded as a vector I^Q of length n. This pre-processing allows the network to consider each literal independently when iterating over rules, giving it finer control over the reasoning process.
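A minimal sketch of this pre-processing, assuming a small character vocabulary; zero is reserved for padding and the literal-splitting heuristic only handles the simple syntax of Tables 1 and 2.

```python
import re
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ(),-"
CHAR2INT = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding

def encode_context(rules, max_lits, max_len):
    """Build the integer tensor I^C of shape (R, L, m') described above.
    Rules are split into head and body literals; commas are treated as
    literal separators only when they follow a closing parenthesis."""
    tensor = np.zeros((len(rules), max_lits, max_len), dtype=np.int64)
    for r, rule in enumerate(rules):
        head, _, body = rule.rstrip(".").partition(":-")
        literals = [head.strip()]
        literals += [l.strip() for l in re.split(r"(?<=\))\s*,\s*", body) if l.strip()]
        for l, lit in enumerate(literals[:max_lits]):
            for c, ch in enumerate(lit[:max_len]):
                tensor[r, l, c] = CHAR2INT[ch]
    return tensor

ic = encode_context(["p(X) :- q(X), -r(X).", "q(a)."], max_lits=3, max_len=8)
print(ic.shape)  # (2, 3, 8)
```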
h_t = GRU(O[I_{::t}], h_{t-1})   (7)

Each literal is embedded using a recurrent neural network that processes only the characters of that literal I_{::t}, equation 7, where O[I_{::t}] is the one-hot encoding of the characters. We use a gated recurrent unit (GRU) (Cho et al. 2014) starting with h_0 = 0, processing the characters in reverse order to emphasise the predicate, and take the final hidden state h_t to be the embedding of the literal l ∈ R^d, where d is the fixed dimension of the embedding. The context and query are embedded using the same network, yielding a context tensor C ∈ R^{R×L×d}, where R is the number of rules and L the number of literals in a rule; for the query we obtain a vector q ∈ R^d. In order for the network to propagate the state unchanged, we append a null sentinel φ = 0 which allows the network to ignore the current reasoning step by carrying over the state. We also append a blank rule that acts as a learnable parameter and is often attended to when no other rule is suitable.

Iteration: The iterative step consists of attending to the rules, computing a new state using each rule and updating the old state. The network is iterated for T steps, fixed in advance, with the initial state s_0 ∈ R^d set to the query vector, s_0 = q.

cat_i = [s_t ; q ; r_i ; (s_t − r_i)^2 ; s_t ⊙ r_i]   (8)
α_i^t = σ(W^{1×d/2} (U^{d/2×5d} cat_i + b^{d/2}) + b^1)   (9)

At any time step t, we compute a feature vector cat_i ∈ R^{5d} for every head of a rule, r_i = C_{i0}, using the current state s_t ∈ R^d, equation 8, where [;] is the concatenation operator and ⊙ element-wise multiplication. We also experiment with embedding rules using another GRU over the literals, h'_{ij} = GRU(C_{ij}, h'_{i,j-1}), taking the final state r_i = h'_{iL} as the representation of the rule. To compute the final attention vector α_i^t, we use a two layer feed-forward network, equation 9, where σ is the sigmoid function. We also experiment with a softmax formulation of the attention vector α_i^t after the two layer feed-forward network.

u_{ij}^t = GRU(C_{ij}, u_{i(j-1)}^t)   (10)
s_{t+1} = Σ_{i=1}^{R} α_i^t u_{iL}^t   (11)

To apply a rule, we use another recurrent neural network that processes every literal of every rule C_{ij}. The initial hidden state u_{i0}^t = s_t is set to the current state, then for every rule a GRU is used to compute the new hidden states u_{ij}^t, equation 10. Finally, the new state s_{t+1} becomes the weighted sum of the final hidden states u_{iL}^t, equation 11. We call the inner GRU the unifier, as it needs to learn unification between variables and constants as well as how each rule interacts with the current state.

Figure 2: Attention maps produced for query p(a), for IMA with softmax attention performing backward chaining in the left column and IMA with literal + rule embedding performing forward chaining in the right column, on tasks 5 to 7.
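Equations 8 to 11 can be rendered compactly in code. The following is a minimal PyTorch sketch of a single reasoning step with sigmoid attention; it paraphrases the description above and is not the authors' released implementation, whose framework and details may differ.

```python
import torch
import torch.nn as nn

class IMACell(nn.Module):
    """One reasoning step of equations 8-11: attend over rule heads,
    run the unifier GRU over each rule's literals, take a weighted sum."""
    def __init__(self, d):
        super().__init__()
        # Two linear layers with a sigmoid outside, as in equation 9.
        self.att = nn.Sequential(nn.Linear(5 * d, d // 2), nn.Linear(d // 2, 1))
        self.unifier = nn.GRUCell(d, d)  # equation 10

    def forward(self, C, q, s):
        # C: (R, L, d) literal embeddings, q: (d,) query, s: (d,) state
        R, L, _ = C.shape
        r = C[:, 0, :]  # rule heads r_i = C_i0
        feats = torch.cat([s.expand(R, -1), q.expand(R, -1), r,
                           (s - r) ** 2, s * r], dim=1)  # equation 8
        alpha = torch.sigmoid(self.att(feats))           # (R, 1), equation 9
        u = s.expand(R, -1)                              # u_i0 = s_t
        for j in range(L):                               # equation 10 over literals
            u = self.unifier(C[:, j, :], u)
        return (alpha * u).sum(dim=0)                    # equation 11: new state

cell = IMACell(d=64)
s_next = cell(torch.randn(4, 3, 64), torch.randn(64), torch.randn(64))  # R=4, L=3
```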
Experiments

We carry out experiments on individual tasks with variations on our model. As a baseline, we use an LSTM (Hochreiter and Schmidhuber 1997) to process the query and then the context to predict the answer. We also compare our model against the Dynamic Memory Network (DMN) (Kumar et al. 2016) and the Memory Attention Control (MAC) network (Hudson and Manning 2018), which both incorporate iterative components achieving state-of-the-art results on visual question answering datasets. With DMN and MAC, the context is separated into rules and the entire rule is embedded using a GRU at the character level. Unlike DMN, which uses another GRU to accumulate information over the rule embeddings and the current state, our variant IMA processes literal embeddings using a GRU to compute the new state as a weighted sum over each rule.

Training: We use two training regimes: curriculum learning (Bengio et al. 2009) and multi-task. With curriculum learning the model is trained in an incremental fashion, starting with tasks 1 and 2 with only 1 iteration. Tasks then accumulate with increasing numbers of iterations: tasks 3, 7, 9 and 12 with 2 iterations, and tasks 4, 6, 8 and 11 with 3 iterations. We determine the minimum number of iterations required for a task based on Prolog (Clocksin and Mellish 2003). Finally all tasks are introduced with a maximum of 4 iterations. The multi-task approach trains on the entire dataset with 4 iterations fixed in advance. Models are trained via back-propagation using Adam (Kingma and Ba 2014) for 120 epochs (10 per task) with a mini-batch size of 32 and a training dataset size of 20k logic programs per task. We ensure a mini-batch contains at least 1 sample from each training task to avoid any bias towards a task in a given mini-batch. Logic programs are shuffled after each epoch, and rules within a context are also shuffled with every mini-batch. Since we have access to the data generating distribution, we do not use any regularisation in any of the models and have increased the training dataset size accordingly to avoid overfitting (Goodfellow, Bengio, and Courville 2017).

We generate 4 test sets of increasing difficulty: validation, easy, medium and hard, which have up to 2, 4, 8 and 12 characters for predicates and constants, as well as an increasing number of irrelevant rules, respectively. Each test set consists of 10k generated logic programs per task, and results for the best single training run out of 3 for each model with state size d = 64 on the easy set are shown in Table 3.

Table 3: Results on the easy test set (10k each), d = 64, sm = softmax, σ = sigmoid.
Training      |      |       Multi-task              |       Curriculum
Model         | LSTM | MAC  DMN  IMA  IMA  IMA       | MAC  DMN  IMA  IMA  IMA
Embedding     | -    | rule rule lit  lit  lit+rule  | rule rule lit  lit  lit+rule
Attention     | -    | sm   σ    σ    sm   sm        | sm   σ    σ    sm   sm
Facts         | 0.61 | 0.84 1.00 1.00 1.00 0.98      | 0.89 1.00 1.00 0.99 0.94
Unification   | 0.53 | 0.86 0.87 0.90 0.87 0.85      | 0.83 0.85 0.88 0.88 0.86
1 Step        | 0.57 | 0.90 0.74 0.98 0.94 0.95      | 0.77 0.62 0.96 0.93 0.92
2 Steps       | 0.56 | 0.81 0.67 0.95 0.95 0.94      | 0.70 0.58 0.95 0.91 0.89
3 Steps       | 0.57 | 0.78 0.77 0.94 0.94 0.94      | 0.64 0.64 0.93 0.86 0.87
AND           | 0.65 | 0.84 0.80 0.95 0.94 0.85      | 0.81 0.70 0.80 0.78 0.83
OR            | 0.62 | 0.85 0.87 0.97 0.96 0.93      | 0.75 0.75 0.96 0.93 0.90
Transitivity  | 0.50 | 0.50 0.50 0.50 0.52 0.52      | 0.50 0.50 0.50 0.50 0.50
1 Step NBF    | 0.58 | 0.92 0.79 0.98 0.94 0.95      | 0.65 0.58 0.96 0.91 0.92
2 Steps NBF   | 0.57 | 0.83 0.85 0.96 0.93 0.95      | 0.57 0.73 0.95 0.90 0.90
AND NBF       | 0.55 | 0.82 0.84 0.92 0.93 0.85      | 0.61 0.61 0.71 0.77 0.83
OR NBF        | 0.53 | 0.74 0.75 0.86 0.86 0.86      | 0.59 0.63 0.86 0.83 0.84
Easy Mean     | 0.57 | 0.81 0.79 0.91 0.90 0.88      | 0.69 0.68 0.87 0.85 0.85
Medium Mean   | 0.52 | 0.70 0.70 0.86 0.81 0.79      | 0.60 0.61 0.81 0.76 0.74
Hard Mean     | 0.51 | 0.63 0.66 0.83 0.75 0.72      | 0.55 0.58 0.76 0.70 0.68

We observe that all iterative models perform better than the baseline except for task 8, transitivity, which all models fail at. We speculate the reason is that the models have not seen existential variable binding for more than 2 character constants and fail to generalise in this particular case. We note that the curriculum training regime has no benefit for any model, most likely because we introduce new, unseen tasks with each iteration such that models with prior training have no advantage; for example, solving OR in 2 iterations does not improve the solution for AND in 3 iterations. Embedding literals seems to provide an advantage over embedding rules, since all IMA models outperform both DMN and MAC when we consider the mean accuracy over every test set. We postulate that literal embeddings give a finer view and allow for better generalisation over increasing lengths, as embedding rules with literals (lit+rule) also degrades the performance. Although our variant IMA performs the best on all the test sets, all models quickly degrade in mean accuracy as the difficulty increases. We speculate that a finite sized state vector stores limited information about unseen unique sequences of increasing lengths, and analyse this behaviour further in the next section.

Figure 2 portrays the attention maps produced by IMA with softmax attention. Although by default models converge to backward chaining, by (i) reversing the direction of the unifier GRU, equation 10, and (ii) skipping task 2, we can create a bias towards ground facts that have a matching constant. This approach encourages our model IMA with rule embeddings (lit+rule) to converge to forward chaining (Russell and Norvig 2016), albeit with more training time (thus results for forward chaining are not included in Table 3). This observation emphasises the fact that a fixed network architecture can be flexible enough to learn two different solutions for the same domain of problems given the right incentives.
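For clarity, the curriculum schedule in the Training paragraph above can be written out explicitly; the stage representation below is an illustrative rendering of the schedule as stated, not code from the released implementation.

```python
# Curriculum stages as described above: (tasks trained on, max iterations).
# Task numbering follows Tables 1 and 2; tasks accumulate across stages.
CURRICULUM = [
    ({1, 2}, 1),
    ({1, 2, 3, 7, 9, 12}, 2),
    ({1, 2, 3, 4, 6, 7, 8, 9, 11, 12}, 3),
    (set(range(1, 13)), 4),  # finally all 12 tasks with up to 4 iterations
]

for stage, (tasks, iters) in enumerate(CURRICULUM, 1):
    print(f"stage {stage}: train tasks {sorted(tasks)} with {iters} iteration(s)")
```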
Analysis

Since models are trained in an end-to-end fashion, the representations for logical constructs that help perform reasoning must also be learned. In this section, we scrutinise the learnt representations of the embedding GRU, equation 7, for IMA and the rule embeddings of DMN.

An important assumption made in the dataset is that every predicate and constant combination is unique. We expect this setup to create a formation that would allow every literal to be differentiated. As such, we observe a lattice like structure when the embeddings of single and double character atoms are visualised in 3 dimensions using principal component analysis (PCA), Figure 3. The first two components select the predicate and the final one the constant, uniquely identifying the literal with a clear distinction between single and double character predicates. This arrangement is in contrast with distributional representations that exploit similarities between entities, such as natural language embeddings (Mikolov et al. 2013). In our case, p(a) might be more similar to p(b) than to pp(a), although all are deemed unique.

Figure 3: First, second and last principal components of embeddings of single character literals form a lattice like structure with predicates clustered in vertical columns and constants on horizontal surfaces; from IMAsm.
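The projection behind Figure 3 can be reproduced along the following lines, assuming a trained literal-embedding function embed standing in for the GRU of equation 7; the helper names are hypothetical.

```python
from itertools import product
from string import ascii_lowercase

import numpy as np
from sklearn.decomposition import PCA

def project_literals(embed, dims=(0, 1, -1)):
    """Embed all single-character unary literals p(c) and project them
    onto the first, second and last principal components, as in Figure 3.
    `embed` maps a literal string to a d-dimensional vector (assumed)."""
    literals = [f"{p}({c})" for p, c in product(ascii_lowercase, repeat=2)]
    X = np.stack([embed(lit) for lit in literals])
    comps = PCA(n_components=X.shape[1]).fit_transform(X)
    return literals, comps[:, list(dims)]

# Random stand-in embedding, only to make the sketch runnable end to end.
lits, points = project_literals(lambda s: np.random.randn(64))
print(len(lits), points.shape)  # 676 (676, 3)
```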
As the embedding sizes are finite, we also expect a saturation in the embedding space as the length of the literals increases. We capture this phenomenon by repeating the character p 64 times and observe a converging pattern, shown in Figure 4. We note how odd and even length predicates converge to their own respective points, suggesting that the embeddings produced by the GRU, equation 7, learn parity.

Figure 4: Repeating character predicates saturate the embedding and converge to respective points; equidistant lines are plotted in grey; from IMAsm.

If we take structurally different literals, we observe a preference towards whether literals are negated or grounded, then the arity of the atoms, Figure 5a. We believe this clustering captures the literal semantics available in the dataset within 4 major clusters. Furthermore, if we look at the rule embeddings learnt by DMN, we notice a similar clustering based on rule structure, again with groundedness, negation, arity and the number of body literals as the learnt distinguishing features, Figure 5b. The multiple points within a cluster correspond to different predicates following a similar ordering; for example, predicate p is often the upper most point.

Figure 5: Principal component analysis (PCA) of learnt representations. (a) Structurally different literals first cluster by whether they are negated or grounded, then by arity (grey lines added as visual aids); from IMAsm. (b) Rule embeddings form clusters based on their structure, with a distinction between negated and non-negated rules (grey line added as visual aid); from DMN.

To evaluate multi-hop reasoning, we generate logic programs that require an increasing number of steps of deduction. While the training data contains up to 3 steps, we generate up to 32 steps and run the networks for n + 1 iterations. We obtain similar results to other recurrent neural network based systems (Zaremba et al. 2016) and observe a steady decrease in accuracy beyond the training line (grey dashed line), shown in Figure 6, in which imasm and imarsm indicate IMA with softmax attention and with rule embedding respectively. We speculate that with each iteration step the state representation degrades, eventually losing information, since models tend to produce noisy attention maps or state transformations. Our IMA model with softmax attention maintains the highest accuracy, most likely due to the sparsity created by the softmax operation.

Figure 6: When models are iterated beyond the training number of steps (grey line) to perform increasing steps of deduction, the accuracy degrades for all models.

To evaluate generalisation to unseen symbols, we take task 3 and generate random predicate or constant symbols of increasing character lengths up to 64, well beyond the training dataset length of 2. Although we observed that literal embeddings can saturate, models can cope with longer randomly generated predicate or constant symbols, Figure 7a, since looking at only a few characters can determine uniqueness. This reflects our intuition that looking at portions of sequences might be enough to determine equality, rather than memorising them entirely, e.g. looking at the last few digits of phone numbers to check if they are the same.

In order to understand how the embedding and state dimension d affects the models, we experimented with sizes 32, 48, 64, 96 and 128, running each curriculum training session 3 times for our IMA model. Figure 7b shows that increasing the dimension size does not contribute to a clear increase in mean accuracy over all tasks, and the drop in accuracy across the easy, medium and hard test sets follows a similar pattern for every dimension. Despite the initial increase beyond d = 32, we get the highest upper bound in accuracy with d = 64, for which the individual results are in Table 3.

We designed the tasks to be of increasing difficulty, building on previously seen problems such as unification before deduction. We expected models trained using an iterative scheme to generalise better as the network depth increased gradually. However, when we average the mean accuracy for all dimensions, for all 3 runs of IMA with softmax attention across the test sets, we do not discern any advantage for curriculum training over multi-task training beyond the validation set. Figure 7c shows a similar decrease in performance across the test sets for both training regimes. We believe this result stems from introducing new tasks with each iteration and models not having any incentive to abstract subtasks until the next set of unseen tasks is incorporated into the training dataset.

Figure 7: Accuracy plots over increasing symbol lengths, state dimension and training regime. (a) The models can cope, in particular IMA with literal embeddings, when predicate and constant symbols of increasing length are randomly generated. (b) Mean accuracy over all tasks against increasing embedding dimension d shows no clear increase beyond d = 64. (c) Mean accuracy of training regimes applied to IMAsm against test sets across all dimensions shows no advantage of curriculum training for generalisation.

Discussion

Given the simple, unit test like nature of the logic programs generated, the lack of robust results for iterative neural networks further encourages a combination of machine learning and symbolic systems, rather than encompassing one with the other. In this section, we try to provide additional discussion and insight into why one might or might not learn symbolic reasoning from scratch using iterative neural networks.
Whilst the basic logic programs presented in this paper can all be solved using existing symbolic systems such as Prolog (Clocksin and Mellish 2003), we struggle to see a comparable performance from neural networks. Increasing the number of steps in task 3, as shown in Figure 6, demonstrates the fragile nature of using continuous representations for rigid symbolic reasoning. The embedding space is inherently limited in capacity as it has a fixed number of dimensions. Although a 64 dimensional vector can in theory encode a lot of information, in practice neural networks trained using back-propagation do not seem to have any guarantees on how efficiently they will learn representations in a vector space. We speculate that this creates an accumulative noise with each iteration which eventually degrades the performance. On the other hand, the learnt continuous representations scale to very long, previously unseen symbols, Figure 7a, which is a desirable property of neuro-symbolic systems.

Attention mechanisms allow state-of-the-art memory based networks to address and operate an external memory (Weston, Chopra, and Bordes 2015). In the case of neural reasoning networks, the attention components provide the means to select rules. An interesting outcome is that all neural reasoning models try to attend to multiple rules when applicable, for example when there are two matching rules, as shown in Figure 2. Unlike Prolog, this approach allows the neural networks to simultaneously explore multiple possible branches, completing the reasoning task in fewer steps. However, we speculate such an attention mechanism will become the bottleneck when there are large numbers, possibly hundreds, of matching rules, such as in knowledge base completion tasks, since it will try to aggregate all matching rules in a single step. An explicit strong supervision of the attention mechanism, similar to the original DMN work (Kumar et al. 2016), or a hierarchical architecture might be needed to encourage a form of iterative backtracking.
Lack of abstraction between seemingly related tasks limits the performance on unseen logic programs. During curriculum training, we noticed models which solve tasks 1, 2 and 3 have no advantage in solving task 4, because the final state representation from task 3 specifically looks for a ground instance such as p(a) to complete the task. However, in task 4 the model needs to match another rule p(X) :- and iterate once more. When presented with task 4, models most likely have to adjust the intermediate state representation to check for the second rule case; as a result, Figure 7c can also be interpreted as the lack of abstraction over increasingly complex tasks in the dataset, since curriculum learning provides no clear advantage. In other words, the neural models do not seem to learn a reasoning process that is general enough to be compared to existing symbolic systems; consequently, we were unable to run them on existing logic program benchmarks or against Prolog.

Related Work

There have been many attempts to combine symbolic reasoning with machine learning techniques under the name of neural-symbolic systems (Garcez, Broda, and Gabbay 2012). Lifted Relational Neural Networks (Sourek et al. 2015) ground clauses, turning them into propositional programs, when constructing networks as a set of weighted definite clauses. We do not pre-process programs to ground variables, and Neural Reasoning Networks must learn unification. TensorLog (Cohen 2016) constructs "factor graphs" from logic rules which in return create the network that runs on one-hot encoded constants. Our approach does not factor, compile or use any implicit knowledge about first-order logical inference or rule applications.
We also only one-hot encode characters, not entire predicates or constants, and only give labels 1 or 0 as targets, training end-to-end using the same neural network with the same architecture and weights for every normal logic program. Logic Tensor Networks (Serafini and Garcez 2016) tackle knowledge completion "on a simple but representative example", also grounding every term in the program prior to performing reasoning.

Following the above mentioned works, Neural Theorem Provers (Rocktäschel and Riedel 2017) learn distributional representations of predicates and constants by symbolically constructing the relationship between embeddings using an existing symbolic inference engine. We design our tasks such that the neural networks attempt to learn not only representations at a character level but also the reasoning algorithm, with no help or knowledge of existing reasoning engines and methods.

Learning similarities between constants can be used to perform link-prediction tasks (Nickel et al. 2016) and knowledge base completion (Socher et al. 2013), but creates unwanted inferences when similar constants should indeed be unique (Rocktäschel et al. 2014). Although we set every constant to be unique, we expect embeddings of similar constants to cluster during training if the data entails the same conclusions. Creating differentiable logical inference networks can also induce rules (Evans and Grefenstette 2018); however, at this stage we do not learn logical rules alongside reasoning tasks and assume they are given to the model. Possible World Net (Evans et al. 2018) follows a unique approach to learning propositional logic entailment using a semantic, possible-worlds interpretation; however, they exploit the syntax of logical formulae by parsing them and constructing the neural network in a tree like manner in which nodes correspond to logical operators. The works mentioned so far are designed for either deductive databases, relation learning, link prediction, knowledge base completion or propositional programs; thus our task of learning reasoning over, and embeddings of, normal logic programs using a fixed RNN-based iterative network is inherently different, making a direct empirical comparison unreasonable.

Neural Reasoning Networks can be seen as interpreters of logic programs, as rules can act like instructions. This perspective reflects systems such as Neural Programmer-Interpreters (Reed and De Freitas 2015) and Neural Symbolic Machines (Liang et al. 2016). These systems contain discrete operations that allow more complex actions to be performed, overcoming the problem of a degrading state representation; however, they require a reinforcement learning setting to train. We believe a reinforcement learning approach applied to our dataset would learn a similar algorithm to that of Prolog (Clocksin and Mellish 2003).

Conclusion

We presented a new synthetic dataset and provided insights into how machine learning might encompass symbolic reasoning, defined as logic programs, using RNN-based iterative neural networks. Fully differentiable models trained end-to-end have inherent disadvantages: they seem to lose track when the number of iterations is increased, and the embedding space is limited in capacity. However, such networks might still hold the key to incorporating symbolic prior knowledge into a continuous space by understanding how that embedding space is organised to store symbolic information.

Since such neural networks provide a differentiable but approximate reasoning engine over logic programs, in the future we hope to induce rules using continuous embeddings of logical rules by propagating gradients back to the context. Initially, however, a more robust neural reasoning network must be learned, one that is comparable in performance to existing logic-based expert systems.

References
Apt, K. R., and Bol, R. N. 1994. Logic programming and negation: A survey. The Journal of Logic Programming 19:9–71.
Apt, K. R., and Van Emden, M. H. 1982. Contributions to the theory of logic programming. Journal of the ACM 29(3):841–862.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261.
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. ACM.
Besold, T. R.; Garcez, A. d.; Bader, S.; Bowman, H.; Domingos, P.; Hitzler, P.; Kühnberger, K.-U.; Lamb, L. C.; Lowd, D.; Lima, P. M. V.; et al. 2017. Neural-symbolic learning and reasoning: A survey and interpretation. arXiv:1711.03902.
Cho, K.; Van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259.
Clark, K. L. 1978. Negation as failure. In Logic and Data Bases. Springer. 293–322.
Clocksin, W. F., and Mellish, C. S. 2003. Programming in PROLOG. Springer Science & Business Media.
Cohen, W. W. 2016. TensorLog: A differentiable deductive database. arXiv:1605.06523.
Ding, L. 1995. Neural Prolog - the concepts, construction and mechanism. In Systems, Man and Cybernetics, 1995. Intelligent Systems for the 21st Century, IEEE International Conference on, volume 4, 3603–3608. IEEE.
Evans, R., and Grefenstette, E. 2018. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61:1–64.
Evans, R.; Saxton, D.; Amos, D.; Kohli, P.; and Grefenstette, E. 2018. Can neural networks understand logical entailment? arXiv:1802.08535.
Garcez, A. S. d.; Broda, K. B.; and Gabbay, D. M. 2012. Neural-Symbolic Learning Systems: Foundations and Applications. Springer Science & Business Media.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2017. Deep Learning. The MIT Press.
Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471.
Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural Turing machines. arXiv:1410.5401.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Hudson, D. A., and Manning, C. D. 2018. Compositional attention networks for machine reasoning. arXiv:1803.03067.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Komendantskaya, E. 2011. Unification neural networks: unification by error-correction learning. Logic Journal of the IGPL 19(6):821–847.
Kumar, A.; Irsoy, O.; Ondruska, P.; Iyyer, M.; Bradbury, J.; Gulrajani, I.; Zhong, V.; Paulus, R.; and Socher, R. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 1378–1387.
Liang, C.; Berant, J.; Le, Q.; Forbus, K. D.; and Lao, N. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv:1611.00020.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, 3111–3119.
Muggleton, S., and De Raedt, L. 1994. Inductive logic programming: Theory and methods. The Journal of Logic Programming 19:629–679.
Nickel, M.; Rosasco, L.; Poggio, T. A.; et al. 2016. Holographic embeddings of knowledge graphs. In AAAI, 1955–1961.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.
Reed, S., and De Freitas, N. 2015. Neural programmer-interpreters. arXiv:1511.06279.
Rocktäschel, T., and Riedel, S. 2017. End-to-end differentiable proving. In NIPS, 3791–3803.
Rocktäschel, T.; Bošnjak, M.; Singh, S.; and Riedel, S. 2014. Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, 45–49.
Russell, S., and Norvig, P. 2016. Artificial Intelligence: A Modern Approach (3rd Edition). Pearson.
Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A simple neural network module for relational reasoning. In NIPS, 4974–4983.
Serafini, L., and Garcez, A. d. 2016. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv:1606.04422.
Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 926–934.
Sourek, G.; Aschenbrenner, V.; Zelezny, F.; and Kuzelka, O. 2015. Lifted relational neural networks. arXiv:1508.05128.
Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In NIPS, 2440–2448.
Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv:1503.00075.
Weston, J.; Bordes, A.; Chopra, S.; Rush, A. M.; van Merriënboer, B.; Joulin, A.; and Mikolov, T. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv:1502.05698.
Weston, J.; Chopra, S.; and Bordes, A. 2015. Memory networks. In ICLR.
Xiong, C.; Merity, S.; and Socher, R. 2016. Dynamic memory networks for visual and textual question answering. In ICML, 2397–2406.
Zaremba, W.; Mikolov, T.; Joulin, A.; and Fergus, R. 2016. Learning simple algorithms from examples. In ICML, 421–429.