=Paper=
{{Paper
|id=Vol-2350/paper21
|storemode=property
|title=DeepLogic: Towards End-to-End Differentiable Logical Reasoning
|pdfUrl=https://ceur-ws.org/Vol-2350/paper21.pdf
|volume=Vol-2350
|authors=Nuri Cingillioglu,Alessandra Russo
|dblpUrl=https://dblp.org/rec/conf/aaaiss/CingilliogluR19
}}
==DeepLogic: Towards End-to-End Differentiable Logical Reasoning==
Nuri Cingillioglu and Alessandra Russo
{nuric, a.russo}@imperial.ac.uk
Department of Computing, Imperial College London

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Abstract

Combining machine learning with logic-based expert systems in order to get the best of both worlds is becoming increasingly popular. However, to what extent machine learning can already learn to reason over rule-based knowledge is still an open problem. In this paper, we explore how symbolic logic, defined as logic programs at a character level, can be learned and represented in a high-dimensional vector space using RNN-based iterative neural networks to perform reasoning. We create a new dataset that defines 12 classes of logic programs exemplifying increasing levels of complexity of logical reasoning, and train the networks in an end-to-end fashion to learn whether a logic program entails a given query. We analyse how learning the inference algorithm gives rise to representations of atoms, literals and rules within logic programs, and evaluate against increasing lengths of predicate and constant symbols as well as increasing steps of multi-hop reasoning.

Introduction

There has been increasing interest in combining machine learning with logic-based expert systems in order to get the best of both worlds: an end-to-end trainable system that requires little data to learn how to reason, possibly using existing background knowledge, and is interpretable by humans (Besold et al. 2017). However, it is not clear to what extent machine learning can learn logic-based reasoning without requiring any prior engineering. Thus, in this work we provide a first insight into how symbolic logic, represented as logic programs at a character level, can be learned using RNN-based iterative neural networks to perform reasoning.
The key motivation is that if we can learn symbolic reasoning using machine learning then we might be able to minimise the extent to which the two have to be combined manually. The focus is not to extract the prior symbolic hypotheses but to learn how to utilise a given knowledge base; thus, we do not compare within the domain of inductive logic programming (Muggleton and De Raedt 1994).

A crucial component in existing logic-based expert systems is the ability to iteratively use a given knowledge base. Framed as multi-hop reasoning, these tasks require an agent to process information over several steps to reach a conclusion. This paradigm of reasoning has been performed by neural networks in domains such as story comprehension with Dynamic Memory Networks (Xiong, Merity, and Socher 2016), graph problems with Differentiable Neural Computers (Graves et al. 2016) and visual question answering with Relation Networks (Santoro et al. 2017) and Memory Attention Control Networks (Hudson and Manning 2018).

Traditionally, inference systems over logic programs are manually built on algorithms such as backward chaining (Apt and Van Emden 1982) or forward chaining (Russell and Norvig 2016). There have been attempts to partially replace symbolic components of these systems with neural networks, such as Neural Prolog (Ding 1995), which constructs networks through symbolic applications of logic rules. Another vital component, the unification of variables, has been tackled by Unification Neural Networks (Komendantskaya 2011). However, neither of these networks acts as a complete inference engine that can be trained end-to-end solely using examples and learn corresponding representations of symbolic knowledge.

Another important aspect of machine learning that allows scalability is learning dense distributional representations such as word embeddings (Pennington, Socher, and Manning 2014). This approach has been used to learn embeddings of predicates to perform knowledge base completion (Serafini and Garcez 2016); however, an existing reasoning mechanism such as backward chaining (Rocktäschel and Riedel 2017) guides the machine learning component to learn such embeddings. We instead build on prior work on iterative neural networks that learn algorithmic tasks such as addition of digits (Graves, Wayne, and Danihelka 2014) and sequence processing (Zaremba et al. 2016). As a result, we learn reasoning tasks from scratch, without prior engineering, alongside representations of knowledge bases expressed as logic programs.

This paper provides the following contributions: (i) a new synthetic dataset consisting of 12 classes of normal logic programs, (ii) an iterative neural inference network that learns different reasoning tasks end-to-end and (iii) an analysis of how these networks represent logic programs and handle multi-hop reasoning. Our implementation is publicly available online at https://github.com/nuric/deeplogic.

Background

A normal logic program (Apt and Bol 1994) is a set of rules of the form:

A ← L1, ..., Ln (n ≥ 0)   (1)

where A is an atom and each Li is a literal. A literal is either an atom (a positive literal) or a negated atom of the form not A, where not is negation by failure (Clark 1978). Atom A represents the head and L1, ..., Ln the body of a rule. An atom is a predicate with some arity, e.g. p(X,Y), q(a), where variables are upper case characters and predicates and constants are lower case. If a rule does not contain any variables we refer to it as a ground rule, and ground rules with empty bodies as facts. We follow a similar syntax to Prolog (Clocksin and Mellish 2003) and express normal logic programs as:

p(X) :- q(X), -r(X).   (2)
q(a).   (3)

As in equation 2, the ← is replaced with :- and the negation by failure not with - while maintaining the same semantics. When there are no body literals we omit the implication entirely, as in equation 3.

We define the logical reasoning process as an inference function f, equation 4, that given a normal logic program (without function symbols) as context C and a ground atom as query Q returns a value in {1, 0} = B depending on whether the context entails the query, C ⊢ Q, or not. We can then define negation by failure, not Q, using the outcome of the corresponding positive ground query Q, equation 5.

f(C, Q) = 1 if C ⊢ Q, and 0 otherwise   (4)
f(C, not Q) = 1 − f(C, Q)   (5)

In order to learn the reasoning process, we define an auxiliary objective for the neural network and consider the inference function f to be the conditional probability of the query given the context, equation 6. From a machine learning perspective, this renders the problem as binary classification and allows training with the standard cross-entropy loss.

f(C, Q) := p(Q | C)   (6)

The Tasks

Inspired by the bAbI dataset (Weston et al. 2015), we lay out 12 tasks that cover various aspects of normal logic programs. The tasks are of increasing complexity, building on concepts seen in prior tasks. Every task consists of triples (context, query, target) following the signature of the inference function in equation 4. For compactness, a single context can have multiple queries, expressed in the form ? query target after the context. The contexts contain the simplest possible rules that cover the required reasoning procedure and are not mixed together, akin to unit tests in software engineering.

Table 1: Sample programs from tasks 1 to 7.
1: Facts: e(l). i(u). n(d). v(h,y). p(n). | ? e(l). 1 | ? i(d). 0
2: Unification: o(V,V). i(x,z). y(w,d). p(a,b). t(A,U). | ? o(d,d). 1 | ? o(b,d). 0
3: 1 Step: g(L,S) :- x(S,L). x(a,m). y(X) :- r(X). p(h). s(t,v). | ? g(m,a). 1 | ? g(a,m). 0
4: 2 Steps: x(G,B) :- k(G,B). k(Z,V) :- g(Z,V). g(k,k). e(k,s). p(L,G) :- v(G,L). | ? x(k,k). 1 | ? x(k,s). 0
5: 3 Steps: p(P,R) :- b(R,P). b(A,L) :- a(A,L). a(W,F) :- v(F,W). v(t,i). l(D) :- t(D). | ? p(t,i). 1 | ? p(i,t). 0
6: Logical AND: f(P,U) :- b(P), p(U). b(x). p(x). p(y). e(y,v). | ? f(x,x). 1 | ? f(y,x). 0
7: Logical OR: e(D,X) :- n(D,X). e(D,X) :- w(D,X). e(w,y). n(n,j). w(t). | ? e(n,j). 1 | ? e(i,j). 0
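To make the sample layout concrete, the following is a minimal sketch of a parser for the ? query target format shown in Table 1. The function name and the exact line-based layout are illustrative assumptions rather than the released data format.

```python
def parse_sample(text):
    """Split a sample like those in Table 1 into a context and queries.

    Lines starting with '?' hold a query and its target, e.g. '? e(l). 1';
    all other lines are rules that make up the context. The exact layout
    is an assumption based on the samples shown in Table 1.
    """
    context, queries = [], []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("?"):
            _, literal, target = line.split()  # '?', 'e(l).', '1'
            queries.append((literal, int(target)))
        else:
            context.append(line)
    return context, queries

sample = """e(l).
i(u).
? e(l). 1
? i(d). 0"""
ctx, qs = parse_sample(sample)
assert qs == [("e(l).", 1), ("i(d).", 0)]
```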
Each task is generated using a fixed procedure from which samples are drawn. For constants and predicates we use the lower case English alphabet, and for variables upper case. The character sequences that make up predicates, constants and variables can be of arbitrary length (only examples of length 1 are shown in the tables), but we generate lengths up to 2 for the training dataset and longer lengths for the test data. The arity of the atoms is selected randomly between 1 and 2. For every sample we also generate irrelevant rules as noise; these always have different predicates and random structure while still preserving the semantics of the task.

Facts: The simplest task consists only of facts. There is only one successful case, in which the query appears in the context, and a query can fail in 3 different ways: (i) the constant might not match (shown in Table 1), (ii) the predicate might not match or (iii) the query might not be in the context at all. These failures can cascade, and a query can fail for multiple reasons with equal probability.

Unification: These tasks contain rules with empty bodies and an atom with variables in the head. The intended objective is to emphasise the semantics of unification between different variables p(X,Y) and same variables p(X,X). The query succeeds if the corresponding variables unify and fails otherwise. The failure case in which the same variable does not match different constants is shown in Table 1.

N Step Deduction: One prominent feature of logical reasoning is the application of rules that contain literals in the body. These tasks cover the case when there is a single positive atom in the body. All such rules contain only variables, and chains of arbitrary steps can be generated. For the training dataset we generate up to 3 steps; samples are in Table 1. The query succeeds when the body of the last rule in the chain is grounded with the same constants as in the query. We occasionally swap the variables, p(X,Y) :- q(Y,X)., to emphasise the variable binding aspect of rules; this can happen at any rule in the chain or not at all. The failure cases occur when the swapped constants do not match or when the final body literal in the chain fails for reasons covered in the first task.
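As an illustration of this generation procedure, here is a minimal sketch that draws a simplified task 3 (1 step) sample; symbol lengths are fixed to 1 and noise rules are omitted, so it under-approximates the full procedure described above.

```python
import random
import string

def gen_one_step_sample(swap=False):
    """Generate a simplified task-3 sample: a rule p(X,Y) :- q(X,Y).
    (optionally with swapped body variables) plus a ground fact, with
    one succeeding and one failing query. Symbol lengths are fixed to 1
    and noise rules are omitted for brevity."""
    p, q = random.sample(string.ascii_lowercase, 2)  # distinct predicates
    a, b = random.sample(string.ascii_lowercase, 2)  # distinct constants
    body_vars = "Y,X" if swap else "X,Y"
    context = [f"{p}(X,Y) :- {q}({body_vars}).",
               f"{q}({a},{b})."]
    # The succeeding query grounds the head consistently with the fact;
    # the failing one uses the constants in the wrong order.
    good = f"{p}({b},{a})." if swap else f"{p}({a},{b})."
    bad = f"{p}({a},{b})." if swap else f"{p}({b},{a})."
    return context, [(good, 1), (bad, 0)]

print(gen_one_step_sample(swap=True))
```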
Table 2: Sample programs from tasks 8 to 12.
8: Transitivity: f(A,W) :- q(A,P), d(P,W). q(h,t). d(t,j). q(d,m). d(n,g). s(S,F) :- x(S,A), e(A,F). | ? f(h,j). 1 | ? f(d,g). 0
9: 1 Step NBF: s(X,J) :- -p(J,X). p(e,x). v(V,Q) :- u(V,Q). o(N) :- -q(N). t(x,e). m(y,c). | ? s(x,e). 0 | ? s(e,x). 1
10: 2 Step NBF: r(C) :- -o(C). o(P) :- l(P). l(o). g(u). p(U,L) :- e(U,L). p(X,X). | ? r(u). 1 | ? r(o). 0
11: AND NBF: b(G,B) :- -i(G), u(B). i(w). g(a). u(a). f(t). l(W) :- a(W), d(W). | ? b(a,a). 1 | ? b(w,a). 0
12: OR NBF: y(Z) :- -e(Z). y(Z) :- b(Z). y(r). e(d). s(a). b(m). | ? y(a). 1 | ? y(d). 0

Logical AND: Building upon the previous deduction tasks, we create rules with 2 body literals to capture the logical ∧ semantics in rules; a sample is in Table 1. The reasoning engine now has to keep track of and prove both body literals to succeed. A failure occurs when one randomly selected body literal fails for reasons similar to the first task.

Logical OR: Having multiple matching heads captures the semantics of ∨ in logic programming by creating several possible paths to a successful proof; a sample is in Table 1. In this task we branch the query predicate 3 ways: 2 implications and 1 ground rule. The query succeeds when any of the rules succeeds and fails when all the matching rules fail.

Transitivity: Based on the previous 2 tasks, the transitive case covers existential variable binding. It requires the model to represent the conjunction of the body literals of a rule and match multiple possible facts. The case succeeds when the inner variable unifies with the ground instances and fails otherwise. We expect a reasoning engine to solve the previous 2 tasks in order to solve this one; a sample is in Table 2.

N-Step Deduction with Negation: These tasks introduce the concept of negation by failure. The body literal of the first rule in the chain is negated and a chain of arbitrary length is generated. For the training dataset we only generate proof chains of length 2; samples are in Table 2. The query succeeds when the negated body atom fails: the swapped variables do not match the constants, or the body literal of the final rule in the chain fails for reasons similar to the first task. The query fails whenever the negated body atom succeeds, following the semantics described by equation 5.

Logical AND with Negation: After introducing negation by failure, we consider negation together with logical ∧ and randomly negate one of the body literals; a sample is in Table 2. As such, the query succeeds when the negated body atom fails and the other literal succeeds. The query can fail when either body literal fails, similar to the non-negated case.

Logical OR with Negation: Finally, we consider negation together with the logical ∨ case and negate the body literal of one rule. The query succeeds when any matching rule except the negated one succeeds, and fails if the negated rule succeeds while the other matching rules fail; a sample is in Table 2.

Neural Reasoning Networks

In this section we describe an RNN-based iterative neural network for learning the inference function f, equation 4. Broadly, we call networks that learn logical reasoning Neural Reasoning Networks; the objective is to learn how C ⊢ Q is computed solely from examples. The primary challenge for these networks is to have a fixed network architecture that must process all the tasks, unlike tree based recurrent networks (Tai, Socher, and Manning 2015) or graph networks (Battaglia et al. 2018) which construct a different network dependent on each input. We place this constraint to avoid engineering any prior structural information into the network. We gather inspiration from Memory Networks (Weston, Chopra, and Bordes 2015); in particular we wanted to incorporate the end-to-end approach (Sukhbaatar et al. 2015) and the iterative fashion of Dynamic Memory Networks (DMN) (Kumar et al. 2016) while following the steps of a symbolic reasoning method such as backward chaining (Apt and Van Emden 1982).

Figure 1: Graphical overview of the iterative cell of the Iterative Memory Attention (IMA) model. The context and query are processed at the character level to produce literal embeddings, then an attention is computed over the heads of the rules. A weighted sum of the unifier GRU outputs, using the attention, updates the state for the next iteration.
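Since the objective is to learn how C ⊢ Q is computed solely from examples, it helps to pin down the target semantics explicitly. Below is a minimal reference sketch of equations 4 and 5 by backward chaining over ground programs only (variables and unification are omitted for brevity); it illustrates the semantics the networks must learn and is not part of the model.

```python
def entails(context, query, depth=10):
    """Reference semantics of f(C, Q) from equations 4 and 5 for ground
    programs only. A rule is a pair ('head', ['lit', '-lit', ...]);
    a fact has an empty body. The depth bound guards against cycles."""
    if depth == 0:
        return False
    for head, body in context:
        if head != query:
            continue
        ok = True
        for lit in body:
            if lit.startswith("-"):  # negation by failure, equation 5
                ok = not entails(context, lit[1:], depth - 1)
            else:
                ok = entails(context, lit, depth - 1)
            if not ok:
                break
        if ok:  # any matching rule suffices (logical OR)
            return True
    return False

# y(Z) :- -e(Z). with e(d). gives y(a) = 1 and y(d) = 0, as in task 12,
# once the program is grounded over the constants {a, d}.
ctx = [("y(a)", ["-e(a)"]), ("y(d)", ["-e(d)"]), ("e(d)", [])]
assert entails(ctx, "y(a)") and not entails(ctx, "y(d)")
```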
Design: We can consider the logic program context a read-only memory and the proof state a writable memory component. In a similar fashion to Prolog's backward chaining algorithm (Clocksin and Mellish 2003), we aim to have (i) a state to store information about the proof, such as the query, (ii) a mechanism to select rules via attention and (iii) a component to update the state with respect to the rules. To that end, we introduce the Iterative Memory Attention (IMA) network which, given a normal logic program as context and a positive ground atom as query, embeds the literals in a high dimensional vector space, attends to rules using soft attention and updates the state using a recurrent network. IMA should be considered a variant of Memory Networks, designed to suit the logical reasoning process, in which the recurrent component is iterated over literals to perform reasoning; a graphical overview is shown in Figure 1.

Literal Embedding: The inputs to the network are two sequences of characters, c_0^C, ..., c_m^C and c_0^Q, ..., c_n^Q, for the context and query respectively. The principal motivation behind having character level inputs rather than symbol tokens is to constrain the network to learn sub-symbolic representations that could potentially extend to previously unseen literals. We pre-process the context sequence to separate it into literals and obtain an input tensor I^C of shape R × L × m' of characters encoded as positive integers, where R is the number of rules, L the number of literals and m' the length of the literals. The query is a single ground positive atom encoded as a vector I^Q of length n. This pre-processing allows the network to consider each literal independently when iterating over rules, giving it finer control over the reasoning process.
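A minimal sketch of this pre-processing, assuming a small character vocabulary; zero is reserved for padding and the literal-splitting heuristic only handles the simple syntax of Tables 1 and 2.

```python
import re
import numpy as np

CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ(),-"
CHAR2INT = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 is reserved for padding

def encode_context(rules, max_lits, max_len):
    """Build the integer tensor I^C of shape (R, L, m') described above.
    Rules are split into head and body literals; commas are treated as
    literal separators only when they follow a closing parenthesis."""
    tensor = np.zeros((len(rules), max_lits, max_len), dtype=np.int64)
    for r, rule in enumerate(rules):
        head, _, body = rule.rstrip(".").partition(":-")
        literals = [head.strip()]
        literals += [l.strip() for l in re.split(r"(?<=\))\s*,\s*", body) if l.strip()]
        for l, lit in enumerate(literals[:max_lits]):
            for c, ch in enumerate(lit[:max_len]):
                tensor[r, l, c] = CHAR2INT[ch]
    return tensor

ic = encode_context(["p(X) :- q(X), -r(X).", "q(a)."], max_lits=3, max_len=8)
print(ic.shape)  # (2, 3, 8)
```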
h_t = GRU(O[I_{::t}], h_{t-1})   (7)

Each literal is embedded using a recurrent neural network that processes only the characters of that literal I_{::t}, equation 7, where O[I_{::t}] is the one-hot encoding of the characters. We use a gated recurrent unit (GRU) (Cho et al. 2014) starting with h_0 = 0, processing the characters in reverse order to emphasise the predicate, and take the final hidden state h_t to be the embedding of the literal l ∈ R^d, where d is the fixed dimension of the embedding. The context and query are embedded using the same network, yielding a context tensor C ∈ R^{R×L×d}, where R is the number of rules and L the number of literals in a rule; for the query we obtain a vector q ∈ R^d. In order for the network to propagate the state unchanged, we append a null sentinel φ = 0 which allows the network to ignore the current reasoning step by carrying over the state. We also append a blank rule that acts as a learnable parameter and is often attended to when no other rule is suitable.

Iteration: The iterative step consists of attending to the rules, computing a new state using each rule and updating the old state. The network is iterated for T steps, fixed in advance, with the initial state s_0 ∈ R^d set to the query vector, s_0 = q.

cat_i = [s_t ; q ; r_i ; (s_t − r_i)^2 ; s_t ⊙ r_i]   (8)
α_i^t = σ(W^{1×d/2} (U^{d/2×5d} cat_i + b^{d/2}) + b^1)   (9)

At any time step t, we compute a feature vector cat_i ∈ R^{5d} for every head of a rule, r_i = C_{i0}, using the current state s_t ∈ R^d, equation 8, where [;] is the concatenation operator and ⊙ element-wise multiplication. We also experiment with embedding rules using another GRU over the literals, h'_{ij} = GRU(C_{ij}, h'_{i,j-1}), taking the final state r_i = h'_{iL} as the representation of the rule. To compute the final attention vector α_i^t, we use a two layer feed-forward network, equation 9, where σ is the sigmoid function. We also experiment with a softmax formulation of the attention vector α_i^t after the two layer feed-forward network.

u_{ij}^t = GRU(C_{ij}, u_{i(j-1)}^t)   (10)
s_{t+1} = Σ_{i=1}^{R} α_i^t u_{iL}^t   (11)

To apply a rule, we use another recurrent neural network that processes every literal of every rule C_{ij}. The initial hidden state u_{i0}^t = s_t is set to the current state, then for every rule a GRU is used to compute the new hidden states u_{ij}^t, equation 10. Finally, the new state s_{t+1} becomes the weighted sum of the final hidden states u_{iL}^t, equation 11. We call the inner GRU the unifier, as it needs to learn unification between variables and constants as well as how each rule interacts with the current state.

Figure 2: Attention maps produced for query p(a), for IMA with softmax attention performing backward chaining in the left column and IMA with literal + rule embedding performing forward chaining in the right column, on tasks 5 to 7.
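Equations 8 to 11 can be rendered compactly in code. The following is a minimal PyTorch sketch of a single reasoning step with sigmoid attention; it paraphrases the description above and is not the authors' released implementation, whose framework and details may differ.

```python
import torch
import torch.nn as nn

class IMACell(nn.Module):
    """One reasoning step of equations 8-11: attend over rule heads,
    run the unifier GRU over each rule's literals, take a weighted sum."""
    def __init__(self, d):
        super().__init__()
        # Two linear layers with a sigmoid outside, as in equation 9.
        self.att = nn.Sequential(nn.Linear(5 * d, d // 2), nn.Linear(d // 2, 1))
        self.unifier = nn.GRUCell(d, d)  # equation 10

    def forward(self, C, q, s):
        # C: (R, L, d) literal embeddings, q: (d,) query, s: (d,) state
        R, L, _ = C.shape
        r = C[:, 0, :]  # rule heads r_i = C_i0
        feats = torch.cat([s.expand(R, -1), q.expand(R, -1), r,
                           (s - r) ** 2, s * r], dim=1)  # equation 8
        alpha = torch.sigmoid(self.att(feats))           # (R, 1), equation 9
        u = s.expand(R, -1)                              # u_i0 = s_t
        for j in range(L):                               # equation 10 over literals
            u = self.unifier(C[:, j, :], u)
        return (alpha * u).sum(dim=0)                    # equation 11: new state

cell = IMACell(d=64)
s_next = cell(torch.randn(4, 3, 64), torch.randn(64), torch.randn(64))  # R=4, L=3
```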
Experiments

We carry out experiments on individual tasks with variations on our model. As a baseline, we use an LSTM (Hochreiter and Schmidhuber 1997) to process the query and then the context to predict the answer. We also compare our model against the Dynamic Memory Network (DMN) (Kumar et al. 2016) and the Memory Attention Control (MAC) network (Hudson and Manning 2018), which both incorporate iterative components achieving state-of-the-art results on visual question answering datasets. With DMN and MAC, the context is separated into rules and the entire rule is embedded using a GRU at the character level. Unlike DMN, which uses another GRU to accumulate information over the rule embeddings and the current state, our variant IMA processes literal embeddings using a GRU to compute the new state as a weighted sum over each rule.

Training: We use two training regimes: curriculum learning (Bengio et al. 2009) and multi-task. With curriculum learning the model is trained in an incremental fashion, starting with tasks 1 and 2 with only 1 iteration. Tasks then accumulate with increasing numbers of iterations: tasks 3, 7, 9 and 12 with 2 iterations, and tasks 4, 6, 8 and 11 with 3 iterations. We determine the minimum number of iterations required for a task based on Prolog (Clocksin and Mellish 2003). Finally all tasks are introduced with a maximum of 4 iterations. The multi-task approach trains on the entire dataset with 4 iterations fixed in advance. Models are trained via back-propagation using Adam (Kingma and Ba 2014) for 120 epochs (10 per task) with a mini-batch size of 32 and a training dataset size of 20k logic programs per task. We ensure a mini-batch contains at least 1 sample from each training task to avoid any bias towards a task in a given mini-batch. Logic programs are shuffled after each epoch, and rules within a context are also shuffled with every mini-batch. Since we have access to the data generating distribution, we do not use any regularisation in any of the models and have increased the training dataset size accordingly to avoid overfitting (Goodfellow, Bengio, and Courville 2017).

We generate 4 test sets of increasing difficulty: validation, easy, medium and hard, which have up to 2, 4, 8 and 12 characters for predicates and constants, as well as an increasing number of irrelevant rules, respectively. Each test set consists of 10k generated logic programs per task, and results for the best single training run out of 3 for each model with state size d = 64 on the easy set are shown in Table 3.

Table 3: Results on the easy test set (10k each), d = 64, sm = softmax, σ = sigmoid.
Training      |      |       Multi-task              |       Curriculum
Model         | LSTM | MAC  DMN  IMA  IMA  IMA       | MAC  DMN  IMA  IMA  IMA
Embedding     | -    | rule rule lit  lit  lit+rule  | rule rule lit  lit  lit+rule
Attention     | -    | sm   σ    σ    sm   sm        | sm   σ    σ    sm   sm
Facts         | 0.61 | 0.84 1.00 1.00 1.00 0.98      | 0.89 1.00 1.00 0.99 0.94
Unification   | 0.53 | 0.86 0.87 0.90 0.87 0.85      | 0.83 0.85 0.88 0.88 0.86
1 Step        | 0.57 | 0.90 0.74 0.98 0.94 0.95      | 0.77 0.62 0.96 0.93 0.92
2 Steps       | 0.56 | 0.81 0.67 0.95 0.95 0.94      | 0.70 0.58 0.95 0.91 0.89
3 Steps       | 0.57 | 0.78 0.77 0.94 0.94 0.94      | 0.64 0.64 0.93 0.86 0.87
AND           | 0.65 | 0.84 0.80 0.95 0.94 0.85      | 0.81 0.70 0.80 0.78 0.83
OR            | 0.62 | 0.85 0.87 0.97 0.96 0.93      | 0.75 0.75 0.96 0.93 0.90
Transitivity  | 0.50 | 0.50 0.50 0.50 0.52 0.52      | 0.50 0.50 0.50 0.50 0.50
1 Step NBF    | 0.58 | 0.92 0.79 0.98 0.94 0.95      | 0.65 0.58 0.96 0.91 0.92
2 Steps NBF   | 0.57 | 0.83 0.85 0.96 0.93 0.95      | 0.57 0.73 0.95 0.90 0.90
AND NBF       | 0.55 | 0.82 0.84 0.92 0.93 0.85      | 0.61 0.61 0.71 0.77 0.83
OR NBF        | 0.53 | 0.74 0.75 0.86 0.86 0.86      | 0.59 0.63 0.86 0.83 0.84
Easy Mean     | 0.57 | 0.81 0.79 0.91 0.90 0.88      | 0.69 0.68 0.87 0.85 0.85
Medium Mean   | 0.52 | 0.70 0.70 0.86 0.81 0.79      | 0.60 0.61 0.81 0.76 0.74
Hard Mean     | 0.51 | 0.63 0.66 0.83 0.75 0.72      | 0.55 0.58 0.76 0.70 0.68

We observe that all iterative models perform better than the baseline except for task 8, transitivity, which all models fail at. We speculate the reason is that the models have not seen existential variable binding for more than 2 character constants and fail to generalise in this particular case. We note that the curriculum training regime has no benefit for any model, most likely because we introduce new, unseen tasks with each iteration such that models with prior training have no advantage; for example, solving OR in 2 iterations does not improve the solution for AND in 3 iterations. Embedding literals seems to provide an advantage over embedding rules, since all IMA models outperform both DMN and MAC when we consider the mean accuracy over every test set. We postulate that literal embeddings give a finer view and allow for better generalisation over increasing lengths, as embedding rules with literals (lit+rule) also degrades the performance. Although our variant IMA performs the best on all the test sets, all models quickly degrade in mean accuracy as the difficulty increases. We speculate that a finite sized state vector stores limited information about unseen unique sequences of increasing lengths, and analyse this behaviour further in the next section.

Figure 2 portrays the attention maps produced by IMA with softmax attention. Although by default models converge to backward chaining, by (i) reversing the direction of the unifier GRU, equation 10, and (ii) skipping task 2, we can create a bias towards ground facts that have a matching constant. This approach encourages our model IMA with rule embeddings (lit+rule) to converge to forward chaining (Russell and Norvig 2016), albeit with more training time (thus results for forward chaining are not included in Table 3). This observation emphasises the fact that a fixed network architecture can be flexible enough to learn two different solutions for the same domain of problems given the right incentives.
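For clarity, the curriculum schedule in the Training paragraph above can be written out explicitly; the stage representation below is an illustrative rendering of the schedule as stated, not code from the released implementation.

```python
# Curriculum stages as described above: (tasks trained on, max iterations).
# Task numbering follows Tables 1 and 2; tasks accumulate across stages.
CURRICULUM = [
    ({1, 2}, 1),
    ({1, 2, 3, 7, 9, 12}, 2),
    ({1, 2, 3, 4, 6, 7, 8, 9, 11, 12}, 3),
    (set(range(1, 13)), 4),  # finally all 12 tasks with up to 4 iterations
]

for stage, (tasks, iters) in enumerate(CURRICULUM, 1):
    print(f"stage {stage}: train tasks {sorted(tasks)} with {iters} iteration(s)")
```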
Analysis

Since models are trained in an end-to-end fashion, the representations for logical constructs that help perform reasoning must also be learned. In this section, we scrutinise the learnt representations of the embedding GRU, equation 7, for IMA and the rule embeddings of DMN.

An important assumption made in the dataset is that every predicate and constant combination is unique. We expect this setup to create a formation that would allow every literal to be differentiated. As such, we observe a lattice like structure when the embeddings of single and double character atoms are visualised in 3 dimensions using principal component analysis (PCA), Figure 3. The first two components select the predicate and the final one the constant, uniquely identifying the literal with a clear distinction between single and double character predicates. This arrangement is in contrast with distributional representations that exploit similarities between entities, such as natural language embeddings (Mikolov et al. 2013). In our case, p(a) might be more similar to p(b) than to pp(a), although all are deemed unique.

Figure 3: First, second and last principal components of embeddings of single character literals form a lattice like structure with predicates clustered in vertical columns and constants on horizontal surfaces; from IMAsm.
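The projection behind Figure 3 can be reproduced along the following lines, assuming a trained literal-embedding function embed standing in for the GRU of equation 7; the helper names are hypothetical.

```python
from itertools import product
from string import ascii_lowercase

import numpy as np
from sklearn.decomposition import PCA

def project_literals(embed, dims=(0, 1, -1)):
    """Embed all single-character unary literals p(c) and project them
    onto the first, second and last principal components, as in Figure 3.
    `embed` maps a literal string to a d-dimensional vector (assumed)."""
    literals = [f"{p}({c})" for p, c in product(ascii_lowercase, repeat=2)]
    X = np.stack([embed(lit) for lit in literals])
    comps = PCA(n_components=X.shape[1]).fit_transform(X)
    return literals, comps[:, list(dims)]

# Random stand-in embedding, only to make the sketch runnable end to end.
lits, points = project_literals(lambda s: np.random.randn(64))
print(len(lits), points.shape)  # 676 (676, 3)
```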
As the embedding sizes are finite, we also expect a saturation in the embedding space as the length of the literals increases. We capture this phenomenon by repeating the character p 64 times and observe a converging pattern, shown in Figure 4. We note how odd and even length predicates converge to their own respective points, suggesting that the embeddings produced by the GRU, equation 7, learn parity.

Figure 4: Repeating character predicates saturate the embedding and converge to respective points; equidistant lines are plotted in grey; from IMAsm.

If we take structurally different literals, we observe a preference towards whether literals are negated or grounded, then the arity of the atoms, Figure 5a. We believe this clustering captures the literal semantics available in the dataset within 4 major clusters. Furthermore, if we look at the rule embeddings learnt by DMN, we notice a similar clustering based on rule structure, again with groundedness, negation, arity and the number of body literals as the learnt distinguishing features, Figure 5b. The multiple points within a cluster correspond to different predicates following a similar ordering; for example, predicate p is often the upper most point.

Figure 5: Principal component analysis (PCA) of learnt representations. (a) Structurally different literals first cluster by whether they are negated or grounded, then by arity (grey lines added as visual aids); from IMAsm. (b) Rule embeddings form clusters based on their structure, with a distinction between negated and non-negated rules (grey line added as visual aid); from DMN.

To evaluate multi-hop reasoning, we generate logic programs that require an increasing number of steps of deduction. While the training data contains up to 3 steps, we generate up to 32 steps and run the networks for n + 1 iterations. We obtain similar results to other recurrent neural network based systems (Zaremba et al. 2016) and observe a steady decrease in accuracy beyond the training line (grey dashed line), shown in Figure 6, in which imasm and imarsm indicate IMA with softmax attention and with rule embedding respectively. We speculate that with each iteration step the state representation degrades, eventually losing information, since models tend to produce noisy attention maps or state transformations. Our IMA model with softmax attention maintains the highest accuracy, most likely due to the sparsity created by the softmax operation.

Figure 6: When models are iterated beyond the training number of steps (grey line) to perform increasing steps of deduction, the accuracy degrades for all models.

To evaluate generalisation to unseen symbols, we take task 3 and generate random predicate or constant symbols of increasing character lengths up to 64, well beyond the training dataset length of 2. Although we observed that literal embeddings can saturate, models can cope with longer randomly generated predicate or constant symbols, Figure 7a, since looking at only a few characters can determine uniqueness. This reflects our intuition that looking at portions of sequences might be enough to determine equality, rather than memorising them entirely, e.g. looking at the last few digits of phone numbers to check if they are the same.

In order to understand how the embedding and state dimension d affects the models, we experimented with sizes 32, 48, 64, 96 and 128, running each curriculum training session 3 times for our IMA model. Figure 7b shows that increasing the dimension size does not contribute to a clear increase in mean accuracy over all tasks, and the drop in accuracy across the easy, medium and hard test sets follows a similar pattern for every dimension. Despite the initial increase beyond d = 32, we get the highest upper bound in accuracy with d = 64, for which the individual results are in Table 3.

We designed the tasks to be of increasing difficulty, building on previously seen problems such as unification before deduction. We expected models trained using an iterative scheme to generalise better as the network depth increased gradually. However, when we average the mean accuracy for all dimensions, for all 3 runs of IMA with softmax attention across the test sets, we do not discern any advantage for curriculum training over multi-task training beyond the validation set. Figure 7c shows a similar decrease in performance across the test sets for both training regimes. We believe this result stems from introducing new tasks with each iteration and models not having any incentive to abstract subtasks until the next set of unseen tasks is incorporated into the training dataset.

Figure 7: Accuracy plots over increasing symbol lengths, state dimension and training regime. (a) The models can cope, in particular IMA with literal embeddings, when predicate and constant symbols of increasing length are randomly generated. (b) Mean accuracy over all tasks against increasing embedding dimension d shows no clear increase beyond d = 64. (c) Mean accuracy of training regimes applied to IMAsm against test sets across all dimensions shows no advantage of curriculum training for generalisation.

Discussion

Given the simple, unit test like nature of the logic programs generated, the lack of robust results for iterative neural networks further encourages a combination of machine learning and symbolic systems, rather than encompassing one with the other. In this section, we try to provide additional discussion and insight into why one might or might not learn symbolic reasoning from scratch using iterative neural networks.
Whilst the basic logic programs presented in this paper can all be solved using existing symbolic systems such as Prolog (Clocksin and Mellish 2003), we struggle to see a comparable performance from neural networks. Increasing the number of steps in task 3, as shown in Figure 6, demonstrates the fragile nature of using continuous representations for rigid symbolic reasoning. The embedding space is inherently limited in capacity as it has a fixed number of dimensions. Although a 64 dimensional vector can in theory encode a lot of information, in practice neural networks trained using back-propagation do not seem to have any guarantees on how efficiently they will learn representations in a vector space. We speculate that this creates an accumulative noise with each iteration which eventually degrades the performance. On the other hand, the learnt continuous representations scale to very long, previously unseen symbols, Figure 7a, which is a desirable property of neuro-symbolic systems.

Attention mechanisms allow state-of-the-art memory based networks to address and operate an external memory (Weston, Chopra, and Bordes 2015). In the case of neural reasoning networks, the attention components provide the means to select rules. An interesting outcome is that all neural reasoning models try to attend to multiple rules when applicable, for example when there are two matching rules, as shown in Figure 2. Unlike Prolog, this approach allows the neural networks to simultaneously explore multiple possible branches, completing the reasoning task in fewer steps. However, we speculate such an attention mechanism will become the bottleneck when there are large numbers, possibly hundreds, of matching rules, such as in knowledge base completion tasks, since it will try to aggregate all matching rules in a single step. An explicit strong supervision of the attention mechanism, similar to the original DMN work (Kumar et al. 2016), or a hierarchical architecture might be needed to encourage a form of iterative backtracking.
Lack of abstraction between seemingly related tasks limits the performance on unseen logic programs. During curriculum training, we noticed models which solve tasks 1, 2 and 3 have no advantage in solving task 4, because the final state representation from task 3 specifically looks for a ground instance such as p(a) to complete the task. However, in task 4 the model needs to match another rule p(X) :- and iterate once more. When presented with task 4, models most likely have to adjust the intermediate state representation to check for the second rule case; as a result, Figure 7c can also be interpreted as the lack of abstraction over increasingly complex tasks in the dataset, since curriculum learning provides no clear advantage. In other words, the neural models do not seem to learn a reasoning process that is general enough to be compared to existing symbolic systems; consequently, we were unable to run them on existing logic program benchmarks or against Prolog.

Related Work

There have been many attempts to combine symbolic reasoning with machine learning techniques under the name of neural-symbolic systems (Garcez, Broda, and Gabbay 2012). Lifted Relational Neural Networks (Sourek et al. 2015) ground clauses, turning them into propositional programs, when constructing networks as a set of weighted definite clauses. We do not pre-process programs to ground variables, and Neural Reasoning Networks must learn unification. TensorLog (Cohen 2016) constructs "factor graphs" from logic rules which in return create the network that runs on one-hot encoded constants. Our approach does not factor, compile or use any implicit knowledge about first-order logical inference or rule applications.
We also only one-hot encode characters, not entire predicates or constants, and only give labels 1 or 0 as targets, training end-to-end using the same neural network with the same architecture and weights for every normal logic program. Logic Tensor Networks (Serafini and Garcez 2016) tackle knowledge completion "on a simple but representative example", also grounding every term in the program prior to performing reasoning.

Following the above mentioned works, Neural Theorem Provers (Rocktäschel and Riedel 2017) learn distributional representations of predicates and constants by symbolically constructing the relationship between embeddings using an existing symbolic inference engine. We design our tasks such that the neural networks attempt to learn not only representations at a character level but also the reasoning algorithm, with no help or knowledge of existing reasoning engines and methods.

Learning similarities between constants can be used to perform link-prediction tasks (Nickel et al. 2016) and knowledge base completion (Socher et al. 2013), but creates unwanted inferences when similar constants should indeed be unique (Rocktäschel et al. 2014). Although we set every constant to be unique, we expect embeddings of similar constants to cluster during training if the data entails the same conclusions. Creating differentiable logical inference networks can also induce rules (Evans and Grefenstette 2018); however, at this stage we do not learn logical rules alongside reasoning tasks and assume they are given to the model. Possible World Net (Evans et al. 2018) follows a unique approach to learning propositional logic entailment using a semantic, possible-worlds interpretation; however, they exploit the syntax of logical formulae by parsing them and constructing the neural network in a tree like manner in which nodes correspond to logical operators. The works mentioned so far are designed for either deductive databases, relation learning, link prediction, knowledge base completion or propositional programs; thus our task of learning reasoning over, and embeddings of, normal logic programs using a fixed RNN-based iterative network is inherently different, making a direct empirical comparison unreasonable.

Neural Reasoning Networks can be seen as interpreters of logic programs, as rules can act like instructions. This perspective reflects systems such as Neural Programmer-Interpreters (Reed and De Freitas 2015) and Neural Symbolic Machines (Liang et al. 2016). These systems contain discrete operations that allow more complex actions to be performed, overcoming the problem of a degrading state representation; however, they require a reinforcement learning setting to train. We believe a reinforcement learning approach applied to our dataset would learn a similar algorithm to that of Prolog (Clocksin and Mellish 2003).

Conclusion

We presented a new synthetic dataset and provided insights into how machine learning might encompass symbolic reasoning, defined as logic programs, using RNN-based iterative neural networks. Fully differentiable models trained end-to-end have inherent disadvantages: they seem to lose track when the number of iterations is increased, and the embedding space is limited in capacity. However, such networks might still hold the key to incorporating symbolic prior knowledge into a continuous space by understanding how that embedding space is organised to store symbolic information.

Since such neural networks provide a differentiable but approximate reasoning engine over logic programs, in the future we hope to induce rules using continuous embeddings of logical rules by propagating gradients back to the context. Initially, however, a more robust neural reasoning network must be learned, one that is comparable in performance to existing logic-based expert systems.

References
Apt, K. R., and Bol, R. N. 1994. Logic programming and negation: A survey. The Journal of Logic Programming 19:9–71.
Apt, K. R., and Van Emden, M. H. 1982. Contributions to the theory of logic programming. Journal of the ACM 29(3):841–862.
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261.
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. ACM.
Besold, T. R.; Garcez, A. d.; Bader, S.; Bowman, H.; Domingos, P.; Hitzler, P.; Kühnberger, K.-U.; Lamb, L. C.; Lowd, D.; Lima, P. M. V.; et al. 2017. Neural-symbolic learning and reasoning: A survey and interpretation. arXiv:1711.03902.
Cho, K.; Van Merriënboer, B.; Bahdanau, D.; and Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259.
Clark, K. L. 1978. Negation as failure. In Logic and Data Bases. Springer. 293–322.
Clocksin, W. F., and Mellish, C. S. 2003. Programming in PROLOG. Springer Science & Business Media.
Cohen, W. W. 2016. TensorLog: A differentiable deductive database. arXiv:1605.06523.
Ding, L. 1995. Neural Prolog - the concepts, construction and mechanism. In Systems, Man and Cybernetics, 1995. Intelligent Systems for the 21st Century, IEEE International Conference on, volume 4, 3603–3608. IEEE.
Evans, R., and Grefenstette, E. 2018. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61:1–64.
Evans, R.; Saxton, D.; Amos, D.; Kohli, P.; and Grefenstette, E. 2018. Can neural networks understand logical entailment? arXiv:1802.08535.
Garcez, A. S. d.; Broda, K. B.; and Gabbay, D. M. 2012. Neural-Symbolic Learning Systems: Foundations and Applications. Springer Science & Business Media.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2017. Deep Learning. The MIT Press.
Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S. G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. 2016. Hybrid computing using a neural network with dynamic external memory. Nature 538(7626):471.
Graves, A.; Wayne, G.; and Danihelka, I. 2014. Neural Turing machines. arXiv:1410.5401.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Hudson, D. A., and Manning, C. D. 2018. Compositional attention networks for machine reasoning. arXiv:1803.03067.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Komendantskaya, E. 2011. Unification neural networks: unification by error-correction learning. Logic Journal of the IGPL 19(6):821–847.
Kumar, A.; Irsoy, O.; Ondruska, P.; Iyyer, M.; Bradbury, J.; Gulrajani, I.; Zhong, V.; Paulus, R.; and Socher, R. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML, 1378–1387.
Liang, C.; Berant, J.; Le, Q.; Forbus, K. D.; and Lao, N. 2016. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. arXiv:1611.00020.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, 3111–3119.
Muggleton, S., and De Raedt, L. 1994. Inductive logic programming: Theory and methods. The Journal of Logic Programming 19:629–679.
Nickel, M.; Rosasco, L.; Poggio, T. A.; et al. 2016. Holographic embeddings of knowledge graphs. In AAAI, 1955–1961.
Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In EMNLP, 1532–1543.
Reed, S., and De Freitas, N. 2015. Neural programmer-interpreters. arXiv:1511.06279.
Rocktäschel, T., and Riedel, S. 2017. End-to-end differentiable proving. In NIPS, 3791–3803.
Rocktäschel, T.; Bošnjak, M.; Singh, S.; and Riedel, S. 2014. Low-dimensional embeddings of logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, 45–49.
Russell, S., and Norvig, P. 2016. Artificial Intelligence: A Modern Approach (3rd Edition). Pearson.
Santoro, A.; Raposo, D.; Barrett, D. G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; and Lillicrap, T. 2017. A simple neural network module for relational reasoning. In NIPS, 4974–4983.
Serafini, L., and Garcez, A. d. 2016. Logic tensor networks: Deep learning and logical reasoning from data and knowledge. arXiv:1606.04422.
Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 926–934.
Sourek, G.; Aschenbrenner, V.; Zelezny, F.; and Kuzelka, O. 2015. Lifted relational neural networks. arXiv:1508.05128.
Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In NIPS, 2440–2448.
Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv:1503.00075.
Weston, J.; Bordes, A.; Chopra, S.; Rush, A. M.; van Merriënboer, B.; Joulin, A.; and Mikolov, T. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv:1502.05698.
Weston, J.; Chopra, S.; and Bordes, A. 2015. Memory networks. In ICLR.
Xiong, C.; Merity, S.; and Socher, R. 2016. Dynamic memory networks for visual and textual question answering. In ICML, 2397–2406.
Zaremba, W.; Mikolov, T.; Joulin, A.; and Fergus, R. 2016. Learning simple algorithms from examples. In ICML, 421–429.