1. Introduction

Learning Proof Path Selection Policies in Neural Theorem Proving

Matthew Morris

matthewthemorris@gmail.com 0

Pasquale Minervini

p.minervini@cs.ucl.ac.uk 2

Phil Blunsom

phil.blunsom@cs.ox.ac.uk 0 1 0 Computer Science Department, University of Oxford 1 DeepMind , London 2 UCL Centre for Artificial Intelligence, University College London

Neural Theorem Provers (NTPs) are neural relaxations of the backward-chaining logic reasoning algorithm. They can learn continuous representations for predicates and constants, induce interpretable rules, can provide logic explanations for their predictions, and show strong systematic generalisation properties. However, since they enumerate all possible proof paths for proving a goal, they sufer from high computational complexity, and are thus unsuitable for complex reasoning tasks. Conditional Theorem Provers (CTPs) try to overcome this issue by generating relevant rules on-the-fly based on the goal, rather than considering all possible rules. Nonetheless, CTPs sufer from similar computational constraints, as they still have to consider multiple proof paths while reasoning. We propose Adaptive CTPs (ACTPs), where CTPs are augmented with a learned policy to dynamically select the most promising proof paths. This allows the model designer to specify the number of proof paths to consider, to conform to the computational constraints of their use case, while retaining all of the benefits of CTPs. By evaluating on the CLUTRR dataset, we provide evidence for the computational issues in existing CTP models, show that ACTPs alleviate these issues, and demonstrate that, in certain scenarios, the accuracy achieved by ACTPs is higher than CTPs while retaining the same computational complexity.

reasoning neuro-symbolic adaptive computation reinforcement learning

1. Introduction

Promising work has been done around integrating neural models and symbolic reasoning, as their complementary strengths and weaknesses make for powerful models when combined [ 1, 2, 3, 4, 5 ]. In this paper, we consider such a technique: Neural Theorem Provers [ 6 ].

The approach of Rocktäschel and Riedel [ 6 ] is to keep variable binding symbolic, but compare predicates and constants using their sub-symbolic representations. They introduce Neural Theorem Provers (NTPs): end-to-end diferentiable provers for theorems formulated as queries to a knowledge base (KB). Prolog’s backward chaining algorithm [ 7 ] is used as a blueprint for constructing neural networks in a recursive manner, (P. Blunsom) which can prove queries to a KB using vector representations of symbols. These proofs are given success scores, which are diferentiable with respect to the sub-symbolic representations, allowing the model to learn representations that maximise the proof scores. Using the same process, rules of pre-defined structures are also learnt.

NTPs can learn representations of symbols in a KB like neural link prediction models and learn rules which hold in the KB. They also allow one to incorporate already known rules into the reasoning process, as one simply needs to include them in the knowledge base. NTPs are also naturally interpretable, since they induce sub-symbolic rules that can be decoded to human-readable symbolic rules. Finally, Minervini et al. [ 8 ] demonstrate that NTPs have the ability to perform systematic generalisation, learning how to solve complex reasoning tasks while only being trained on simpler examples. In contrast, many neural models appear not to generalise robustly on tasks requiring systematic generalisation [ 9, 10, 11, 12 ].

However, NTPs have a significant computational footprint, as they consider all possible rules for proving a goal or sub-goal. This means they cannot scale to settings with a large number or rules or reasoning steps. To solve this problem, Minervini et al. [ 8 ] propose Conditional Theorem Provers (CTPs), an extension of NTPs that learn to dynamically generate a set of rules for proving the current goal or sub-goal. This is implemented by a select module which, given a goal, returns multiple rules that can be used to prove that goal. It consists of multiple neural networks, which we refer to as reformulators, each of which can represent several rules in a knowledge base. This module is end-to-end diferentiable, and can be trained end-to-end via backpropagation.

However, CTPs can end up sufering from computational issues in a similar fashion to NTPs, since they still need to consider multiple proof paths during the reasoning process. For complex datasets in which there are many ways to prove a given goal, more proof paths need to be checked. This, in conjunction with the high reasoning depth often required for such datasets, causes CTPs to become infeasibly slow. This is especially problematic when CTPs are applied in settings where the inference time is critical.

Objectives In this paper, we aim to address the computational shortcomings of CTPs by extending them to ACTPs (Adaptive CTPs). Specifically, we augment CTPs with a policy trained to select the proof paths that are most likely to succeed. This allows one to control the amount of exploration in the space of proof paths for a given goal, depending on the computational requirements of the task. An alternative to addressing the computational issues of CTPs is to simply reduce the number of reformulators trained, leading to fewer proof paths being considered. However, this makes the model less expressive, meaning it will likely be unable to capture all of the rules in a knowledge base. Thus, for ACTPs to be useful, an ACTP model expanding only proof paths should achieve higher accuracy than a CTP model with only reformulators. In the following, we 1) motivate for CTP models sometimes requiring a large number of reformulators and reasoning depth; 2) concretely establish the computational issues that CTPs sufer from, using both empirical results and a theoretical analysis; 3) define a framework for ACTPs using policy gradient descent; 4) empirically demonstrate on the CLUTRR dataset that ACTPs are an improvement upon CTPs, in both their respective accuracies and evaluation times.

2. Neural Theorem Proving

Backward Chaining Prolog [ 7 ] is a logic programming language that finds use in contemporary work. It has been used for a variety of tasks, including automated theorem proving [ 13 ], expert systems [ 14 ], and natural language processing [ 15 ]. A Prolog KB consists of rules and facts. Queries are passed to the KB, with Prolog returning whether or not the queries are entailed by the KB. The restrictive syntax of Prolog allows one to answer queries using Prolog’s backward chaining algorithm [ 7 ]. Given a goal, such as sister(Joshua, Cindy), which is constructed from a query, Prolog tries to find substitutions for the goal by using the rules in the KB. The process of checking if the head of a rule matches a goal is called unification . If unification succeeds, then the goal is replaced with the atoms from the body of the rule, giving a new set of sub-goals. The same process is then applied to each of these sub-goals, continuing recursively until all sub-goals are found as facts in the KB or there are no more rules to apply. Neural Theorem Provers Neural Theorem Provers (NTPs), proposed by Rocktäschel and Riedel [ 6 ], are a continuous relaxation of the backward chaining algorithm. NTPs can be trained end-to-end, by calculating the gradient of proof successes with respect to vector representations of symbols, and are defined in terms of modules [ 16 ]. The recursive expansion upon the goal is kept track of in proof states, which each contain a neural network that outputs the success score of the proof so far, and the substitution set. The network is recursively built upon, with new nodes being added as rules are applied. Every diferent proof path will have a diferent associated proof state. However, as noted by Rocktäschel and Riedel [ 6 ], NTPs sufer from severe computational limitations. In standard backward chaining, a proof path can be aborted when unification with a rule fails, but this happens far less in neural backward chaining, since unification only fails when predicates do not have matching arities. Given a goal such as sister(Joshua, Cindy), the prover should only consider rules such as the first one below, and not the second. Conditional Theorem Provers To address the computational limitations of NTPs, Minervini et al. [ 8 ] propose introducing a new module into the system, select, to reduce the number of rules being considered when expanding upon a goal. In the or module, instead of considering every rule ← ∈ in the KB , they only consider each ← ∈ select () . This select module can be implemented by a sequence of diferentiable parameterized functions select1(), select2(), ..., select () that each, given a goal, produces a sequence of subgoals. We refer to each of these functions as a goal reformulation module, or simply a reformulator.

In summary, a CTP is composed of reformulators, each of which can represent multiple rules, and it learns representations for predicates and constants. All model parameters are initialised randomly, and trained end-to-end via backpropagation. During inference, a CTP model starts with a goal ( 1, 2), and initialises = {( 1, 2)} as the set of sub-goals. Then, recursively up to the given reasoning depth , the model applies each reformulator to each of the sub-goals in , generating a new set of sub-goals from the output of the reformulators with each recursive step. At each recursive step, the model produces a new proof path for each reformulator. The model maximises scores over proof paths. At every depth up to the reasoning depth , the model unifies every sub-goal from the proof path with the knowledge base of facts, given in the dataset task. These similarity values are propagated up, with the score of the atom in the head of a reformulator being set to the minimum similarity of all the atoms in its body. Evaluation Dataset CLUTRR [ 12 ] is a system for constructing artificial datasets modelling family relationships. Given a set of family relations, the task is to infer the relationship between two family members whose relationship is not explicit in the set. To solve this, an agent ought to be able to induce the logical rules that govern family relationships, and use those rules to infer the relationship of the query members from the given relations. In particular, CLUTRR allows for testing an agent’s ability to perform systematic generalization [ 17, 12, 18 ]. Sinha et al. [ 12 ] published several generated dataset groups alongside the CLUTRR system, which we make use of for training and evaluation in this paper. We refer to the training dataset as 1.2,1.3,1.4_train and the testing datasets as 1.2_test, 1.3_test, ..., 1.10_test.

3. Expressivity and Complexity

The expressivity of a relational learning model is an important consideration for its viability [ 19, 20 ]. A more expressive model can capture more types of data and relations than a less expressive one, meaning it can be applied to more complex datasets. Due to the restrictive nature of the syntax of Prolog, CTPs are already quite limited when it comes to the structure of expressions upon which they can reason. However, these limitations go further, as the rules that a CTP model can capture depend on the structure and number of reformulators in the model. Number of Reformulators A single reformulator can capture any number of rules, provided that for each rule, the positions of the variables in the rule correspond to the positions of their representations in the reformulator. In addition, given an atom for the head of a rule, a reformulator can only capture one rule with the given head. For example, a single reformulator would be unable to capture both of the following rules, even though the positions of the variables are the same in both: This is because a reformulator is a function that maps from ∈ ℝ × (ℝ ∪ ) × (ℝ ∪ ) , meaning that each has to map to a unique output. As such, to fully capture all of the rules in a knowledge base, we need as many reformulators as there are the maximum number of rules with the same head atom. We refer to this number as the minimal full expressivity bound. For the CLUTRR datasets, this number is 5. To see the efect of increasing the number of reformulators, we fix the train and test reasoning depths, and evaluate the average/maximum test accuracy of CTPs on on 1.4_test and 1.10_test, using a number of reformulators varying from 1 to 8. The results are shown in Table 1.

Reasoning Depth CTPs reason up to a pre-defined reasoning depth and then unify the existing sub-goals with the facts in the knowledge base. This means that, even if the CTP model has perfectly learned to represent all of the rules in the knowledge base with reformulators, it still needs a suficient reasoning depth to prove the goal. The required reasoning depth for a valid proof path is the minimum number of recursive steps down, such that every sub-goal appears in the knowledge base of facts. The reasoning depth needed to solve a task is thus the minimum required reasoning depth across all possible proof paths. For the most complex CLUTRR instance, this number is at most 5. In Table 2, we demonstrate the efect of using diferent test reasoning depths.

Computational Issues While we have demonstrated that it is rarely harmful and often beneficial to increase the reasoning depth and number of reformulators in terms of predictive accuracy, doing so can lead to higher computational costs. In Appendix A, we provide a theoretical analysis of the time complexity of CTPs. We consider two base operations that we wish to count: the number of reformulator applications (data being passed through a neural network) and the number of sub-goals that need unifying with the knowledge base after the reasoning depth is reached (comparisons with all fact embeddings using a RBF kernel). The remainder of the operations in CTPs are either tied into one of these two, or take constant time. We count these operations with respect to the number of reformulators used , the reasoning depth , and the maximum number of atoms in the body of any reformulator used . We find that the time complexity of CTPs as a whole is (() ). We illustrate the issues resulting from this empirically in Appendix B.

4. Optimization via REINFORCE

Outline of Solution Since the time complexity of CTPs is (() ), optimizing CTP evaluation time consists of trying to keep each of these variables as low as possible. The maximum number of atoms in the body of any reformulator used is almost impossible to reduce, as it is completely determined by the rules the model is trying to capture in the knowledge base. The depth is also challenging to reduce, as a certain depth is required for full expressivity on the test datasets. Our approach is to keep the number of reformulators used at each expansion step as low as possible, minimizing the number of proof paths that need to be considered. Choosing Reformulators Given that the time complexity of CTPs is (() ), having reformulators and only selecting at each expansion step would lead to a time complexity of (() ) instead. Ideally, we would just be able to use = 1 and learn to select the perfect reformulator at every expansion step. However, initial experiments demonstrated this to be an almost impossible task since, for each goal and sub-goal in CLUTRR, we do not know in advance which proof path will allow us to prove it. Rather, we use an hyperparameter ∈ ℕ , tuned for maximising accuracy and satisfying the computational constraints of the test datasets. This means that instead of reformulators are selected and used at every expansion step in the reasoning. We refer to the module making these selection decisions as the selection module. Implementation We adopt REINFORCE [21] to train the reformulation selection module. Let us first define the corresponding deterministic Markov decision process (, , , ) :

States. A state in the model represents which sub-goal we are currently considering for expansion in the proof. It is thus an atom where ∈ ℝ × (ℝ ∪ ) × (ℝ ∪ ) . In order that the policy estimator may be implemented by a neural network, we define ∶= ℝ × ℝ × ℝ = ℝ3 . We fixed the value of a variable to be {0 } .

Actions. The set of possible actions from any state is the reformulators that could be used to expand upon the sub-goal. With reformulators, we thus have = {1, ..., } . Note that as we are choosing reformulators for expansion, multiple actions are chosen for a given state by sampling without repetition from the probability distribution () .

Transition Function. After some subset of reformulators is chosen, the CTP model proceeds as normal, expanding out into a diferent proof path for each reformulator. Thus, we have a deterministic transition function defined by this process.

Rewards. The proof score is an obvious choice for reward signal, as higher proof scores correspond to the model performing better on positive tasks. However, rather than discounting future rewards, we can use the fact that proof scores are propagated back up through the CTP model to have access to the exact score that choosing a reformulator leads to. The reward given for choosing a reformulator is thus the maximum proof score across all proof paths originating from the reformulator applied to the current sub-goal.

The policy network is implemented by a neural network with a single hidden layer, containing 30 hidden nodes. We use a Rectified Linear Unit (ReLU) activation function before the hidden layer: ReLU() ∶= max(0, ) . The output of the policy network is thus

() ∶= softmax ( 2 × ReLU( 1 × )) while the loss is given by () = − 1 ∑=1 × ln( ( ) ), where is the batch size, is the action chosen in a particular task in the batch, is the proof score that resulted from the action, and ( ) is the probability that action has in the distribution ( ). The loss is applied separately for each of the actions (reformulators) chosen. Rather than having episodes, we simply execute the CTP model as usual, applying the policy and calculating the loss at every expansion step in the reasoning. We refer to this model as ACTP and the original baseline CTP model as CTP.

ACTP Speedup

We see that if ACTPs are used, there is at least a theoretical speedup from the baseline of CTPs. Ideally this would translate into a speedup in wall-clock time as well, but this is not guaranteed for all datasets and values of and . The larger the values of and , the greater the efect of choosing from reformulators will have; this follows directly from the computational complexity (()

). Furthermore, larger and more complex datasets will also see this efect being more pronounced, as they contain more facts that need to be unified with the (()

) sub-goals once the test depth is reached.

Counteracting this is the overhead that comes from having to do the selection at each reasoning step, instead of just applying every reformulator. If the overhead is high enough, then better wall-clock times might not present themselves for the evaluations we do on CLUTRR. However, even if this is not the case, we argue that the theoretical speedup of this method proves its usefulness regardless: eventually the dataset complexity and reasoning depths will be high enough that the resulting theoretical speedup overcomes the overhead that comes with the method. Serious computational concerns for CTPs will occur more often for complex datasets and high reasoning depths, which is exactly when the theoretical speedup becomes an advantage.

5. Related Work

Other works have already explored models that learn to traverse a knowledge base. Das et al. [22], Xiong et al. [23] use reinforcement learning to learn inference paths in large knowledge bases. Both of these works are based upon the path ranking algorithm [24], which uses random walks with restarts to perform several upper-bounded depth-first searches to find paths along relations. When combined with elastic-net based learning, the algorithm can learn to choose paths which are more likely to complete the inference.

Das et al. [22] propose MINERVA, a method for searching a knowledge graph for answerproviding paths using reinforcement learning, conditioned on the query. Given a knowledge graph, it attempts to learn a policy which, given a query of the form predicate(c, ) , starts from c and walks over relations (edges in the knowledge graph), choosing a relation at each step, conditioned upon the query predicate and the walk so far. This is done with reinforcement learning by trying to maximize the reward: reaching the correct answer constant. Xiong et al. [23] adopt a similar but slightly simpler approach with DeepPath, which also uses reinforcement learning to find paths between pairs of constants. However, in contrast to Das et al. [22], they also condition upon the answer constant while traversing the graph.

While our proposed method is not exactly what MINERVA and DeepPath do, their existence and success at least indicate that the problem of learning which reasoning steps to take in a knowledge base is a solvable one. Our work also runs parallel to that of Asgharbeygi et al. [25], Crouse et al. [26], both of which use reinforcement learning as a search heuristic to optimise reasoning, albeit in diferent settings. Finally, we also draw inspiration for our method from the work of Li et al. [27], who provide motivation for training large transformer models and then heavily compressing them before testing. In a similar manner, our method aims to utilize a large number of reformulators when training, and then only use the selected ones during evaluation.

6. Experiments

Experiment Design We adopt the the hyperparameters in Minervini et al. [ 8 ] to train our CTP models. We adopt the procedure of first training the reformulators, and then training the selection module. This means the selection module is learning to select proof paths over actual rules in the knowledge base, and we can independently control how long the reformulators and selection module are trained for. For optimizing hyperparameters, we adopt the same approach to evaluation and test sets as Minervini et al. [ 8 ]. We perform two diferent optimizations: the first has hyperparameters tuned on an evaluation set of 1.3_test and is tested on all other datasets, and the second is tuned on an evaluation set of 1.9_test.

Reformulator Strength Since all reformulators have diferent random initializations, they all converge to diferent local optima. Thus, it is clear that some reformulators will capture more rules than others, and that some will learn a particular rule better than the others. Moreover, certain subsets of reformulators are likely to contribute more to proofs than others, with subsets of the reformulators that better cover the range of rules in the knowledge base giving higher accuracies when used for evaluation. We demonstrate this empirically by training 5 and 8 reformulators respectively, across 10 diferent seeds and then using 4 random subsets of 3 reformulators each for evaluation. The average accuracies of the highest and lowest scoring subsets are 17% and 51% for 5 reformulators, and 11% and 31% for 8 reformulators.

We see that there is a large discrepancy between the performance of diferent subsets, indicating that when it comes to maximizing accuracy during evaluation, there are reformulators whose inclusion in the model is far more important.We also present the following hypothesis: as the number of reformulators increases, individual reformulators become weaker. More formally: as we use more reformulators during training, the expected accuracy when using a fixed-size subset of the reformulators during evaluation decreases. This is an important hypothesis to note and prove, as it means that the task of selecting the best reformulators for a proof becomes harder as the total number of reformulators increases. This hypothesis is supported empirically by the results in Table 3: we trained CTP models with 5 and 8 reformulators respectively, reporting the average accuracy when 4 random subsets of 3 reformulators were used for evaluation. Adaptive CTP Evaluation Let be the number of reformulators trained and the number of proof paths expanded during evaluation. We chose to evaluate ACTPs with ∈ {5, 8} to measure their efectiveness when more (and individually weaker) reformulators are used and with ∈ {2, 3} to measure the efectiveness of ACTPs when they are allowed to expand fewer proof paths. We refer to such an ACTP model as ACTP- C . This yielded 4 diferent scenarios for evaluation.

Only when using ∶= 5 and ∶= 3 were ACTPs an improvement upon CTPs, the results of which are shown in Table 4. Further to this, we see that ACTPs perform significantly worse when only 2 reformulators are chosen instead of 3. ACTP models that only choose 2 reformulators are outperformed by the baseline across every dataset. We hypothesise that, for CLUTRR, the task of learning which two reformulators are the most promising is one that is just too dificult for the model to find a solution to.

As shown in Section 6, as the number of reformulators increases, individual reformulators become weaker. Thus, the task of choosing the optimal reformulators for expansion becomes more dificult as the number of reformulators increases. This means that, all else being constant, the accuracy of ACTP models will drop as more reformulators are trained. This efect is ofset by the increasing expressivity of the model as more reformulators are used. Hence, as expected, ACTPs consistently perform worse when more reformulators are trained. This is confirmed by our experimental findings, where ACTP-5C models outperform ACTP-8C models in every scenario and across every dataset.

We also note that tuning hyperparameters on 1.9_test instead of 1.3_test causes baseline CTP performance to decrease for the simpler datasets but increase for the more complex datasets. For ACTPs however, this efect appears far less pronounced, with the accuracy of ACTP-5C3 models even dropping slightly for the more complex test datasets, when tuning on 1.9_test instead of 1.3_test. This indicates that ACTPs are not learning the reasoning patterns needed for One-tailed unpaired t-test between the baseline of CTPs with 3 reformulators and ACTPs with 5 choosing 3 reformulators. Hyperparameters tuned on 1.3_test the more complex datasets, even when tuned on such a dataset. It is also possible that ACTPs could be overfitting to the evaluation set when tuned on 1.9_test. However, since the accuracy of ACTPs on 1.9_test does not even increase that much when tuning on the dataset, we find the former explanation to be more likely. As a final point, we note that ACTPs do not appear to exhibit higher instability in their accuracies than CTPs, with the models showing comparable levels of variance.

7. Conclusions

In this paper, we provided motivation for cases in which CTP models would be required to have a large number of reformulators and a high reasoning depth, as well as demonstrating how this leads to computational complexity concerns both theoretically and empirically. We defined a framework for ACTPs as an extension to CTPs, in which reinforcement learning is used to learn to select optimal reformulators for expansion during a proof. This allows the model designer to scale down the number of selected reformulators, such that the computational constraints of the use case may be met. We noted that certain subsets of reformulators perform significantly better than others, and that individual reformulators tend to become weaker as the number of reformulators used in a CTP model increases. This means that the task of selecting reformulators becomes more dificult as the number of reformulators increases. We evaluated ACTPs in 4 separate scenarios, which vary in regard to the number of reformulators trained and the number of reformulators selected. In 1 of these 4 scenarios, we found that ACTPs outperformed CTPs on 8 out of 9 test datasets. The results demonstrate the usefulness of ACTPs over CTPs in certain situations, but also highlight their failing to be a categorical improvement upon CTPs.

Acknowledgments

This work was mostly done during Matthew’s MSc in Computer Science at Oxford. As a result, the following has been written by him. Thanks are first and foremost due to my supervisor, Phil Blunsom. Thank you for introducing me to the topic of this thesis, for sharing so many ideas and papers with me, and for the constant and gentle encouragement. To Pasquale Minervini, the primary creator of CTPs and an outstanding academic at UCL: I convey my most sincere gratitude. I am still amazed that you were willing to spend so long helping me through all my technical and theoretical dificulties. You owed nothing to me, which makes me so much more thankful for all that you did. Thanks are also due to the Oxford Whiteson Research Lab, for allowing me to use their servers for running experiments, and to Luisa in particular, for helping me get everything set up. Finally, thank you to Skye Foundation and The Oppenheimer Memorial Trust for fully funding my degree and expenses while in the UK. statistical relational learning: The relational dependency network case, Machine Learning 86 (2012) 25–56. [20] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple link prediction, in: International Conference on Machine Learning, PMLR, 2016, pp. 2071–2080. [21] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning 8 (1992) 229–256. [22] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, A. McCallum, Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning, Proceedings of the 7th International Conference on Learning Representations (2018). [23] W. Xiong, T. Hoang, W. Y. Wang, Deeppath: A reinforcement learning method for knowledge graph reasoning, Conference on Empirical Methods in Natural Language Processing (EMNLP) (2017). [24] N. Lao, T. Mitchell, W. Cohen, Random walk inference and learning in a large scale knowledge base, in: Proceedings of the 2011 conference on empirical methods in natural language processing, 2011, pp. 529–539. [25] N. Asgharbeygi, N. Nejati, P. Langley, S. Arai, Guiding inference through relational reinforcement learning, in: International Conference on Inductive Logic Programming, Springer, 2005, pp. 20–37. [26] M. Crouse, I. Abdelaziz, B. Makni, S. Whitehead, C. Cornelio, P. Kapanipathi, K. Srinivas, V. Thost, M. Witbrock, A. Fokoue, A deep reinforcement learning approach to first-order logic theorem proving, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 6279–6287. [27] Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, J. E. Gonzalez, Train large, then compress: Rethinking model size for eficient training and inference of transformers, in: Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020, pp. 5958–5968.

A. Computational Complexity of CTPs

We provide a theoretical analysis of the time complexity of CTPs. We consider two base operations that we wish to count: the number of reformulator applications (data being passed through a neural network) and the number of sub-goals that need unifying with the knowledge base after the reasoning depth is reached (comparisons with all fact embeddings using a RBF kernel). The remainder of the operations in CTPs are either tied into one of these two, or take constant time. We count these operations with respect to the number of reformulators used , the reasoning depth , and the maximum number of atoms in the body of any reformulator used . We find that the time complexity of CTPs as a whole is (()

Let be the number of sub-goals after the model has applied recursive expansions upon the goal. With each recursive step down, reformulators are applied to every sub-goal, with each reformulator generating () new sub-goals to be proved. This means () new sub-goals for each existing sub-goal, so +1 = ( × ) . Noting that 0 = 1, representing the query passed to the model, we see that the number of sub-goals after the test reasoning depth has been reached is: = 0 × (() ) = (()

). Furthermore, at depth with sub-goals, there are reformulator applications. Thus, the total number of reformulator applications until the ). reasoning depth has been reached is: = 0 + 1 + 2 + ... + −1 = () + (() 1) + ... + (() −1 ) = (() ) Then since > 1 and > 1 for any CTP model of non-trivial complexity, Thus, as = , we state that the time complexity for CTPs as a whole is (() = (() ). The main ). issue with this time complexity is the raising to the power of . For a fixed depth , the time complexity is a -degree polynomial in and , which can also cause dificulties for a suitably large value of . We have already provided motivation for using a test reasoning depth of 5 on CLUTRR, meaning that this would become degree 5 polynomial.

B. Computational Issues of CTPs

To demonstrate the computational issues of CTPs in practice, we compute the average evaluation time across all CLUTRR datasets, noting that the results are not entirely stable due to them being evaluated and aggregated across a variety of machines. We show this with respect to the number of reformulators in Fig. 1, and with respect to the test reasoning depth in Fig. 2. For reference, the longest evaluation time was for a test reasoning depth of 5, with 5 reformulators, on the 1.10_test dataset. It took 16.7 hours to evaluate.

C. ACTP Speedup Wall-clock Speedup

In Fig. 3, we compare the evaluation times of ACTPs and baseline CTPs, with the baseline operating on 3 and 8 reformulators respectively. As expected, the overhead caused by the copying, masking, and other operations needed in an ACTP model led to it taking significantly longer to evaluate than CTP-3. However, the overhead was low enough for ACTP-8C3 to take less time to evaluate than CTP-8 across all datasets. The efect becomes more pronounced as the complexity of the dataset increases, since the number of facts in the knowledge base to unify the sub-goals with increases, which has a significant efect on the computational complexity of the model.

As the dataset complexity continues to increase. the overhead will become more negligible, leading the evaluation time of ACTP- C models to tend to those of CTP- models. The increasing gap between the evaluation time of ACTP-8C3 and CTP-8 in Fig. 3 is a clear visual illustration of this trend. Datasets always taking longer to evaluate on than others is explained by the size of the datasets. For example, despite 1.6_test being a more complex dataset than 1.5_test, it only contains 104 tasks, compared to the 184 of 1.5_test.

D. Model Hyperparameters

In this appendix, for reproducibility, we provide the hyperparameters used for each of our evaluations.

D.1. Fixed CTP Hyperparameters Name Value D.3. Reasoning Depth D.4. Reformulator Subsets Name Values D.2. Number of Reformulators Name Values

seed subset hops seed hops seed hops 1-30 4 1-3 1-5 1-10 4 1-8 reformulators, each with 2 atoms in the body. For example, 4 reformulators is denoted by: “2 2 2 2” 5 reformulators, each with 2 atoms in the body, denoted by: “2 2 2 2 2” 5, 8 reformulators, each with 2 atoms in the body, denoted by: “2 2 2 2 2” and “2 2 2 2 2 2 2 2” respectively Use reformulators with the following indices for evaluation: { [0 1 2], [2 3 4], [0 3 4], [1 2 4] }

D.5. ACTPs 5 Reformulators Name Values D.6. ACTPs 8 Reformulators

hops seed hops seed 1-30 4

E. Algorithm Pseudocode

Here follows the pseudocode for some of the algorithms used in this paper. 1: or(, ) = { 2: and(_, FAIL) = FAIL 9: Algorithm 1: Backward chaining

In the code, is the knowledge base containing the rules and facts, sets are denoted with curly brackets, lists are denoted with square brackets, an underscore matches any argument, refers to a goal, ̂ to a set of sub-goals, to a substitution set, to the body of a rule, and to the head of a rule. To check if a goal 1 holds true, one needs to get the output of or( 1, []). If the output contains a substitution set, then the query is true, otherwise it will only contain the value FAIL and the query cannot be proven. ′ ∣ ′ ∈ and(, unify( , , ))

for each ← ∈ } ″ ∣ ″ ∈ and(, ̂ ″) ∀ ′ ∈ or(substitute(, ), )} 10: substitute([], _) = [] 11: = unify( ,

unify(ℎ ∶ , ∶ , ) ⎨ ⎪ ⎩ ⎧ ∪ {ℎ/} ⎪ ∪ {/ℎ}

if ℎ ∈ if ∈ , ℎ ∉

if = ℎ otherwise ⎫ ⎪ ⎬ ⎪ ⎭

) substitute( ∶ , ) = { otherwise } Algorithm 2: Neural backward chaining

The code is based on the summary by Minervini et al. [ 8 ]. is a goal, is the reasoning depth, is the head of a rule, is the body, is a knowledge base containing rules and facts, is the RBF kernel, is a proof state, is a substitution set, is a proof score, and is a set of variables.

/* Try use any rule in KB to prove */ def or(, , ) for ← ∈ do

end end end def and(, , ) if = [] or = 0 then yield for ∈ and(, , unify( , , )) yield

do /* Empty body or reasoning depth reached */ else for ′ ∈ or(sub( 0, ), − 1, ) /* Apply substitution to body, then try to prove each atom within */ do for ″ ∈ and( 1∶, , ′) do

yield ″ end end end end def unify( , , = (

, )) ⎧⎪ { / } if ∈ ⎫⎪ = ⎩⎪⎨ { ∅/ } if o th∈e ,rwi se∉ ⎪⎭⎬ ′ = ⋃ return ′ = ( ′, ′) ′ = min{ }⋃ , ∉ { ( , )}} similarity across all representations */

/* Extend the substitution set */ /* Similarity value is the minimum

Algorithm 3: REINFORCE

In this algorithm: represents the model parameters, is the number of episodes, is the number of episodes per batch, is the number of steps in an episode, is the discount factor used to make further away rewards worth less, () gives the probability of action from the distribution produced by applied to , and is the learning rate. for episodes in batch do for ∈ 1, ..., do end ∶= + 1 end Calculate policy loss for entire batch: () ∶= −

Update policy: ∶= + ∇() end 1 ∑=1 ln( () ) Generate episode 0, 0, 0, ..., , , using the policy to output probability distributions which are then sampled from to get actions

Calculate discounted rewards from each state: ∶= ∑= def get_best_model(model, train_set, hyperparameter_grid, evaluation_set) best_model = None best_accuracy = -1 for hyperparameter_set ∈ hyperparameter_grid do train_model(model, train_set) accuracy_value = accuracy(model, evaluation_set) if accuracy_value > best_accuracy then best_accuracy = accuracy_value best_model = model end end return best_model end end def get_best_seeded_model(model, train_set, hyperparameter_grid, evaluation_set) pick 6 seeds = { 0, ..., 5} at random include seeds in hyperparameter_grid return get_best_model(model, train_set, hyperparameter_grid, evaluation_set) def get_model_accuracy(model, train_set, evaluation_set, test_sets) initialize hyperparameter_grid total_accuracy = 0 for 5 iterations do best_model = get_best_seeded_model(model, train_set, hyperparameter_grid, evaluation_set) test_accuracies = accuracy(best_model, test_sets) total_accuracy += test_accuracies end return total_accuracy / 5 end get_model_accuracy(model, train_set, 1.3_test, test_sets) get_model_accuracy(model, train_set, 1.9_test, test_sets) Algorithm 5: ACTP selection module is a batch of states, is a batch of actions, and a batch of rewards. denotes the policy estimator and is the number of reformulators the module ought to select. sample_with_replacement(, , ) draws samples from with replacement, using the probability distribution . def get_actions( ) ∶= e() selected_reformulators = [] reformulator_counts = [] end selected_reformulators.append(indices) update reformulator_counts using indices

/* A batch of probability distributions */ end end

return selected_reformulators, reformulator_counts def apply_reward(, , ) ∶= get_loss(, , ) optimizer.apply_loss( , retain_graph = True)

[1]

A. d.

Garcez ,

T. R.

Besold , L. De Raedt,

Földiak ,

Hitzler ,

Icard , K.-U. Kühnberger,

L. C.

Lamb ,

Miikkulainen ,

D. L.

Silver , Neural-symbolic learning and reasoning: contributions and challenges , in: 2015 AAAI Spring Symposium Series, 2015 .

[2]

Yang ,

W. W.

Cohen , Diferentiable learning of logical rules for knowledge base reasoning , Advances in Neural Information Processing Systems 30 ( 2017 ) 2316 - 2325 .

[3]

Evans ,

Grefenstette , Learning explanatory rules from noisy data , Journal of Artificial Intelligence Research 61 ( 2018 ) 1 - 64 .

[4]

Sadeghian ,

Armandpour ,

Ding ,

D. Z.

Wang , Drum: End-to-end diferentiable rule mining on knowledge graphs , Proc. of NIPS ( 2019 ).

[5]

Minervini ,

Bošnjak ,

Rocktäschel ,

Riedel , E. Grefenstette, Diferentiable reasoning on large knowledge bases and natural language , in: Proceedings of the AAAI Conference on Artificial Intelligence , volume 34 , 2020 , pp. 5182 - 5190 .

[6]

Rocktäschel ,

Riedel , End-to-end diferentiable proving , in: NIPS , 2017 , pp. 3788 -- 3800 .

[7]

Gallaire ,

Minker , Logic and data bases, symposium on logic and data bases, centre d'études et de recherches de toulouse, 1977 , Advances in Data Base Theory ( 1978 ).

[8]

Minervini ,

Riedel ,

Stenetorp , E. Grefenstette,

Rocktäschel , Learning reasoning strategies in end-to-end diferentiable proving , in: ICML, volume 119 of Proceedings of Machine Learning Research, PMLR , 2020 , pp. 6938 - 6949 .

[9]

Johnson ,

Hariharan ,

L. Van Der

Maaten , L. Fei-Fei , C.

Lawrence Zitnick , R.

Girshick , Clevr: A diagnostic dataset for compositional language and elementary visual reasoning , in: Proceedings of the IEEE conference on computer vision and pattern recognition , 2017 , pp. 2901 - 2910 .

[10]

Bahdanau ,

Murty ,

Noukhovitch ,

T. H.

Nguyen , H. de Vries, A. Courville, Systematic generalization: what is required and can it be learned? , ICLR ( 2019 ).

[11]

Lake ,

Baroni , Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks , in: International conference on machine learning, PMLR , 2018 , pp. 2873 - 2882 .

[12]

Sinha ,

Sodhani ,

Dong ,

Pineau ,

W. L.

Hamilton , Clutrr: A diagnostic benchmark for inductive reasoning from text , Empirical Methods of Natural Language Processing (EMNLP) ( 2019 ).

[13]

M. E.

Stickel , A prolog technology theorem prover: Implementation by an extended prolog compiler , Journal of Automated reasoning 4 ( 1988 ) 353 - 380 .

[14]

Merritt , Building expert systems in Prolog , Springer Science & Business Media , 2012 .

[15]

Lally ,

Fodor , Natural language processing with prolog in the ibm watson system, The Association for Logic Programming (ALP) Newsletter 9 ( 2011 ).

[16]

Andreas ,

Rohrbach ,

Darrell ,

Klein , Learning to compose neural networks for question answering , NAACL ( 2016 ).

[17]

Chomsky , Logical structures in language, American Documentation (pre- 1986 ) 8 ( 1957 ) 284 .

[18]

Gontier ,

Sinha ,

Reddy ,

Pal , Measuring systematic generalization in neural proof generation with transformers , NeurIPS'20 ( 2020 ).

[19]

Natarajan ,

Khot ,

Kersting ,

Gutmann ,

Shavlik , Gradient-based boosting for