<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Proof Path Selection Policies in Neural Theorem Proving</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthew Morris</string-name>
          <email>matthewthemorris@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Minervini</string-name>
          <email>p.minervini@cs.ucl.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Phil Blunsom</string-name>
          <email>phil.blunsom@cs.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Oxford</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DeepMind</institution>
          ,
          <addr-line>London</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>UCL Centre for Artificial Intelligence, University College London</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Neural Theorem Provers (NTPs) are neural relaxations of the backward-chaining logic reasoning algorithm. They can learn continuous representations for predicates and constants, induce interpretable rules, can provide logic explanations for their predictions, and show strong systematic generalisation properties. However, since they enumerate all possible proof paths for proving a goal, they sufer from high computational complexity, and are thus unsuitable for complex reasoning tasks. Conditional Theorem Provers (CTPs) try to overcome this issue by generating relevant rules on-the-fly based on the goal, rather than considering all possible rules. Nonetheless, CTPs sufer from similar computational constraints, as they still have to consider multiple proof paths while reasoning. We propose Adaptive CTPs (ACTPs), where CTPs are augmented with a learned policy to dynamically select the most promising proof paths. This allows the model designer to specify the number of proof paths to consider, to conform to the computational constraints of their use case, while retaining all of the benefits of CTPs. By evaluating on the CLUTRR dataset, we provide evidence for the computational issues in existing CTP models, show that ACTPs alleviate these issues, and demonstrate that, in certain scenarios, the accuracy achieved by ACTPs is higher than CTPs while retaining the same computational complexity.</p>
      </abstract>
      <kwd-group>
        <kwd>reasoning</kwd>
        <kwd>neuro-symbolic</kwd>
        <kwd>adaptive computation</kwd>
        <kwd>reinforcement learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Promising work has been done around integrating neural models and symbolic reasoning, as
their complementary strengths and weaknesses make for powerful models when combined
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ]. In this paper, we consider such a technique: Neural Theorem Provers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        The approach of Rocktäschel and Riedel [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is to keep variable
binding symbolic, but compare predicates and constants using their sub-symbolic
representations. They introduce Neural Theorem Provers (NTPs): end-to-end diferentiable provers
for theorems formulated as queries to a knowledge base (KB). Prolog’s backward chaining
algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is used as a blueprint for constructing neural networks in a recursive manner,
(P. Blunsom)
which can prove queries to a KB using vector representations of symbols. These proofs are
given success scores, which are diferentiable with respect to the sub-symbolic representations,
allowing the model to learn representations that maximise the proof scores. Using the same
process, rules of pre-defined structures are also learnt.
      </p>
      <p>
        NTPs can learn representations of symbols in a KB like neural link prediction models and
learn rules which hold in the KB. They also allow one to incorporate already known rules
into the reasoning process, as one simply needs to include them in the knowledge base. NTPs
are also naturally interpretable, since they induce sub-symbolic rules that can be decoded to
human-readable symbolic rules. Finally, Minervini et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] demonstrate that NTPs have the
ability to perform systematic generalisation, learning how to solve complex reasoning tasks
while only being trained on simpler examples. In contrast, many neural models appear not to
generalise robustly on tasks requiring systematic generalisation [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref9">9, 10, 11, 12</xref>
        ].
      </p>
      <p>
        However, NTPs have a significant computational footprint, as they consider all possible rules
for proving a goal or sub-goal. This means they cannot scale to settings with a large number
or rules or reasoning steps. To solve this problem, Minervini et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] propose Conditional
Theorem Provers (CTPs), an extension of NTPs that learn to dynamically generate a set of rules
for proving the current goal or sub-goal. This is implemented by a select module which, given
a goal, returns multiple rules that can be used to prove that goal. It consists of multiple neural
networks, which we refer to as reformulators, each of which can represent several rules in a
knowledge base. This module is end-to-end diferentiable, and can be trained end-to-end via
backpropagation.
      </p>
      <p>However, CTPs can end up sufering from computational issues in a similar fashion to NTPs,
since they still need to consider multiple proof paths during the reasoning process. For complex
datasets in which there are many ways to prove a given goal, more proof paths need to be
checked. This, in conjunction with the high reasoning depth often required for such datasets,
causes CTPs to become infeasibly slow. This is especially problematic when CTPs are applied
in settings where the inference time is critical.</p>
      <p>Objectives In this paper, we aim to address the computational shortcomings of CTPs by
extending them to ACTPs (Adaptive CTPs). Specifically, we augment CTPs with a policy
trained to select the proof paths that are most likely to succeed. This allows one to control
the amount of exploration in the space of proof paths for a given goal, depending on the
computational requirements of the task. An alternative to addressing the computational issues
of CTPs is to simply reduce the number of reformulators trained, leading to fewer proof paths
being considered. However, this makes the model less expressive, meaning it will likely be
unable to capture all of the rules in a knowledge base. Thus, for ACTPs to be useful, an ACTP
model expanding only  proof paths should achieve higher accuracy than a CTP model with
only  reformulators. In the following, we 1) motivate for CTP models sometimes requiring a
large number of reformulators and reasoning depth; 2) concretely establish the computational
issues that CTPs sufer from, using both empirical results and a theoretical analysis; 3) define a
framework for ACTPs using policy gradient descent; 4) empirically demonstrate on the CLUTRR
dataset that ACTPs are an improvement upon CTPs, in both their respective accuracies and
evaluation times.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Neural Theorem Proving</title>
      <p>
        Backward Chaining Prolog [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a logic programming language that finds use in
contemporary work. It has been used for a variety of tasks, including automated theorem proving
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], expert systems [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and natural language processing [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A Prolog KB consists of rules
and facts. Queries are passed to the KB, with Prolog returning whether or not the queries
are entailed by the KB. The restrictive syntax of Prolog allows one to answer queries using
Prolog’s backward chaining algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Given a goal, such as sister(Joshua, Cindy), which
is constructed from a query, Prolog tries to find substitutions for the goal by using the rules
in the KB. The process of checking if the head of a rule matches a goal is called unification . If
unification succeeds, then the goal is replaced with the atoms from the body of the rule, giving
a new set of sub-goals. The same process is then applied to each of these sub-goals, continuing
recursively until all sub-goals are found as facts in the KB or there are no more rules to apply.
Neural Theorem Provers Neural Theorem Provers (NTPs), proposed by Rocktäschel and
Riedel [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], are a continuous relaxation of the backward chaining algorithm. NTPs can be trained
end-to-end, by calculating the gradient of proof successes with respect to vector representations
of symbols, and are defined in terms of modules [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The recursive expansion upon the goal
is kept track of in proof states, which each contain a neural network that outputs the success
score of the proof so far, and the substitution set. The network is recursively built upon, with
new nodes being added as rules are applied. Every diferent proof path will have a diferent
associated proof state. However, as noted by Rocktäschel and Riedel [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], NTPs sufer from
severe computational limitations. In standard backward chaining, a proof path can be aborted
when unification with a rule fails, but this happens far less in neural backward chaining, since
unification only fails when predicates do not have matching arities. Given a goal such as
sister(Joshua, Cindy), the prover should only consider rules such as the first one below, and
not the second.
Conditional Theorem Provers To address the computational limitations of NTPs, Minervini
et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] propose introducing a new module into the system, select, to reduce the number of
rules being considered when expanding upon a goal. In the or module, instead of considering
every rule  ←  ∈  in the KB  , they only consider each  ←  ∈ select () . This
select module can be implemented by a sequence of diferentiable parameterized functions
select1(), select2(), ..., select () that each, given a goal, produces a sequence of
subgoals. We refer to each of these functions as a goal reformulation module, or simply a reformulator.
      </p>
      <p>
        In summary, a CTP is composed of  reformulators, each of which can represent multiple
rules, and it learns representations for predicates and constants. All model parameters are
initialised randomly, and trained end-to-end via backpropagation. During inference, a CTP
model starts with a goal ( 1,  2), and initialises  = {( 1,  2)} as the set of sub-goals. Then,
recursively up to the given reasoning depth  , the model applies each reformulator to each
of the sub-goals in  , generating a new set of sub-goals from the output of the reformulators
with each recursive step. At each recursive step, the model produces a new proof path for each
reformulator. The model maximises scores over proof paths. At every depth up to the reasoning
depth  , the model unifies every sub-goal from the proof path with the knowledge base of facts,
given in the dataset task. These similarity values are propagated up, with the score of the atom
in the head of a reformulator being set to the minimum similarity of all the atoms in its body.
Evaluation Dataset CLUTRR [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a system for constructing artificial datasets modelling
family relationships. Given a set of family relations, the task is to infer the relationship between
two family members whose relationship is not explicit in the set. To solve this, an agent ought
to be able to induce the logical rules that govern family relationships, and use those rules to
infer the relationship of the query members from the given relations. In particular, CLUTRR
allows for testing an agent’s ability to perform systematic generalization [
        <xref ref-type="bibr" rid="ref12 ref17 ref18">17, 12, 18</xref>
        ]. Sinha
et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] published several generated dataset groups alongside the CLUTRR system, which
we make use of for training and evaluation in this paper. We refer to the training dataset as
1.2,1.3,1.4_train and the testing datasets as 1.2_test, 1.3_test, ..., 1.10_test.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Expressivity and Complexity</title>
      <p>
        The expressivity of a relational learning model is an important consideration for its viability
[
        <xref ref-type="bibr" rid="ref19">19, 20</xref>
        ]. A more expressive model can capture more types of data and relations than a less
expressive one, meaning it can be applied to more complex datasets. Due to the restrictive
nature of the syntax of Prolog, CTPs are already quite limited when it comes to the structure of
expressions upon which they can reason. However, these limitations go further, as the rules that
a CTP model can capture depend on the structure and number of reformulators in the model.
Number of Reformulators A single reformulator can capture any number of rules, provided
that for each rule, the positions of the variables in the rule correspond to the positions of
their representations in the reformulator. In addition, given an atom for the head of a rule, a
reformulator can only capture one rule with the given head. For example, a single reformulator
would be unable to capture both of the following rules, even though the positions of the variables
are the same in both:
This is because a reformulator is a function that maps from  ∈ ℝ  × (ℝ ∪  ) × (ℝ  ∪  ) ,
meaning that each  has to map to a unique output. As such, to fully capture all of the rules in
a knowledge base, we need as many reformulators as there are the maximum number of rules
with the same head atom. We refer to this number as the minimal full expressivity bound. For the
CLUTRR datasets, this number is 5. To see the efect of increasing the number of reformulators,
we fix the train and test reasoning depths, and evaluate the average/maximum test accuracy of
CTPs on on 1.4_test and 1.10_test, using a number of reformulators varying from 1 to 8. The
results are shown in Table 1.
      </p>
      <p>Reasoning Depth CTPs reason up to a pre-defined reasoning depth and then unify the
existing sub-goals with the facts in the knowledge base. This means that, even if the CTP model
has perfectly learned to represent all of the rules in the knowledge base with reformulators,
it still needs a suficient reasoning depth to prove the goal. The required reasoning depth for
a valid proof path is the minimum number of recursive steps down, such that every sub-goal
appears in the knowledge base of facts. The reasoning depth needed to solve a task is thus
the minimum required reasoning depth across all possible proof paths. For the most complex
CLUTRR instance, this number is at most 5. In Table 2, we demonstrate the efect of using
diferent test reasoning depths.</p>
      <p>Computational Issues While we have demonstrated that it is rarely harmful and often
beneficial to increase the reasoning depth and number of reformulators in terms of predictive
accuracy, doing so can lead to higher computational costs. In Appendix A, we provide a
theoretical analysis of the time complexity of CTPs. We consider two base operations that we
wish to count: the number of reformulator applications (data being passed through a neural
network) and the number of sub-goals that need unifying with the knowledge base after the
reasoning depth is reached (comparisons with all fact embeddings using a RBF kernel). The
remainder of the operations in CTPs are either tied into one of these two, or take constant time.
We count these operations with respect to the number of reformulators used  , the reasoning
depth  , and the maximum number of atoms in the body of any reformulator used  . We find
that the time complexity of CTPs as a whole is  (()  ). We illustrate the issues resulting from
this empirically in Appendix B.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Optimization via REINFORCE</title>
      <p>Outline of Solution Since the time complexity of CTPs is  (()  ), optimizing CTP
evaluation time consists of trying to keep each of these variables as low as possible. The maximum
number of atoms in the body of any reformulator used  is almost impossible to reduce, as it is
completely determined by the rules the model is trying to capture in the knowledge base. The
depth  is also challenging to reduce, as a certain depth is required for full expressivity on the
test datasets. Our approach is to keep the number of reformulators  used at each expansion
step as low as possible, minimizing the number of proof paths that need to be considered.
Choosing Reformulators Given that the time complexity of CTPs is  (()  ), having 
reformulators and only selecting  at each expansion step would lead to a time complexity of
 (()  ) instead. Ideally, we would just be able to use  = 1 and learn to select the perfect
reformulator at every expansion step. However, initial experiments demonstrated this to be
an almost impossible task since, for each goal and sub-goal in CLUTRR, we do not know in
advance which proof path will allow us to prove it. Rather, we use an hyperparameter  ∈ ℕ ,
tuned for maximising accuracy and satisfying the computational constraints of the test datasets.
This means that  instead of  reformulators are selected and used at every expansion step in
the reasoning. We refer to the module making these selection decisions as the selection module.
Implementation We adopt REINFORCE [21] to train the reformulation selection module.
Let us first define the corresponding deterministic Markov decision process (, , , ) :</p>
      <p>States. A state in the model represents which sub-goal we are currently considering for
expansion in the proof. It is thus an atom  where  ∈ ℝ  × (ℝ ∪  ) × (ℝ  ∪  ) . In order that the
policy estimator may be implemented by a neural network, we define  ∶= ℝ  × ℝ × ℝ = ℝ3 .
We fixed the value of a variable to be {0 } .</p>
      <p>Actions. The set of possible actions from any state is the reformulators that could be used to
expand upon the sub-goal. With  reformulators, we thus have  = {1, ..., } . Note that as we
are choosing  reformulators for expansion, multiple actions are chosen for a given state by
sampling without repetition from the probability distribution   () .</p>
      <p>Transition Function. After some subset of reformulators is chosen, the CTP model proceeds
as normal, expanding out into a diferent proof path for each reformulator. Thus, we have a
deterministic transition function  defined by this process.</p>
      <p>Rewards. The proof score is an obvious choice for reward signal, as higher proof scores
correspond to the model performing better on positive tasks. However, rather than discounting
future rewards, we can use the fact that proof scores are propagated back up through the CTP
model to have access to the exact score that choosing a reformulator leads to. The reward given
for choosing a reformulator is thus the maximum proof score across all proof paths originating
from the reformulator applied to the current sub-goal.</p>
      <p>The policy network is implemented by a neural network with a single hidden layer, containing
30 hidden nodes. We use a Rectified Linear Unit (ReLU) activation function before the hidden
layer: ReLU() ∶= max(0, ) . The output of the policy network is thus</p>
      <p>() ∶= softmax ( 2 × ReLU( 1 × ))
while the loss is given by () = −
1 ∑=1   × ln(  (  ) ), where  is the batch size,   is the


action chosen in a particular task in the batch,   is the proof score that resulted from the action,
and   (  )  is the probability that action   has in the distribution   (  ). The loss is applied
separately for each of the  actions (reformulators) chosen. Rather than having episodes, we
simply execute the CTP model as usual, applying the policy and calculating the loss at every
expansion step in the reasoning. We refer to this model as ACTP and the original baseline CTP
model as CTP.</p>
      <sec id="sec-4-1">
        <title>ACTP Speedup</title>
        <p>We see that if ACTPs are used, there is at least a theoretical speedup from
the baseline of CTPs. Ideally this would translate into a speedup in wall-clock time as well, but
this is not guaranteed for all datasets and values of  and  . The larger the values of  and  ,
the greater the efect of choosing  from  reformulators will have; this follows directly from
the computational complexity  (()</p>
        <p>). Furthermore, larger and more complex datasets will
also see this efect being more pronounced, as they contain more facts that need to be unified
with the  (()</p>
        <p>) sub-goals once the test depth is reached.</p>
        <p>Counteracting this is the overhead that comes from having to do the selection at each
reasoning step, instead of just applying every reformulator. If the overhead is high enough, then
better wall-clock times might not present themselves for the evaluations we do on CLUTRR.
However, even if this is not the case, we argue that the theoretical speedup of this method
proves its usefulness regardless: eventually the dataset complexity and reasoning depths will
be high enough that the resulting theoretical speedup overcomes the overhead that comes
with the method. Serious computational concerns for CTPs will occur more often for complex
datasets and high reasoning depths, which is exactly when the theoretical speedup becomes an
advantage.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>Other works have already explored models that learn to traverse a knowledge base. Das et al.
[22], Xiong et al. [23] use reinforcement learning to learn inference paths in large knowledge
bases. Both of these works are based upon the path ranking algorithm [24], which uses random
walks with restarts to perform several upper-bounded depth-first searches to find paths along
relations. When combined with elastic-net based learning, the algorithm can learn to choose
paths which are more likely to complete the inference.</p>
      <p>Das et al. [22] propose MINERVA, a method for searching a knowledge graph for
answerproviding paths using reinforcement learning, conditioned on the query. Given a knowledge
graph, it attempts to learn a policy which, given a query of the form predicate(c,  ) , starts
from c and walks over relations (edges in the knowledge graph), choosing a relation at each
step, conditioned upon the query predicate and the walk so far. This is done with reinforcement
learning by trying to maximize the reward: reaching the correct answer constant. Xiong et al.
[23] adopt a similar but slightly simpler approach with DeepPath, which also uses reinforcement
learning to find paths between pairs of constants. However, in contrast to Das et al. [22], they
also condition upon the answer constant while traversing the graph.</p>
      <p>While our proposed method is not exactly what MINERVA and DeepPath do, their existence
and success at least indicate that the problem of learning which reasoning steps to take in a
knowledge base is a solvable one. Our work also runs parallel to that of Asgharbeygi et al.
[25], Crouse et al. [26], both of which use reinforcement learning as a search heuristic to optimise
reasoning, albeit in diferent settings. Finally, we also draw inspiration for our method from the
work of Li et al. [27], who provide motivation for training large transformer models and then
heavily compressing them before testing. In a similar manner, our method aims to utilize a large
number of reformulators when training, and then only use the selected ones during evaluation.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments</title>
      <p>
        Experiment Design We adopt the the hyperparameters in Minervini et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to train our
CTP models. We adopt the procedure of first training the reformulators, and then training the
selection module. This means the selection module is learning to select proof paths over actual
rules in the knowledge base, and we can independently control how long the reformulators and
selection module are trained for. For optimizing hyperparameters, we adopt the same approach
to evaluation and test sets as Minervini et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We perform two diferent optimizations:
the first has hyperparameters tuned on an evaluation set of 1.3_test and is tested on all other
datasets, and the second is tuned on an evaluation set of 1.9_test.
      </p>
      <p>Reformulator Strength Since all reformulators have diferent random initializations, they
all converge to diferent local optima. Thus, it is clear that some reformulators will capture more
rules than others, and that some will learn a particular rule better than the others. Moreover,
certain subsets of reformulators are likely to contribute more to proofs than others, with subsets
of the reformulators that better cover the range of rules in the knowledge base giving higher
accuracies when used for evaluation. We demonstrate this empirically by training 5 and 8
reformulators respectively, across 10 diferent seeds and then using 4 random subsets of 3
reformulators each for evaluation. The average accuracies of the highest and lowest scoring
subsets are 17% and 51% for 5 reformulators, and 11% and 31% for 8 reformulators.</p>
      <p>We see that there is a large discrepancy between the performance of diferent subsets,
indicating that when it comes to maximizing accuracy during evaluation, there are reformulators
whose inclusion in the model is far more important.We also present the following hypothesis: as
the number of reformulators increases, individual reformulators become weaker. More formally:
as we use more reformulators during training, the expected accuracy when using a fixed-size
subset of the reformulators during evaluation decreases. This is an important hypothesis to
note and prove, as it means that the task of selecting the best reformulators for a proof becomes
harder as the total number of reformulators increases. This hypothesis is supported empirically
by the results in Table 3: we trained CTP models with 5 and 8 reformulators respectively,
reporting the average accuracy when 4 random subsets of 3 reformulators were used for evaluation.
Adaptive CTP Evaluation Let  be the number of reformulators trained and  the number
of proof paths expanded during evaluation. We chose to evaluate ACTPs with  ∈ {5, 8} to
measure their efectiveness when more (and individually weaker) reformulators are used and
with  ∈ {2, 3} to measure the efectiveness of ACTPs when they are allowed to expand fewer
proof paths. We refer to such an ACTP model as ACTP- C . This yielded 4 diferent scenarios
for evaluation.</p>
      <p>Only when using  ∶= 5 and  ∶= 3 were ACTPs an improvement upon CTPs, the results of
which are shown in Table 4. Further to this, we see that ACTPs perform significantly worse when
only 2 reformulators are chosen instead of 3. ACTP models that only choose 2 reformulators
are outperformed by the baseline across every dataset. We hypothesise that, for CLUTRR, the
task of learning which two reformulators are the most promising is one that is just too dificult
for the model to find a solution to.</p>
      <p>As shown in Section 6, as the number of reformulators increases, individual reformulators
become weaker. Thus, the task of choosing the optimal reformulators for expansion becomes
more dificult as the number of reformulators increases. This means that, all else being constant,
the accuracy of ACTP models will drop as more reformulators are trained. This efect is ofset
by the increasing expressivity of the model as more reformulators are used. Hence, as expected,
ACTPs consistently perform worse when more reformulators are trained. This is confirmed by
our experimental findings, where ACTP-5C  models outperform ACTP-8C models in every
scenario and across every dataset.</p>
      <p>We also note that tuning hyperparameters on 1.9_test instead of 1.3_test causes baseline CTP
performance to decrease for the simpler datasets but increase for the more complex datasets.
For ACTPs however, this efect appears far less pronounced, with the accuracy of ACTP-5C3
models even dropping slightly for the more complex test datasets, when tuning on 1.9_test
instead of 1.3_test. This indicates that ACTPs are not learning the reasoning patterns needed for
One-tailed unpaired t-test between the baseline of CTPs with 3 reformulators and ACTPs with 5 choosing
3 reformulators. Hyperparameters tuned on 1.3_test
the more complex datasets, even when tuned on such a dataset. It is also possible that ACTPs
could be overfitting to the evaluation set when tuned on 1.9_test. However, since the accuracy
of ACTPs on 1.9_test does not even increase that much when tuning on the dataset, we find the
former explanation to be more likely. As a final point, we note that ACTPs do not appear to
exhibit higher instability in their accuracies than CTPs, with the models showing comparable
levels of variance.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we provided motivation for cases in which CTP models would be required to
have a large number of reformulators and a high reasoning depth, as well as demonstrating
how this leads to computational complexity concerns both theoretically and empirically. We
defined a framework for ACTPs as an extension to CTPs, in which reinforcement learning is
used to learn to select optimal reformulators for expansion during a proof. This allows the
model designer to scale down the number of selected reformulators, such that the computational
constraints of the use case may be met. We noted that certain subsets of reformulators perform
significantly better than others, and that individual reformulators tend to become weaker as the
number of reformulators used in a CTP model increases. This means that the task of selecting
reformulators becomes more dificult as the number of reformulators increases. We evaluated
ACTPs in 4 separate scenarios, which vary in regard to the number of reformulators trained
and the number of reformulators selected. In 1 of these 4 scenarios, we found that ACTPs
outperformed CTPs on 8 out of 9 test datasets. The results demonstrate the usefulness of ACTPs
over CTPs in certain situations, but also highlight their failing to be a categorical improvement
upon CTPs.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was mostly done during Matthew’s MSc in Computer Science at Oxford. As a result,
the following has been written by him. Thanks are first and foremost due to my supervisor, Phil
Blunsom. Thank you for introducing me to the topic of this thesis, for sharing so many ideas
and papers with me, and for the constant and gentle encouragement. To Pasquale Minervini,
the primary creator of CTPs and an outstanding academic at UCL: I convey my most sincere
gratitude. I am still amazed that you were willing to spend so long helping me through all
my technical and theoretical dificulties. You owed nothing to me, which makes me so much
more thankful for all that you did. Thanks are also due to the Oxford Whiteson Research Lab,
for allowing me to use their servers for running experiments, and to Luisa in particular, for
helping me get everything set up. Finally, thank you to Skye Foundation and The Oppenheimer
Memorial Trust for fully funding my degree and expenses while in the UK.
statistical relational learning: The relational dependency network case, Machine Learning
86 (2012) 25–56.
[20] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, G. Bouchard, Complex embeddings for simple
link prediction, in: International Conference on Machine Learning, PMLR, 2016, pp.
2071–2080.
[21] R. J. Williams, Simple statistical gradient-following algorithms for connectionist
reinforcement learning, Machine learning 8 (1992) 229–256.
[22] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola,
A. McCallum, Go for a walk and arrive at the answer: Reasoning over paths in knowledge
bases using reinforcement learning, Proceedings of the 7th International Conference on
Learning Representations (2018).
[23] W. Xiong, T. Hoang, W. Y. Wang, Deeppath: A reinforcement learning method for
knowledge graph reasoning, Conference on Empirical Methods in Natural Language
Processing (EMNLP) (2017).
[24] N. Lao, T. Mitchell, W. Cohen, Random walk inference and learning in a large scale
knowledge base, in: Proceedings of the 2011 conference on empirical methods in natural
language processing, 2011, pp. 529–539.
[25] N. Asgharbeygi, N. Nejati, P. Langley, S. Arai, Guiding inference through relational
reinforcement learning, in: International Conference on Inductive Logic Programming,
Springer, 2005, pp. 20–37.
[26] M. Crouse, I. Abdelaziz, B. Makni, S. Whitehead, C. Cornelio, P. Kapanipathi, K. Srinivas,
V. Thost, M. Witbrock, A. Fokoue, A deep reinforcement learning approach to first-order
logic theorem proving, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 35, 2021, pp. 6279–6287.
[27] Z. Li, E. Wallace, S. Shen, K. Lin, K. Keutzer, D. Klein, J. E. Gonzalez, Train large, then
compress: Rethinking model size for eficient training and inference of transformers, in:
Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020,
pp. 5958–5968.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Computational Complexity of CTPs</title>
      <p>We provide a theoretical analysis of the time complexity of CTPs. We consider two base
operations that we wish to count: the number of reformulator applications (data being passed
through a neural network) and the number of sub-goals that need unifying with the knowledge
base after the reasoning depth is reached (comparisons with all fact embeddings using a RBF
kernel). The remainder of the operations in CTPs are either tied into one of these two, or take
constant time. We count these operations with respect to the number of reformulators used
 , the reasoning depth  , and the maximum number of atoms in the body of any reformulator
used  . We find that the time complexity of CTPs as a whole is  (()</p>
      <p>Let   be the number of sub-goals after the model has applied  recursive expansions upon the
goal. With each recursive step down,  reformulators are applied to every sub-goal, with each
reformulator generating  ()
new sub-goals to be proved. This means  ()
new sub-goals
for each existing sub-goal, so  +1 =  (  × )
. Noting that  0 = 1, representing the query
passed to the model, we see that the number of sub-goals after the test reasoning depth has
been reached is:   =  0 ×  (()

) =  (()</p>
      <p>). Furthermore, at depth  with   sub-goals, there
are   reformulator applications. Thus, the total number of reformulator applications until the
 ).
reasoning depth has been reached is:
  =  0 +  1 +  2 + ... +  −1 =  () +  (()
1) + ... +  (()
−1 ) =  (()
 )
Then since  &gt; 1 and  &gt; 1 for any CTP model of non-trivial complexity,  
Thus, as   =   , we state that the time complexity for CTPs as a whole is  (()
=  (()
 ). The main
 ).
issue with this time complexity is the raising to the power of  . For a fixed depth  , the time
complexity is a  -degree polynomial in  and  , which can also cause dificulties for a suitably
large value of  . We have already provided motivation for using a test reasoning depth of 5 on
CLUTRR, meaning that this would become degree 5 polynomial.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Computational Issues of CTPs</title>
      <p>To demonstrate the computational issues of CTPs in practice, we compute the average evaluation
time across all CLUTRR datasets, noting that the results are not entirely stable due to them
being evaluated and aggregated across a variety of machines. We show this with respect to the
number of reformulators in Fig. 1, and with respect to the test reasoning depth in Fig. 2. For
reference, the longest evaluation time was for a test reasoning depth of 5, with 5 reformulators,
on the 1.10_test dataset. It took 16.7 hours to evaluate.</p>
    </sec>
    <sec id="sec-11">
      <title>C. ACTP Speedup</title>
      <sec id="sec-11-1">
        <title>Wall-clock Speedup</title>
        <p>In Fig. 3, we compare the evaluation times of ACTPs and baseline CTPs,
with the baseline operating on 3 and 8 reformulators respectively. As expected, the overhead
caused by the copying, masking, and other operations needed in an ACTP model led to it
taking significantly longer to evaluate than CTP-3. However, the overhead was low enough
for ACTP-8C3 to take less time to evaluate than CTP-8 across all datasets. The efect becomes
more pronounced as the complexity of the dataset increases, since the number of facts in the
knowledge base to unify the sub-goals with increases, which has a significant efect on the
computational complexity of the model.</p>
        <p>As the dataset complexity continues to increase. the overhead will become more negligible,
leading the evaluation time of ACTP- C models to tend to those of CTP- models. The
increasing gap between the evaluation time of ACTP-8C3 and CTP-8 in Fig. 3 is a clear visual
illustration of this trend. Datasets always taking longer to evaluate on than others is explained
by the size of the datasets. For example, despite 1.6_test being a more complex dataset than
1.5_test, it only contains 104 tasks, compared to the 184 of 1.5_test.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>D. Model Hyperparameters</title>
      <p>In this appendix, for reproducibility, we provide the hyperparameters used for each of our
evaluations.</p>
      <sec id="sec-12-1">
        <title>D.1. Fixed CTP Hyperparameters</title>
        <sec id="sec-12-1-1">
          <title>Name Value</title>
        </sec>
      </sec>
      <sec id="sec-12-2">
        <title>D.3. Reasoning Depth</title>
      </sec>
      <sec id="sec-12-3">
        <title>D.4. Reformulator Subsets</title>
        <sec id="sec-12-3-1">
          <title>Name</title>
        </sec>
        <sec id="sec-12-3-2">
          <title>Values</title>
        </sec>
      </sec>
      <sec id="sec-12-4">
        <title>D.2. Number of Reformulators</title>
        <sec id="sec-12-4-1">
          <title>Name Values</title>
          <p>seed
subset
hops
seed
hops
seed
hops
1-30
4
1-3
1-5
1-10
4
1-8 reformulators, each with 2
atoms in the body. For example,
4 reformulators is denoted by: “2
2 2 2”
5 reformulators, each with 2
atoms in the body, denoted by:
“2 2 2 2 2”
5, 8 reformulators, each with 2
atoms in the body, denoted by:
“2 2 2 2 2” and “2 2 2 2 2 2 2 2”
respectively
Use reformulators with the
following indices for evaluation: {
[0 1 2], [2 3 4], [0 3 4], [1 2 4] }</p>
        </sec>
      </sec>
      <sec id="sec-12-5">
        <title>D.5. ACTPs 5 Reformulators</title>
        <sec id="sec-12-5-1">
          <title>Name Values</title>
        </sec>
      </sec>
      <sec id="sec-12-6">
        <title>D.6. ACTPs 8 Reformulators</title>
        <p>hops
seed
hops
seed
1-30
4</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>E. Algorithm Pseudocode</title>
      <p>Here follows the pseudocode for some of the algorithms used in this paper.
1: or(, ) = {
2: and(_, FAIL) = FAIL
9:
Algorithm 1: Backward chaining</p>
      <p>In the code,  is the knowledge base containing the rules and facts, sets are denoted with
curly brackets, lists are denoted with square brackets, an underscore matches any
argument,  refers to a goal,  ̂ to a set of sub-goals,  to a substitution set,  to the body
of a rule, and  to the head of a rule. To check if a goal  1 holds true, one needs to get
the output of or( 1, []). If the output contains a substitution set, then the query is true,
otherwise it will only contain the value FAIL and the query cannot be proven.
′ ∣  ′ ∈ and(, unify( , , ))</p>
      <p>for each  ←  ∈  }
″ ∣  ″ ∈ and(,  ̂ ″) ∀ ′ ∈ or(substitute(, ), )}
10: substitute([], _) = []
11:
= unify( ,</p>
      <p>unify(ℎ ∶  ,  ∶ , )
⎨
⎪
⎩
⎧  ∪ {ℎ/}
⎪  ∪ {/ℎ}</p>
      <p>if ℎ ∈ 
if  ∈  , ℎ ∉</p>
      <p>if  = ℎ
otherwise
⎫
⎪
⎬
⎪
⎭</p>
      <p>)
substitute( ∶ , ) =
{

otherwise
}
Algorithm 2: Neural backward chaining</p>
      <p>
        The code is based on the summary by Minervini et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].  is a goal,  is the reasoning
depth,  is the head of a rule,  is the body,  is a knowledge base containing rules and
facts,  is the RBF kernel,  is a proof state,   is a substitution set,   is a proof score,
and  is a set of variables.
      </p>
      <p>/* Try use any rule in KB to prove */
def or(, ,  )
for  ←  ∈ 
do</p>
      <p>end
end
end
def and(, ,  )
if  = [] or  = 0
then
yield 
for  ∈ and(, , unify( , , ))
yield</p>
      <p>do
/* Empty body or reasoning depth reached */
else
for  ′ ∈ or(sub( 0,   ),  − 1, ) /* Apply substitution to body, then try
to prove each atom within */
do
for  ″ ∈ and( 1∶, ,  ′) do</p>
      <p>yield  ″
end
end
end
end
def unify( , ,  = (</p>
      <p>,   ))
⎧⎪ {  /  } if   ∈  ⎫⎪
  = ⎩⎪⎨ {  ∅/  } if  o th∈e ,rwi se∉  ⎪⎭⎬
 ′ =   ⋃  
return  ′ = ( ′,  ′)
 ′ = min{  }⋃  ,  ∉ { (   ,    )}}
similarity across all representations */</p>
      <p>/* Extend the substitution set */
/* Similarity value is the minimum</p>
      <sec id="sec-13-1">
        <title>Algorithm 3: REINFORCE</title>
        <p>In this algorithm:  represents the model parameters,  is the number of episodes,  is
the number of episodes per batch,  is the number of steps in an episode,  is the discount
factor used to make further away rewards worth less,   ()  gives the probability of
action  from the distribution produced by   applied to  , and  is the learning rate.
for  episodes in batch do
for  ∈ 1, ...,  do
end
 ∶=  + 1
end
Calculate policy loss for entire batch: () ∶= −</p>
        <p>Update policy:  ∶=  + ∇()
end
1 ∑=1 ln(    ()   )

Generate episode  0,  0,  0, ...,   ,   ,   using the policy   to output probability
distributions which are then sampled from to get actions</p>
        <p>Calculate discounted rewards from each state:   ∶= ∑= 

def get_best_model(model, train_set, hyperparameter_grid, evaluation_set)
best_model = None
best_accuracy = -1
for hyperparameter_set ∈ hyperparameter_grid do
train_model(model, train_set)
accuracy_value = accuracy(model, evaluation_set)
if accuracy_value &gt; best_accuracy then
best_accuracy = accuracy_value
best_model = model
end
end
return best_model
end
end
def get_best_seeded_model(model, train_set, hyperparameter_grid, evaluation_set)
pick 6 seeds  = { 0, ...,  5} at random
include seeds  in hyperparameter_grid
return get_best_model(model, train_set, hyperparameter_grid, evaluation_set)
def get_model_accuracy(model, train_set, evaluation_set, test_sets)
initialize hyperparameter_grid
total_accuracy = 0
for 5 iterations do
best_model = get_best_seeded_model(model, train_set, hyperparameter_grid,
evaluation_set)
test_accuracies = accuracy(best_model, test_sets)
total_accuracy += test_accuracies
end
return total_accuracy / 5
end
get_model_accuracy(model, train_set, 1.3_test, test_sets)
get_model_accuracy(model, train_set, 1.9_test, test_sets)
Algorithm 5: ACTP selection module
 is a batch of states,  is a batch of actions, and  a batch of rewards.  denotes the
policy estimator and  is the number of reformulators the module ought to select.
sample_with_replacement(, , ) draws  samples from  with replacement, using the
probability distribution  .
def get_actions( )
 ∶= e()
selected_reformulators = []
reformulator_counts = []
end
selected_reformulators.append(indices)
update reformulator_counts using indices</p>
        <p>/* A batch of probability distributions */
end
end</p>
        <p>return selected_reformulators, reformulator_counts
def apply_reward(, ,  )
 ∶= get_loss(, ,  )
optimizer.apply_loss( , retain_graph = True)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. d.</given-names>
            <surname>Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Besold</surname>
          </string-name>
          , L. De Raedt,
          <string-name>
            <given-names>P.</given-names>
            <surname>Földiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Icard</surname>
          </string-name>
          , K.-U. Kühnberger,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Miikkulainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Neural-symbolic learning and reasoning: contributions and challenges</article-title>
          , in: 2015 AAAI Spring Symposium Series,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Diferentiable learning of logical rules for knowledge base reasoning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>2316</fpage>
          -
          <lpage>2325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          ,
          <article-title>Learning explanatory rules from noisy data</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>61</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadeghian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Armandpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Drum:
          <article-title>End-to-end diferentiable rule mining on knowledge graphs</article-title>
          ,
          <source>Proc. of NIPS</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Minervini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bošnjak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , E. Grefenstette,
          <article-title>Diferentiable reasoning on large knowledge bases and natural language</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>5182</fpage>
          -
          <lpage>5190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>End-to-end diferentiable proving</article-title>
          ,
          <source>in: NIPS</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3788</fpage>
          --
          <lpage>3800</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gallaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Minker</surname>
          </string-name>
          ,
          <article-title>Logic and data bases, symposium on logic and data bases, centre d'études</article-title>
          et de recherches de toulouse,
          <year>1977</year>
          ,
          <article-title>Advances in Data Base Theory (</article-title>
          <year>1978</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Minervini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stenetorp</surname>
          </string-name>
          , E. Grefenstette,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <article-title>Learning reasoning strategies in end-to-end diferentiable proving</article-title>
          , in: ICML, volume
          <volume>119</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6938</fpage>
          -
          <lpage>6949</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lawrence Zitnick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Clevr: A diagnostic dataset for compositional language and elementary visual reasoning</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2901</fpage>
          -
          <lpage>2910</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Murty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Noukhovitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , H. de Vries, A. Courville,
          <article-title>Systematic generalization: what is required and can it be learned?</article-title>
          ,
          <source>ICLR</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <article-title>Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2873</fpage>
          -
          <lpage>2882</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sodhani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pineau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. L.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <article-title>Clutrr: A diagnostic benchmark for inductive reasoning from text</article-title>
          ,
          <source>Empirical Methods of Natural Language Processing (EMNLP)</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Stickel</surname>
          </string-name>
          ,
          <article-title>A prolog technology theorem prover: Implementation by an extended prolog compiler</article-title>
          ,
          <source>Journal of Automated reasoning 4</source>
          (
          <year>1988</year>
          )
          <fpage>353</fpage>
          -
          <lpage>380</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Merritt</surname>
          </string-name>
          ,
          <source>Building expert systems in Prolog</source>
          , Springer Science &amp; Business
          <string-name>
            <surname>Media</surname>
          </string-name>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fodor</surname>
          </string-name>
          ,
          <article-title>Natural language processing with prolog in the ibm watson system, The Association for Logic Programming (ALP) Newsletter 9 (</article-title>
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <article-title>Learning to compose neural networks for question answering</article-title>
          ,
          <source>NAACL</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Chomsky</surname>
          </string-name>
          ,
          <article-title>Logical structures in language, American Documentation (pre-</article-title>
          <year>1986</year>
          )
          <volume>8</volume>
          (
          <year>1957</year>
          )
          <fpage>284</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gontier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Measuring systematic generalization in neural proof generation with transformers</article-title>
          ,
          <source>NeurIPS'20</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gutmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shavlik</surname>
          </string-name>
          ,
          <article-title>Gradient-based boosting for</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>