<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Seventh Workshop on Practical Aspects of Automated Reasoning</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Directed Graph Networks for Logical Reasoning (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Rawson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giles Reger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>2</volume>
      <fpage>9</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>We introduce a neural model for approximate logical reasoning based upon learned bi-directional graph convolutions on directed syntax graphs. The model avoids inflexible inductive bias found in some previous work on this domain, while still producing competitive results on a benchmark propositional entailment dataset. We further demonstrate the generality of our work in a first-order context with a premise selection task. Such models have applications for learned functions of logical data, such as in guiding theorem provers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Dataset Each example consists of two propositional formulae A and B together with a binary output variable y. The task is to predict logical entailment: whether or not
A |= B holds in classical propositional logic. A and B use only propositional variables and the
connectives {¬, ∧, ∨, ⇒} with the usual semantics. The dataset provides training, validation
and test sets, with the test set split into several categories: “easy”, “hard”, “big”, “massive” and
“exam”. The “massive” set is of particular interest to us as it contains larger entailment problems,
closer to the very large problems encountered by today’s systems.</p>
      <p>
        Previous Work PossibleWorldNet is introduced alongside the dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as a possible solution
to the task: an unusual neural network architecture making use of algorithmic assistance in
generating repeated random “worlds” to test the truth of the entailment in each world, in a
similar way to model-based heuristic SAT solving. This approach performs exceptionally well,
but does suffer from inflexible inductive bias: it is unclear how this model would perform on
harder tasks without a finite number of possible worlds, or tasks where model-based heuristics
don’t perform as well. Tending instead toward a purely-neural approach, Chvalovský introduces
TopDownNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a recursively-evaluated neural network with impressive results on this dataset.
Graphical representations have been used with some success for logical tasks: Olšák et al.
introduce a model based on message-passing networks working on hypergraphs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while
Paliwal et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] use undirected graph convolutions for a higher-order task. Crouse et al.
show that a particular form of subgraph pooling [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] improves the state of the art on two logical
benchmarks for graph neural networks. An interesting effort related to the propositional task
is that of NeuroSAT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a neural network that learns to solve SAT problems presented in
conjunctive normal form. We are aware of other similar work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] developed concurrently: the
work achieves good results by adding a fixed number of edge labels and edge convolutions to
the model, in exchange for additional complexity and artificially limiting e.g. function arity.
Contributions Our main contribution is a graph neural network model working directly
on logical syntax that performs well on benchmark datasets, while remaining subjectively
simple and flexible. Suitably-designed input representations retain all relevant information
required to reconstruct a logically-equivalent input. To achieve this we utilise a bi-directional
convolution operator working over directed graphs and experiment with different architectures
to accommodate this approach. Strong performance is shown on the propositional entailment
dataset discussed above. Progressing to first-order logic, we also demonstrate a lossless
first-order encoding method and investigate the performance of an identical network architecture.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Input Encoding</title>
      <p>Directed acyclic graphs (DAGs) are a natural, lossless representation for most types of logical
formulae of which the authors are aware, including modal, first-order and higher-order logics, as well
as other structural data such as type systems or parsed natural language. A formula-graph is
formed by taking a syntax tree and merging common sub-trees, followed by mapping distinct
named nodes to nameless nodes that remain distinct: an example is shown in Figure 1 ((a) the
syntax tree, (b) the named DAG, (c) the nameless DAG). Note
that because distinct symbols and variables are represented by different nodes in the encoding,
symbols and variables can be distinguished by their position in the graph, even though concrete
names such as P or Q are discarded.</p>
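      <p>As a purely illustrative sketch (not the authors' released implementation), the de-duplication step can be realised by hash-consing: identical sub-trees are interned once and re-used. The nested-tuple formula representation and the function names below are assumptions made for this example only.</p>
      <preformat>
# Illustrative sketch: build a nameless syntax DAG from a nested-tuple formula
# by merging identical sub-trees (hash-consing).
def formula_to_dag(formula):
    """Return (labels, edges): node labels and directed parent-to-child edges."""
    labels, edges, index = [], [], {}   # index maps a structural key to a node id

    def intern(key, label):
        if key not in index:
            index[key] = len(labels)
            labels.append(label)        # concrete variable names are dropped here
        return index[key]

    def walk(f):
        if isinstance(f, str):                       # a propositional variable
            return intern(("var", f), "var")
        op, *args = f                                # e.g. ("and", lhs, rhs)
        child_ids = tuple(walk(a) for a in args)
        key = (op, child_ids)
        fresh = key not in index
        nid = intern(key, op)
        if fresh:                                    # shared sub-trees add no new edges
            edges.extend((nid, c) for c in child_ids)
        return nid

    walk(formula)
    return labels, edges

# (¬P ∧ Q) ∨ ¬¬P: both occurrences of P map to a single node.
labels, edges = formula_to_dag(("or", ("and", ("not", "P"), "Q"), ("not", ("not", "P"))))
      </preformat>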
      <p>
        Such graphs have previously been used for problems such as premise selection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or search
guidance of automatic theorem provers [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It should be noted that the acyclic property of
these graphs does not seem to be particularly important — it simply happens that convenient
representations are acyclic. This representation has several desirable properties:
Compact size. Sufficiently de-duplicated syntax DAGs have little to no redundancy, and in
pathological cases the DAG is exponentially smaller than the corresponding syntax tree.
      </p>
      <p>Shared processing of redundant terms. Common sub-trees are mapped to the same DAG
node, so models that work on the DAG can identify common sub-terms trivially.
Bounded number of node labels. By use of nameless nodes, a finite number of different
node labels are found in any DAG. This allows for simple node representations and does
not require a separate textual embedding network, although this can be employed.</p>
      <sec id="sec-2-1">
        <title>Natural representation of bound variables. Representing bound variables such as those</title>
        <p>
          found in first-order logic can be difficult [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] — this representation side-steps most, if not
all, of these issues and naturally encodes α-equivalence.
        </p>
        <p>
          One drawback of such DAGs as a representation for logical formulae is that they lack ordering
among node children: with a naïve encoding, the representation for A ⇒ B is the same as
B ⇒ A, but the two are clearly not equivalent in general. The same problem also arises with
first-order terms: f(x, y) is indistinguishable from f(y, x) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. However, this problem can be
removed by use of auxiliary nodes and edges such that an ordering can be retrieved, as shown
in Section 5. For the propositional dataset, the classical equivalence A ⇒ B ≡ ¬A ∨ B is used
to rewrite formulae, avoiding ordering issues. We also recast the entailment problem A |= B as
a satisfiability problem: is A ∧ ¬B unsatisfiable? These methods reduce the total number of
node labels used (4 in total — one for propositional variables, and one for each of {¬, ∧, ∨}),
and allow the network to re-use learned embeddings and filters for the existing operators.
        </p>
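        <p>Continuing the same illustrative nested-tuple representation, this pre-processing can be sketched as follows; the function names are ours, not taken from the released code.</p>
        <preformat>
# Illustrative sketch of the pre-processing described above:
# rewrite A ⇒ B as ¬A ∨ B, then pose A |= B as "is A ∧ ¬B unsatisfiable?".
def eliminate_implications(f):
    if isinstance(f, str):
        return f
    op, *args = f
    args = [eliminate_implications(a) for a in args]
    if op == "implies":
        a, b = args
        return ("or", ("not", a), b)
    return (op, *args)

def entailment_as_unsat(premise, conclusion):
    """A |= B holds exactly when A ∧ ¬B is unsatisfiable."""
    return ("and", eliminate_implications(premise),
                   ("not", eliminate_implications(conclusion)))
        </preformat>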
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Model</title>
      <p>We introduce and motivate a novel neural architecture for learning based on DAG
representations of logical formulae. Unusual neural structures were found to be useful, and are described
first, before being combined into the model architecture.
      <sec id="sec-3-1">
        <title>3.1. Bi-directional Graph Convolutions</title>
        <p>
          We assume the input DAG is a graph (X, A) where X is the node feature matrix and A is
the directed graph adjacency matrix. Various graph convolution operators [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] — denoted
conv(X, A) here as an arbitrary operator — have enjoyed recent success. These generalise the
trainable convolution operators found in image-processing networks to work on graphs, by
allowing each layer of the network to produce an output node per input node based on the input
node’s existing data and that of neighbouring nodes connected with incoming edges. This can be
seen as passing messages around the graph: with k convolution layers, a conceptual “message”
may propagate k hops across the graph. Here, we use the standard convolutional layer found
in Graph Convolutional Networks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This operator suffers from a shortcoming (illustrated
in Figure 2) on DAGs such as those used here: information will only pass in one direction
through the DAG, as messages propagate only along incoming edges. Unidirectional messages
are not necessarily a problem: bottom-up schemes such as TreeRNNs [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] exist, Chvalovský
uses a top-down approach [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and cyclic edges are another possible solution. However, to
play to the strengths of the graphical approach, the ideal would be to pass messages in both
directions, with messages from incoming and outgoing edges dealt with separately. It is possible
to simply make the input graph undirected, but this approach discards much of the crucial
encoded structure and was not found to perform much better than chance on the propositional
task. Instead, a bi-directional convolution is one possible solution:
        </p>
        <p>biconv(X, A) = conv(X, A) ‖ conv(X, Aᵀ)
where the ‖ operator denotes feature concatenation. By convolving in both edge directions
(with disjoint weights) and concatenating the node-level features produced, information may
flow through the graph in either direction while retaining edge direction information. Disjoint
weights at each convolutional layer for “forward” and “backward” messages allow the network
to learn different behaviours for each direction. A concern with the use of bi-directional
convolution in deep networks is that each convolutional layer must decrease output feature size
by a factor of at least 2 (e.g. by taking the sum of both directions) in order to avoid exponential
blowup in the size of feature vectors as they propagate through the network. Due to the
use of a DenseNet-style block with feature reduction built-in, this was not an issue here.</p>
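        <p>A minimal sketch of the bi-directional convolution, assuming PyTorch Geometric's GCNConv as the underlying operator; the class and parameter names below are illustrative rather than taken from the released implementation.</p>
        <preformat>
import torch
from torch_geometric.nn import GCNConv

class BiConv(torch.nn.Module):
    """biconv(X, A) = conv(X, A) ‖ conv(X, Aᵀ), with disjoint weights per direction."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.forward_conv = GCNConv(in_channels, out_channels)
        self.backward_conv = GCNConv(in_channels, out_channels)

    def forward(self, x, edge_index):
        # Flipping the rows of edge_index reverses every edge, i.e. transposes A.
        reversed_index = edge_index.flip(0)
        out_forward = self.forward_conv(x, edge_index)
        out_backward = self.backward_conv(x, reversed_index)
        # Concatenate per-node features from the two directions.
        return torch.cat([out_forward, out_backward], dim=-1)
        </preformat>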
      </sec>
      <sec id="sec-3-2">
        <title>3.2. DenseNet-style blocks</title>
        <p>
          Recent trends in deep learning for image processing suggest that including shorter “skip”
connections between earlier stages and later stages in a deep convolutional network can be
beneficial [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. DenseNets [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] take this to a logical extreme, introducing direct connections
from any layer in a block to all subsequent layers. We found a graphical analogue of this style of
architecture very useful. Suppose that X_{ℓ−1} is the input of some convolutional layer conv_ℓ. Then,
by analogy with DenseNets, conv_ℓ should also be given the outputs of previous layers as input:
        </p>
        <p>X_ℓ = conv_ℓ(X_0 ‖ X_1 ‖ … ‖ X_{ℓ−1}, A)
However, in later layers this node-level input vector becomes very large for a
computationally expensive convolutional layer such as conv_ℓ. DenseNets also include measures designed to reduce
the size of inputs to convolutional layers, such as 1 × 1 convolutions. We include an analogous
“compression” fully-connected layer h_ℓ, which reduces the input size before convolution by
allowing the network to project relevant node features from previous layers:</p>
        <p>X_ℓ = conv_ℓ(h_ℓ(X_0 ‖ X_1 ‖ … ‖ X_{ℓ−1}), A)</p>
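        <p>One such compressed dense layer might be sketched as follows, reusing the BiConv module from the previous sketch; the channel sizes are illustrative, and batch normalisation and activations (described in Section 3.4) are omitted here.</p>
        <preformat>
import torch

class CompressedDenseLayer(torch.nn.Module):
    """X_ℓ = conv_ℓ(h_ℓ(X_0 ‖ … ‖ X_{ℓ−1}), A), with h_ℓ a fully-connected compression."""
    def __init__(self, accumulated_channels, channels):
        super().__init__()
        self.compress = torch.nn.Linear(accumulated_channels, channels)   # h_ℓ
        # Concatenating both directions restores `channels` output features,
        # avoiding the factor-of-2 blowup discussed in Section 3.1.
        self.conv = BiConv(channels, channels // 2)

    def forward(self, xs, edge_index):
        # xs: node-feature tensors X_0 … X_{ℓ−1} from all previous layers.
        x = torch.cat(xs, dim=-1)
        return self.conv(self.compress(x), edge_index)
        </preformat>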
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Graph Isomorphism Networks and Pooling</title>
        <p>
          It has been shown that the standard graph convolution layer is incapable of distinguishing
some types of graph [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Since logical reasoning is almost entirely about graph structure and is
known to be computationally hard, it was expected that the more powerful Graph Isomorphism
Networks [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] would produce better results, but the isomorphism operator did not outperform
the baseline operator in experiments. Similarly, localised pooling is well-known to be useful
in image processing tasks, and its graphical analogues such as top-k pooling [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and edge
contraction pooling [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] also perform well on some benchmark tasks. These also appear useful
here, perhaps corresponding to the human approach of simplifying sub-formulae. However,
these also did not improve performance, possibly due to the lack of redundancy in formula
graphs. Further investigation into integrating these powerful methods is left as future work.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Architecture</title>
        <p>
          A simple neural architecture is described. Batch normalisation (BN) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is inserted before
convolutional and fully-connected layers, and rectified linear units (ReLU) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] are used as
nonlinearities throughout, except for the embedding layer (no activation) and the output layer.
Embedding. An embedding layer maps one-hot input node features into node features of the
size used in convolutional layers.
        </p>
        <p>Dense Block. DenseNet-style convolutional layers follow, including the fully-connected layer
(FC) so that each layer consists of ReLU-BN-FC-ReLU-BN-BiConv. Only one block is used,
with each layer using all previous layers’ outputs.</p>
        <p>Global Average Pooling. At this point the graph is collapsed via whole-graph average pooling
into a single vector. Passing forward outputs from all layers in the dense block to be
pooled was found to stabilise and accelerate training significantly.</p>
        <p>Output Layer. A fully-connected layer produces the final classification output.
A relatively large number of convolutional layers — 48 — are included in the dense block, for
both theoretical and practical reasons. Theoretically, if information from one part of the graph
must be passed to another some distance away in order to determine entailment or otherwise,
then a greater number of layers can prevent the network running out of “hops” to transmit this
information. Practically, more layers were found to perform better, particularly on the larger
test categories, confirming the theoretical intuition. In principle there is no limit to the number
of layers that might be gainfully included.</p>
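        <p>For concreteness, a rough end-to-end sketch of the architecture just described is given below, omitting batch normalisation and ReLU placement for brevity; the layer count and channel sizes are illustrative defaults rather than the tuned values.</p>
        <preformat>
import torch
from torch_geometric.nn import global_mean_pool

class EntailmentNet(torch.nn.Module):
    def __init__(self, node_labels=4, channels=64, layers=48):
        super().__init__()
        self.embed = torch.nn.Linear(node_labels, channels)          # no activation
        self.block = torch.nn.ModuleList(
            [CompressedDenseLayer((l + 1) * channels, channels) for l in range(layers)]
        )
        self.output = torch.nn.Linear((layers + 1) * channels, 1)

    def forward(self, x, edge_index, batch):
        xs = [self.embed(x)]                    # one-hot node labels to embeddings
        for layer in self.block:
            xs.append(layer(xs, edge_index))    # each layer sees all previous outputs
        # Whole-graph average pooling over the outputs of every layer.
        pooled = global_mean_pool(torch.cat(xs, dim=-1), batch)
        return self.output(pooled)              # entailment logit
        </preformat>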
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup and Results</title>
      <p>
        Source code for an implementation using the PyTorch Geometric [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] extension library for
PyTorch [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] is available at https://github.com/MichaelRawson/gnn-entailment.
      </p>
      <p>
        Training Training setup generally follows that suggested for DenseNets [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: the network
is trained using stochastic gradient descent with Nesterov momentum [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and weight decay,
with the suggested parameters. Parameter initialisation uses PyTorch’s defaults: “Xavier”
initialisation [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] for convolutional weights and “He” initialisation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] for fully-connected
weights. A cyclic learning rate [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] was found to be useful for this model — we applied a learning
rate schedule (“exp_range” in PyTorch) in which the learning rate cycles between minimum
and maximum learning rates over a certain number of minibatches, while these extremes
themselves decay over time. Training continued until validation loss ceased to obviously
improve. See Table 2 for training parameter details.
      </p>
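      <p>The training configuration described above corresponds roughly to the following PyTorch sketch; the concrete values are illustrative stand-ins for the parameters listed in Table 2.</p>
      <preformat>
import torch

model = EntailmentNet()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)
# Cyclic learning rate whose extremes themselves decay over time ("exp_range").
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimiser, base_lr=1e-4, max_lr=0.1,
    step_size_up=2000,                 # minibatches per half-cycle
    mode="exp_range", gamma=0.99994)

# Per minibatch: compute the loss, then
#   loss.backward(); optimiser.step(); scheduler.step(); optimiser.zero_grad()
      </preformat>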
      <p>
        Augmentation No data augmentation is used as the dataset is relatively large already, and
further it is unclear what augmentation would be applied: the “symbolic vocabulary permutation”
approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is not applicable here due to the nameless representation, but randomly altering the
structure of the graph does not seem useful as it could well change the entailment value unintentionally.
One could imagine a semantic augmentation in which A is made stronger or B weaker — this
would produce data augmentation without invalidating the entailment value.
Reproducibility Results are reproducible, but with caveats. Training runs performed on a
CPU are fully deterministic, but tediously slow. Conversely, training runs performed on a GPU
are not fully deterministic (an unfortunate consequence of GPU-accelerated “scatter” operations: see
https://pytorch.org/docs/stable/notes/randomness.html), but are significantly accelerated. The results reported here are
obtained with a GPU; repeated runs produce comparable results in practice. This is a
significant limitation of this work that we hope to address if and when a suitable deterministic
implementation becomes available.
      </p>
      <p>Results Experimental results are shown in Table 3. Results reported from PossibleWorldNet
and TopDownNet ( = 1024) are also included verbatim, without reproduction, for comparison.
Test scores of the best-performing model on each data split are highlighted. Results show that
our model is competitive on the test categories, both with algorithmically-assisted approaches
(PossibleWorldNet) and with a purely neural approach (TopDownNet). The model significantly
outperforms both on the “massive” test category.</p>
      <p>Discussion We conjecture that our model generalises to some degree the approach taken
with TopDownNet. In our model arbitrary message-passing schemes within the entire DAG
are permitted, rather than TopDownNet’s strict top-down/recurrent approach, which may
go some way to explaining the difference in performance. However, the relationship with
PossibleWorldNet is less clear-cut, and this is reflected in results: PossibleWorldNet remains
unbeaten on the “hard” and “big” categories, but is surpassed on all others.</p>
    </sec>
    <sec id="sec-5">
      <title>5. First-Order Logic</title>
      <p>
        We demonstrate the flexibility and generality of our approach by also applying the same model
without adaptation or tuning to a different dataset expressed in first-order logic.
Dataset We employ the Mizar/DeepMath premise-selection dataset [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] used in the evaluation
of the subgraph-pooling models of Crouse et al. and the hypergraph model of Olšák et al. The
task is to predict whether or not a given premise is required for a given conjecture, both
expressed in full first-order logic. The dataset reports a baseline accuracy of 71.75%, while the best
subgraph-pooling model achieves 79.9%. Olšák et al. report 80% but do not take a standard
evaluation approach, and further employ clausification. It is unclear to what extent clausification
helps or hinders machine learning approaches on this dataset.
      </p>
      <p>Representation A similar input representation to that in the propositional case is used here.
However, argument order in function and predicate application must be preserved in order to
maintain a lossless representation. This is achieved by use of an auxiliary “argument node”
for each argument in an application, connected by edges indicating the order of arguments,
shown in Figure 3. Quantifier nodes have two children: the variable which they bind, and the
sub-formula in which the variable is bound. More space-efficient or otherwise performant graph
representations are a possibility left as future work. 17 node types are used in total.
Training and Results We used an identical configuration as with the propositional case: it
is possible that with some tuning better performance can be produced. We do however note
that using fewer layers (down to around 24 — half of the original number), did not seem to
hurt performance for this benchmark and significantly reduced computation requirements and
GPU memory usage. Data was split as suggested3 at the conjecture level into 29,144 training
conjectures and 3252 testing conjectures (we reserve a validation set of 128 conjectures). The
model achieves a classification accuracy of 79.8% on the unseen test set. On the test set the
model can classify premises for tens of conjectures per second4, so the network is unlikely to
be a bottleneck in classic premise-selection settings.</p>
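      <p>One possible realisation of the argument-node scheme described above is sketched below; the auxiliary label and exact edge layout are our assumptions for illustration, not necessarily the precise encoding used.</p>
      <preformat>
def encode_application(symbol_node, argument_nodes, labels, edges):
    """Attach ordered arguments to an applied symbol via auxiliary argument nodes."""
    previous = symbol_node
    for argument in argument_nodes:
        marker = len(labels)
        labels.append("argument")          # one auxiliary node per argument position
        edges.append((previous, marker))   # the chain of markers encodes the order
        edges.append((marker, argument))   # each marker points at its actual argument
        previous = marker
    return labels, edges
      </preformat>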
      <p>Figure 3 shows example encodings of (a) a first-order term and (b) a quantified formula.
Discussion The network achieves performance significantly above the baseline, comparable
with Crouse et al. and Olšák et al. without modification from the propositional task. We
consider this a good result, suggesting that the network architecture is able to perform without
adaptation on more complex tasks expressed in different logics.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>We explore directed-graph representations and a new architecture for logical approximation
tasks and show that they have a number of advantages — notably simplicity — and good
performance characteristics. The approach can work over many different logics in principle, and
our experiments suggest this holds in practice. Performance on other logics of interest,
such as higher-order or description logics, is left as future work. The network does not utilise
any algorithmic assistance as PossibleWorldNet does, yet achieves competitive performance
— this allows the network to process similar tasks which do not have a useful concept of
“possible worlds”. Combining our work with the best of other approaches, such as using the
densely-connected network architecture with hypergraph methods, is a promising direction.</p>
      <p>
        In some applications, such as guiding automatic theorem provers, network prediction
throughput is crucial. High-performance automatic theorem prover internals typically use a graphical
representation [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], so graphs are a natural choice for these structures. Additionally, graph
neural networks parallelise [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] somewhat more naturally than previous approaches such as
TreeNets, suggesting that this style of network may be more applicable to these domains.
      </p>
      <p>Much future work is possible. No systematic effort has been made to tune network
hyperparameters or overall architecture yet. In particular, we suspect that multiple dense blocks
might use fewer parameters or perform better than one large block. A hybrid skip-connection
approach, such as connecting smaller dense blocks with residual connections, is of particular
interest to us as it may reduce computational cost significantly. Other convolution methods
and the conspicuous absence of local pooling may also be investigated. We aim to apply some
variation of this work to guidance scenarios for first-order provers in the medium-term.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Semi-supervised classification with graph convolutional networks</article-title>
          ,
          <source>International Conference on Learning Representations</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Saxton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          , E. Grefenstette,
          <article-title>Can neural networks understand logical entailment?</article-title>
          ,
          <source>International Conference on Learning Representations</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chvalovský</surname>
          </string-name>
          ,
          <article-title>Top-down neural model for formulae</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Olšák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kaliszyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urban</surname>
          </string-name>
          ,
          <article-title>Property invariant embedding for automated reasoning</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>12073</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paliwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Loos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <article-title>Graph representations for higher-order logic and theorem proving</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>10006</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Crouse</surname>
          </string-name>
          , I. Abdelaziz,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cornelio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Forbus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fokoue</surname>
          </string-name>
          ,
          <article-title>Improving graph neural network representations of logical formulae with subgraph pooling</article-title>
          , arXiv preprint arXiv:
          <year>1911</year>
          .
          <volume>06904</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Selsam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lamm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bünz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , L. de Moura,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Dill</surname>
          </string-name>
          ,
          <article-title>Learning a SAT solver from single-bit supervision</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Aygün</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mourad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <article-title>Learning representations of logical formulae using graph neural networks</article-title>
          ,
          <source>in: Neural Information Processing Systems, Workshop on Graph Representation Learning</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Premise selection for theorem proving by deep graph embedding</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2786</fpage>
          -
          <lpage>2796</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rawson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Reger</surname>
          </string-name>
          ,
          <article-title>A neurally-guided, parallel theorem prover</article-title>
          ,
          <source>in: International Symposium on Frontiers of Combining Systems</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Pitts</surname>
          </string-name>
          ,
          <article-title>Nominal logic: A first order theory of names and binding</article-title>
          ,
          <source>in: International Symposium on Theoretical Aspects of Computer Software</source>
          , Springer,
          <year>2001</year>
          , pp.
          <fpage>219</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <article-title>How powerful are graph neural networks?</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Improved semantic representations from tree-structured long short-term memory networks</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>Densely connected convolutional networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4700</fpage>
          -
          <lpage>4708</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Graph U-Nets</article-title>
          ,
          <source>International Conference on Machine Learning</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Diehl</surname>
          </string-name>
          ,
          <article-title>Edge contraction pooling for graph neural networks</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>10990</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ioffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          ,
          <source>International Conference on Machine Learning</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Nair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Rectified linear units improve restricted boltzmann machines</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Lenssen</surname>
          </string-name>
          ,
          <article-title>Fast graph representation learning with PyTorch Geometric</article-title>
          , arXiv preprint arXiv:
          <year>1903</year>
          .
          <volume>02428</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          , et al.,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>8024</fpage>
          -
          <lpage>8035</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martens</surname>
          </string-name>
          , G. Dahl, G. Hinton,
          <article-title>On the importance of initialization and momentum in deep learning</article-title>
          ,
          <source>in: International conference on machine learning</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1139</fpage>
          -
          <lpage>1147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          ,
          <source>in: Proceedings of the thirteenth international conference on artificial intelligence and statistics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Delving deep into rectifiers: Surpassing human-level performance on imagenet classification</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1026</fpage>
          -
          <lpage>1034</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Cyclical learning rates for training neural networks</article-title>
          ,
          <source>in: 2017 IEEE Winter Conference on Applications of Computer Vision</source>
          (WACV), IEEE,
          <year>2017</year>
          , pp.
          <fpage>464</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Irving</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Alemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Eén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urban</surname>
          </string-name>
          ,
          <article-title>Deepmath-deep sequence models for premise selection</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2235</fpage>
          -
          <lpage>2243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <article-title>System description: E 1.8</article-title>
          , in:
          <source>International Conference on Logic for Programming Artificial Intelligence and Reasoning</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>743</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>