<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Axiom Selection Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qinghua Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zishi Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zihao Wang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Geof Sutclife</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Southwest Jiaotong University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Miami</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>63</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>“Large theory” problems in Automated Theorem Proving (ATP) have been defined as having many functions and predicates, and many axioms of which only a few are required for the proof of a theorem. One key to solving large theory problems is selecting a subset of the axioms that is adequate for finding a proof. The main contribution of this paper is metrics for evaluating axiom selection techniques without having to run an ATP system on the problem formed from the selected axioms and the problem's conjecture. This paper additionally presents three new axiom selection techniques. The new techniques, and the axiom selection in the Vampire ATP system, are evaluated using the metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Automated theorem proving</kwd>
        <kwd>axiom selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. Introduction
In the third step of this process, a proof indicates an adequate selection (the quality of which can
then be evaluated), and a countermodel indicates an inadequate selection. As the development
of axiom selection techniques progresses, this approach requires repeating steps 2 and 3 to
evaluate progress. Evaluation needs to be done on large corpora, making this a time consuming
process. As a consequence, to move things along faster, it is often necessary to impose a small
time limit on each ATP system run. Time limits (and other practical forms of incompleteness)
cause timeouts, and a timeout in the third step provides no information - the selection might be
inadequate, or the selection might be adequate but the reduced problem is too hard because too
many (unnecessary) axioms were selected, or the time limit is too small. The results are also
influenced by the choice of ATP system. Overall, this process for evaluating axiom selection
techniques is problematic.</p>
      <p>
        The main contribution of this paper is metrics for evaluating axiom selection techniques
without having to run an ATP system on the reduced problems. While the “proof is in the
pudding” and it is eventually necessary to evaluate by running an ATP system, the metrics
described in this paper provide a first-pass evaluation that allows axiom selection techniques
to be rapidly tested and refined. The approach has the advantage of being independent of a
chosen ATP system. This paper additionally presents three new axiom selection techniques.
The new techniques and the axiom selection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in the Vampire [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] ATP system are evaluated
using the proposed metrics.
      </p>
      <p>This paper is structured as follows: Section 2 describes the evaluation metrics. Section 3
describes a relatedness measure between formulae, and the three new axiom selection techniques
that are based on that measure. Section 4 provides evaluation results, including an initial
evaluation of the metrics themselves. Section 5 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Axiom Selection and the Evaluation Metrics</title>
      <p>
        The axiom selection evaluation metrics measure how precisely a set of selected axioms matches
a known minimally adequate set of axioms. Two types of axiom selection techniques are
considered:
• Ranking techniques, which take the problem’s axioms as input, rank them according to
how likely they are judged to be necessary for a proof, and select the axioms that are ranked
above some cut-of criterion. This technique is used in, e.g., Isabelle’s Sledgehammer [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
• Projection techniques, which take the problem’s axioms as input and directly return a
selection of axioms. (Ranking is thus a special case of projection.) This technique is used
in, e.g., Vampire.
      </p>
      <sec id="sec-2-1">
        <title>2.1. The MPTP2078 Problem Corpus</title>
        <p>
          In this work the MPTP2078 corpus1, based on the Mizar Mathematical Library [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], was used for
development. The MPTP2078 has two versions of each of its 2078 problems: the bushy (small)
versions that contain only the Mizar axioms that a knowledgeable user selected for finding a
proof of the conjecture, and the chainy versions that contain all the axioms that precede the
conjecture in the Mizar library order. The bushy problems have between 10 and 67 axioms,
while the chainy versions have between 10 and 4563 axioms.
        </p>
        <p>
          In order to extract minimally adequate sets of axioms for each problem, Vampire and E [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
were run on the problems, using the StarExec [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] Miami2 cluster with a 300s CPU time limit.
The computers have an octa-core Intel(R) Xeon(R) E5-2667 3.20GHz CPU, 128GB memory, and
run the CentOS Linux release 7.4.1708 operating system. This produced proofs for 1486 of the
bushy problems (1474 by Vampire and 1263 by E) and 1345 of the chainy problems (1333 by
Vampire and 815 by E). For each proof found, the axioms used in the proof were extracted as
an minimally adequate set of axioms, and a new problem was formed from that minimally
adequate set with the problem’s conjecture. These new problems were dubbed the pruney
problems. Additionally, in testing the new axiom selection techniques described in Section 3,
some further diferent minimally adequate sets were found and further pruney problems were
created. This resulted in a further 65 pruney problems for the bushy problems, and a further 24
pruney problems for the chainy problems. For some problems multiple minimally adequate sets
of axioms were found, resulting in a total of 1829 pruney problems for the 1551 bushy problems,
and a total of 3093 pruney problems for the 1369 chainy problems.
        </p>
        <p>The pruney problems provide minimally adequate sets of axioms against which selected
sets of axioms can be compared. The smallest fraction of axioms in a minimally adequate set
ranges from 0.01 to 1.0 for the bushy problems, and from 0.0002 to 0.36 for the chainy problems.
The average fractions are 0.29 and 0.01, respectively. These fractions show that precise axiom
selection could significantly reduce the number of axioms that need to be used, which in turn
normally significantly reduces the search space of an ATP system. As might be expected, the
numbers are more extreme for the larger chainy problems.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Metrics</title>
        <p>Define the following basic measures:
• The N umber of Axioms in a Problem: NAxP.
• The N umber of axioms Selected: NSel.
• The N umber of axioms U sed in a Proof, i.e., the cardinality of a minimally adequate set:</p>
        <p>NUiP. NUiP is 0 if no axioms need to be used, i.e., the conjecture is a theorem.
• In a ranked list of axioms, the number of axioms down to the lowest ranked axiom in a
minimally adequate set, i.e., the N umber of axioms N eeded from the Ranking: NNRa.</p>
        <p>The axiom selection evaluation metrics are listed below. In all cases the range is [0.0, 1.0].
Precision. If the axiom selection technique selects an adequate set of axioms, i.e., a superset
of one or more of the known minimally adequate sets of axioms, then:
• If the minimum NUiP = 0, and NSel= 0, then 1.00
• Else the maximum NUiP/NSel
If the selection technique selects an inadequate set then the precision is 0.00. Larger values are
better. The intuition here is of a probability that using the selection will result in a proof (which
is 0.00 if an inadequate set is selected).
Selectivity. NSel/NAxP. This measures the fraction of axioms selected. 1.00 results from
selecting all the axioms - the base case. Smaller values are better, provided an adequate set of
axioms is selected.</p>
        <p>Ranking precision. (Applicable to only ranking techniques.) If the axiom selection technique
selects an adequate set of axioms, then:
• If the minimum NUiP = 0, and NSel= 0, then 1.00
• Else the maximum NNRa/NSel
If the technique selects an inadequate set, then 0.00. This measures how precisely the technique
chooses the “best ones” from the ranked list of axioms. Larger values are better.
Ranking density. (Applicable to only ranking techniques.) If NUiP = NNRa = 0, then 1.00,
else NUiP/NNRa. This measures the quality of the ranking - axioms in a minimally adequate
set should be ranked highly, which in turn allows a smaller number to be selected if the ranking
precision is good. Larger values are better.</p>
        <p>Average precision/selectivity/ranking precision/ranking density. For a set of problems,
the average over the problems.</p>
        <p>Adequacy. For a set of problems, the fraction of problems for which the axiom selection
technique selects an adequate set of axioms. Larger values are better.</p>
        <p>Adequate precision/selectivity/ranking precision/ranking density. For a set of
problems, the average over the problems for which the axiom selection technique selects an adequate
set of axioms.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our Axiom Selection Techniques</title>
      <p>The motivation for developing these metrics was based on a need to evaluate new axiom
selection techniques being developed by the first three authors. These techniques are described
in this section, and their performance according to the metrics is evaluated in Section 4. It turns
out that a very simple technique performs surprisingly well on the MPTP2078 corpus.</p>
      <p>All of the new techniques are based on how strongly two formulae are related (as is the case in
many axiom selection techniques). In this work a novel measure of relatedness is used, which is
described in Section 3.1. The individual new techniques are then described in Sections 3.2 to 3.4.</p>
      <sec id="sec-3-1">
        <title>3.1. Formula Dissimilarity and Similarity</title>
        <p>The relatedness between two formulae is computed first as a dissimilarity between the two
formulae, which is also later converted to a similarity.</p>
        <p>
          The dissimilarity between two terms or atoms is an extended version of the Hutchinson
distance [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. For two terms or atoms ∆ 1 and ∆ 2, their least general generalization ∆ =
(∆ 1, ∆ 2), if it exists, is a term or atom ∆ such that there are substitutions  1 and  2,
∆  1 = ∆ 1 and ∆  2 = ∆ 2, and there is no term or atom ∆ ′ and substitutions ,  1,  2 such that
 is not just a renaming substitution, ∆  = ∆ ′, ∆ ′ 1 = ∆ 1, and ∆ ′ 2 = ∆ 2. If (∆ 1, ∆ 2)
does not exist, e.g., neither ∆ 1 nor ∆ 2 are variables and their principal symbols are diferent,
then the dissimilarity (∆ 1, ∆ 2) = ∞. Otherwise:
        </p>
        <p>Divide   into two parts,   and   :
  = {,1 ↦→ ,1, . . . , , ↦→ , }
  = {,1 ↦→ ,1, . . . , , ↦→ , }
where , and , are the substituted variables, , are substituting variables, and , are
substituting functional terms. Let
•  be a weight for variables (currently set to 13),
•  be a weight function for non-variable symbols (currently set to 2 for all symbols),
•  (, ∆) be the set of occurrences of the variable  in ∆ ,
•  (, ∆) be the depth of a particular occurrence of the variable  in ∆ , e.g.,  (, (,  ()) =
1 for the first occurrence and 2 for the second occurrence,
• (, ∆) be  × (| (, ∆) | + ∑︀∈ (,Δ)  (, ∆)) , i.e., the weighted sum of the
number of occurrences of  in ∆ and the depths of the occurrences of  in ∆ , e.g.,
(, (,  ()) =  × (2 + (1 + 2)) = 5,
•  (∆) be the sum of the weights (using  ) of the non-variable symbols occurring in ∆ ,
• () be a continuous increasing function R ↦→ R such that (0) = 0 and (1 + 2) ≤
(1) + (2) for 1, 2 ≥ 0 (currently set to ( + 1)).</p>
        <p>Then,  and  are functions that map   and   to real numbers:</p>
        <p>( ) = ∑︁ ((, , ∆ ) − (, , ∆))
=1

 (  ) = ∑︁(| (, , ∆) | ×  (, ))</p>
        <p>=1
( ) measures the diference between the usage of substituting variables in ∆  and the usage
of substituted variables in ∆ , over the substitutions made in ∆ by  .  ( ) measures the total
function symbol weight of the substituting terms, over the substitutions made in ∆ by   . The
dissimilarity between two terms or atoms ∆ 1 and ∆ 2 is then:
(∆ 1, ∆ 2) =
√︁</p>
        <p>[( 1) + ( 2)]2 + [ ( 1 ) +  ( 2 )]2
The dissimilarity measures the combined substitution “efort” required to convert the least
general generalization to those terms or atoms.</p>
        <p>3The values of 1 and 2 for  and  were adopted from their use in E, place greater emphasis on non-variable
substitutions, and are small enough to avoid large dissimilarity values. The use of ( + 1) for () below was
motivated by it’s common adoption as a continuous increasing function, to usefully dampen the impact of variable
substututions on dissimilarity values.
(4)
(5)
(6)</p>
        <p>Let Φ 1 and Φ 2 be two formulae, containing the atoms {∆ 1,1, . . . , ∆ 1,} and {∆ 2,1, . . . , ∆ 2,}
respectively. Let ̸=∞ be the set of pairs (∆ 1,, ∆ 2, ) for which (∆ 1,, ∆ 2, ) ̸= ∞. If
̸=∞ = ∅, i.e., all pairs of atoms in Φ 1 and Φ 2 are infinitely dissimilar, then the dissimilarity
(Φ 1, Φ 2) = ∞. Otherwise:
(Φ 1, Φ 2) =
∑︀
(Δ1,,Δ2,)∉=∞ (∆ 1,, ∆ 2, )
|̸=∞|
×
 × 
|̸=∞|
The first term is the average dissimilarity between pairs of atoms that are not infinitely dissimilar.
The second term penalizes the first by the excess of atom pairs whose dissimilarity is infinite.
Intuitively, the dissimilarity between two formulae measures the extent to which inference
between them might not be possible by virtue of the dissimilarity of their atoms.</p>
        <p>For two formulae Φ 1 and Φ 2, their similarity (Φ 1, Φ 2) is defined in the context of a set
of formulae ℱ , Φ 1, Φ 2 ∈ ℱ :</p>
        <p>(ℱ ) = ,∈ℱ ((,  ) ̸= ∞)
(Φ 1, Φ 2, ℱ ) = max(0, (ℱ ) − (Φ 1, Φ 2))
If the dissimilarity is 0, then the similarity is the largest dissimilarity that is not ∞. If the
dissimilarity is greater than 0 and not ∞, then the similarity is the diference between the
largest dissimilarity that is not ∞ and the dissimilarity. If the dissimilarity is ∞ or is equal to
the largest dissimilarity that is not ∞, then the similarity is 0.
3.2. Q∞
This axiom selection technique takes a problem consisting of a conjecture  and axioms ,
and selects all axioms Φ ∈  such that (, Φ) ̸= ∞. This simply means that each axiom
contains at least one atom whose predicate symbol matches that of an atom in the conjecture.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Spectral Clustering</title>
        <p>
          This axiom selection technique views the formulae of a problem as nodes of a complete
undirected graph, with the similarities between the formulae as the edge weights. Spectral clustering
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is then used to cluster together similar nodes, and the axioms in the cluster containing the
conjecture are selected. This process requires three steps: determining the number of clusters,
choosing the initial centroids for the clusters, and applying k-means clustering.
        </p>
        <p>
          The initial centroids for the clustering are extracted from a feature matrix consisting of the
eigenvectors of the normalized graph Laplacian matrix [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], computed from the graph’s
adjacency matrix with edges weighted by similarity. The rows of the feature matrix corresponding
to the conjecture node and to the  − 1 (see below for how  is computed) axiom nodes with
highest degree centrality are used as the initial centroids for the clustering.
        </p>
        <p>To determine the number of clusters for a problem in the MPTP2078 corpus, the three step
process was applied to each problem with the number of clusters ranging from two to half
the number of formulae in the problem. The precision was computed each time, and the best
number of clusters recorded. A median regression line was fitted to this data (separately for the
bushy and chainy problems), as shown in Figure 1. The upper two plots are for a problem set of
325 smaller problems (see Section 4 for details of the testing problem sets), and the lower plot
is for the 1551 bushy problems. While the fit of the regression lines is awful, they were used
to set the number of clusters  for problems in the corpus, based on the number of formulae
in each problem. For the 1551 bushy problems it turns out that two clusters is always optimal,
regardless of the number of axioms. It was not possible to compute a regression line for the
1369 chainy problems with the time and computing resources available.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Greedy Tree plus Nearest Neighbours</title>
        <p>This axiom selection technique views the formulae of a problem as nodes of a complete
undirected graph, with the similarities between the formulae as the edge weights. Figure 2 provides
a simple illustrative example, in which C is the conjecture and As are the axioms (the figure
omits edges that do not afect the example). Starting at the conjecture, a greedy tree is built by
iteratively visiting all the axioms most similar to the current leaves of the tree, until axioms that
have no similarity (or equivalently, infinite dissimilarity) to the conjecture are reached. These
infinitely dissimilar nodes are ignored. In Figure 2 the thicker edges are those that are followed,
and the axioms with thicker circles are those at which the tree growth stops because they have
no similarity to the conjecture. At that stage the light grey axioms are in the tree. As a final
step, all the unvisited axioms most similar to the axioms in the greedy tree are added to the
tree. In Figure 2, that adds the dark grey axioms to the tree. The axioms in the tree are selected
those are the non-white axiom in the Figure 2..</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Results</title>
      <p>The new axiom selection techniques and the axiom selection in the Vampire ATP system were
evaluated using the metrics. No particular efort was moade to optimize the implementation of
the new selection techniques; rather, the evaluation aimed to demonstrate the usefulness of
the metrics, and as such no CPU times are presented below. Two problem sets from each of
the bushy and chainy parts of the MPTP2078 corpus were used. The first set was selected by
taking the 325 chainy problems with less than 500 axioms and for which minimally adequate
axiom sets of axioms (pruney problems) are known, and taking the corresponding 325 bushy
problems. The second set was the 1551 bushy problems and 1369 chainy problems for which
minimally adequate axiom sets (pruney problems) are known. The smaller set was useful for
initial quick testing, and also necessary because some of the new techniques could not analyse
the chainy problems set with the time and computing resources available.</p>
      <p>Tables 1 and 2 show the results, including a row for the base case in which all axioms are
selected. The columns are the Precision (Prcn), Selectivity (Sely), Ranking precision (RPrn),
Ranking density (RDen), Adequacy (Adeq), and the Adequate precision/selectivity/ranking
precision/ranking density.</p>
      <p>For the smaller set of 325 bushy problems Q∞ performs the best, with the highest precision
and lowest selectivity. It also has a 74% adequacy. Moving up to the 325 chainy problems
Q∞ and Vampire have the highest precision, with Q∞ having slightly higher selectivity, and
Vampire better adequacy. For the 1551 bushy problems Q∞ is again the top performer, now in
terms of precision, selectivity, and adequacy. Spectral clustering also performs reasonably well,
with a better precision than Vampire, and with the cluster containing the conjecture (recall
from Section 3.3 that two clusters is optimal, and hence used here) containing 69% of the axioms
on average. Finally, for the 1369 chainy problems Q∞ and Vampire again have the highest
precision, with Vampire having the best selectivity and adequacy. Overall, Q∞ performs the
best on the bushy problems, with Vampire doing slightly better on the chainy problems. It is
surprising that the very simple Q∞ technique performs as well as it does, when compared to
the mature (and more complex) Vampire selection technique.</p>
      <p>The adequacy numbers have to be viewed carefully - it is easy to get an optimal adequacy
simply by selecting all axioms and thus having pessimal selectivity. It is interesting to contrast
the average and adequate precision values. The average precision for the base case is better
than that of some of the selection techniques because of the zero precision assigned when an
inadequate set of axioms is selected. An adequate precision close to the base case, as for, e.g.,
Spectral Clustering and Greedy Tree+NN on both sets of 325 problems, indicates that doing the
axiom selection is probably wasted efort.</p>
      <p>There is a correlation, but not a particularly strong correlation, between the number of axioms
in a problem and the precision of the Vampire and Q∞ axiom selections, e.g., around 40% for
the chainy problems. This suggests that the axiom selection techniques might scale well to
larger problems.</p>
      <p>Overall, the precision values are disappointingly low, especially for the chainy problems. For
Q∞ the ranking density is similarly low, so that even with perfect ranking precision the raw
precision cannot be very good. Figure 3 shows the distributions of the precision values for the
chainy problems. In each plot the left set of points are for those problems where an adequate set
of axioms was selected, and the right set of point are for those problems where an inadequate
set of axioms was selected. For the majority of problems the precision in less that 0.1, i.e., less
than 10% of selected axioms are used in a proof.</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluating the Metrics</title>
        <p>As was noted in the introduction to this paper, while metrics such as those presented in this
paper are useful for quick evaluation of axiom selection techniques, the “proof is in the pudding”.
The quality of the metrics needs to be established by comparing them with eventual ATP system
performance. An initial evaluation of the precision metric has been performed by using the
Vampire and Q∞ axiom selections, and looking at the performance of E on the resultant reduced
problems. This was done for the 1551 bushy problems and 1369 chainy problems for which
there are known minimally adequate axiom sets, so that the precision could be computed. The
testing was done with a 300s CPU time limit on computers with a dual-core Intel(R) Xeon(R)
E5-2609 2.50GHz CPU, 8GB memory, and running the CentOS Linux release 7.6.1810 operating
system. The results are shown in Table 3.</p>
        <p>In all four cases the precision of the unsolved problems is much lower than for the solved
problems - the precision metric aligns correctly with the solvability of the reduced problems.
The numbers of problems with precision 0.00 and 1.00 also align with the numbers of solutions,
with higher numbers of solved problems with precision 1.00, and higher numbers of unsolved
problems with precision 0.00. Overall, these results indicate that the precision metric does
correctly evaluate the quality of the axiom selection.</p>
        <p>The attentive reader might have noticed that there are some solved problems with precision
0.00, which is weird because the precision should be 0.00 only if the selected axioms are not an
adequate set (Section 2.2). This indicates that these experiments revealed some new minimally
adequate sets of axioms for some problems in the MPTP2078 corpus, and thus that some further
1551 Bushy
problems
# Problems
Avg Prcn
# Prcn 0.00
# Prcn 1.00
1369 Chainy
problems
# Problems
Avg Prcn
# Prcn 0.00
# Prcn 1.00
pruney problems need to be created.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The main contribution of this paper is metrics for evaluating axiom selection techniques without
having to run an ATP system on the reduced problems. Three new axiom selection techniques
have also been presented. Results from evaluating the new techniques and the axiom selection
in the Vampire ATP system have been presented.</p>
      <p>
        Future work includes a more comprehensive evaluation of the metrics, to confirm that the
metrics do provide a meaningful evaluation of the axiom selection techniques. It will also be
interesting to apply the metrics to the axiom selection techniques found in other ATP systems
such as E, GKC [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], iProver [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], Leo-III [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], and MaLARea.
      </p>
      <p>
        The results show that there is lots of room for further research and improvement in axiom
selection. We are currently investigating the use of network propagation [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], adopted from
the analysis of protein-protein interaction networks [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], for analysing the formula similarity
graphs. One interesting possibility is pipelining axiom selection tools so that the first in the
pipeline receives the original problem, it’s output becomes the input to the second in the pipeline,
and so on. Finally, the three new axiom selection techniques all build on the dissimilarity and
similarity measures described in Section 3.1. Further development and improvement of those
measures, or their complete replacement, would have consequential impacts downstream.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sutclife</surname>
          </string-name>
          , The CADE-27
          <source>Automated Theorem Proving System Competition - CASC-27, AI</source>
          Communications
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>373</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Alama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heskes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Külwein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tsivtsivadze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urban</surname>
          </string-name>
          ,
          <article-title>Premise Selection for Mathematics by Corpus Analysis and Kernel Methods</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>52</volume>
          (
          <year>2014</year>
          )
          <fpage>191</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kaliszyk</surname>
          </string-name>
          , J. Urban,
          <source>MizAR 40 for Mizar 40, Journal of Automated Reasoning</source>
          <volume>55</volume>
          (
          <year>2015</year>
          )
          <fpage>245</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Brown</surname>
          </string-name>
          , T. Gauthier, C. Kaliszyk, G. Sutclife,
          <string-name>
            <surname>J. Urban,</surname>
          </string-name>
          <article-title>GRUNGE: A Grand Unified ATP Challenge</article-title>
          , in: P. Fontaine (Ed.),
          <source>Proceedings of the 27th International Conference on Automated Deduction, number 11716 in Lecture Notes in Computer Science</source>
          , SpringerVerlag,
          <year>2019</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Puzis</surname>
          </string-name>
          , G. Sutclife,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Using Relevance Measures for Axiom and Lemma Selection in Automated Theorem Proving</article-title>
          , in: V.
          <string-name>
            <surname>Barr</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          Markov (Eds.),
          <source>Proceedings of the 17th International FLAIRS Conference</source>
          , AAAI Press,
          <year>2004</year>
          , p. Submitted.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sutclife</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Puzis</surname>
          </string-name>
          , SRASS
          <article-title>- a Semantic Relevance Axiom Selection System</article-title>
          , in: F. Pfenning (Ed.),
          <source>Proceedings of the 21st International Conference on Automated Deduction, number 4603 in Lecture Notes in Artificial Intelligence</source>
          , Springer-Verlag,
          <year>2007</year>
          , pp.
          <fpage>295</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Meng</surname>
          </string-name>
          , L. Paulson,
          <article-title>Lightweight Relevance Filtering for Machine-generated Resolution Problems</article-title>
          ,
          <source>Journal of Applied Logic</source>
          <volume>7</volume>
          (
          <year>2009</year>
          )
          <fpage>41</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Külwein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koepke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schröder</surname>
          </string-name>
          ,
          <article-title>Premise Selection in the Naproche System</article-title>
          , in: J.
          <string-name>
            <surname>Giesl</surname>
          </string-name>
          , R. Haehnle (Eds.),
          <source>Proceedings of the 5th International Joint Conference on Automated Reasoning, number 6173 in Lecture Notes in Artificial Intelligence</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>434</fpage>
          -
          <lpage>440</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hoder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Voronkov</surname>
          </string-name>
          ,
          <article-title>Sine Qua Non for Large Theory Reasoning</article-title>
          , in: V. SofronieStokkermans, N. Bjoerner (Eds.),
          <source>Proceedings of the 23rd International Conference on Automated Deduction, number 6803 in Lecture Notes in Artificial Intelligence, SpringerVerlag</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Külwein</surname>
          </string-name>
          , T. van
          <string-name>
            <surname>Laarhoven</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Tsivtsivadze</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Urban</surname>
          </string-name>
          , T. Heskes,
          <article-title>Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics</article-title>
          , in: B.
          <string-name>
            <surname>Gramlich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          Sattler (Eds.),
          <source>Proceedings of the 6th International Joint Conference on Automated Reasoning, number 7364 in Lecture Notes in Artificial Intelligence</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>378</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gauthier</surname>
          </string-name>
          , C. Kaliszyk,
          <article-title>Premise Selection and External Provers for HOL4</article-title>
          , in: X.
          <string-name>
            <surname>Leroy</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Tiu (Eds.),
          <source>Proceedings of the 4th ACM SIGPLAN Conference on Certified Programs and Proofs</source>
          , ACM Press,
          <year>2015</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Piotrowski</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Urban,</surname>
          </string-name>
          <article-title>ATPboost: Learning Premise Selection in Binary Setting with ATP Feedback</article-title>
          , in: D.
          <string-name>
            <surname>Galmiche</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Schulz</surname>
          </string-name>
          , R. Sebastiani (Eds.),
          <source>Proceedings of the 9th International Joint Conference on Automated Reasoning, number 10900 in Lecture Notes in Computer Science</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>566</fpage>
          -
          <lpage>574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Külwein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blanchette</surname>
          </string-name>
          ,
          <article-title>A Survey of Axiom Selection as a Machine Learning Problem</article-title>
          , in: S. Geschke (Ed.),
          <article-title>Computability and Metamathematics : Festschrift Celebrating the 60th birthdays of Peter Koepke</article-title>
          and
          <string-name>
            <given-names>Philip</given-names>
            <surname>Welch</surname>
          </string-name>
          , College Publications,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Urban</surname>
          </string-name>
          , G. Sutclife,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pudlak</surname>
          </string-name>
          , J. Vyskocil, MaLARea SG1:
          <article-title>Machine Learner for Automated Reasoning with Semantic Guidance</article-title>
          , in: P.
          <string-name>
            <surname>Baumgartner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Armando</surname>
          </string-name>
          , G. Dowek (Eds.),
          <source>Proceedings of the 4th International Joint Conference on Automated Reasoning, number 5195 in Lecture Notes in Artificial Intelligence</source>
          , Springer-Verlag,
          <year>2008</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kovacs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Voronkov</surname>
          </string-name>
          ,
          <string-name>
            <surname>First-Order Theorem</surname>
          </string-name>
          Proving and Vampire, in: N.
          <string-name>
            <surname>Sharygina</surname>
          </string-name>
          , H. Veith (Eds.),
          <source>Proceedings of the 25th International Conference on Computer Aided Verification, number 8044 in Lecture Notes in Artificial Intelligence</source>
          , Springer-Verlag,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Paulson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blanchette</surname>
          </string-name>
          ,
          <article-title>Three Years of Experience with Sledgehammer, a Practical Link between Automatic and Interactive Theorem Provers</article-title>
          , in: G. Sutclife,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ternovska</surname>
          </string-name>
          , S. Schulz (Eds.),
          <source>Proceedings of the 8th International Workshop on the Implementation of Logics</source>
          , number 2 in EPiC Series in Computing,
          <source>EasyChair Publications</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rudnicki</surname>
          </string-name>
          ,
          <article-title>An Overview of the Mizar Project</article-title>
          ,
          <source>in: Proceedings of the 1992 Workshop on Types for Proofs and Programs</source>
          ,
          <year>1992</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>332</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cruanes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vukmirovic</surname>
          </string-name>
          , Faster, Higher,
          <source>Stronger: E 2</source>
          .3, in: P. Fontaine (Ed.),
          <source>Proceedings of the 27th International Conference on Automated Deduction, number 11716 in Lecture Notes in Computer Science</source>
          , Springer-Verlag,
          <year>2019</year>
          , pp.
          <fpage>495</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stump</surname>
          </string-name>
          , G. Sutclife, C. Tinelli,
          <article-title>StarExec: a Cross-Community Infrastructure for Logic Solving</article-title>
          , in: S. Demri,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kapur</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          Weidenbach (Eds.),
          <source>Proceedings of the 7th International Joint Conference on Automated Reasoning, number 8562 in Lecture Notes in Artificial Intelligence</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>367</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          , Metrics on Terms and Clauses, in: M. van Someren, G. Widmer (Eds.),
          <source>Proceedings of the 9th European Conference on Machine Learning, number 1224 in Lecture Notes in Artificial Intelligence</source>
          , Springer-Verlag,
          <year>1997</year>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>U.</given-names>
            <surname>von Luxburg</surname>
          </string-name>
          ,
          <source>A Tutorial on Spectral Clustering, Statistics and Computing</source>
          <volume>17</volume>
          (
          <year>2007</year>
          )
          <fpage>395</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Tammet</surname>
          </string-name>
          ,
          <article-title>GKC: a Reasoning System for Large Knowledge Bases</article-title>
          , in: P. Fontaine (Ed.),
          <source>Proceedings of the 27th International Conference on Automated Deduction, number 11716 in Lecture Notes in Computer Science</source>
          , Springer-Verlag,
          <year>2019</year>
          , pp.
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>K.</given-names>
            <surname>Korovin</surname>
          </string-name>
          , iProver
          <article-title>- An Instantiation-Based Theorem Prover for First-order Logic (System Description)</article-title>
          , in: P.
          <string-name>
            <surname>Baumgartner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Armando</surname>
          </string-name>
          , G. Dowek (Eds.),
          <source>Proceedings of the 4th International Joint Conference on Automated Reasoning, number 5195 in Lecture Notes in Artificial Intelligence</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>292</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Steen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Benzmüller</surname>
          </string-name>
          , The
          <string-name>
            <surname>Higher-Order Prover</surname>
          </string-name>
          Leo-III, in: D.
          <string-name>
            <surname>Galmiche</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Schulz</surname>
          </string-name>
          , R. Sebastiani (Eds.),
          <source>Proceedings of the 8th International Joint Conference on Automated Reasoning, number 10900 in Lecture Notes in Artificial Intelligence</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>108</fpage>
          -
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stojmirovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Information Flow in Interaction Networks</article-title>
          ,
          <source>Journal of Computationl Biology</source>
          <volume>14</volume>
          (
          <year>2007</year>
          )
          <fpage>1115</fpage>
          -
          <lpage>1143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Devkota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Danzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lemmon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bixby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wuchty</surname>
          </string-name>
          ,
          <article-title>Computational Identification of Kinases that Control Axon Growth in Mouse</article-title>
          , SLAS Discovery (
          <year>2020</year>
          ) To appear.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>