=Paper=
{{Paper
|id=Vol-2752/paper5
|storemode=property
|title=Evaluation of Axiom Selection Techniques
|pdfUrl=https://ceur-ws.org/Vol-2752/paper5.pdf
|volume=Vol-2752
|authors=Qinghua Liu,Zishi Wu,Zihao Wang,Geoff Sutcliffe
|dblpUrl=https://dblp.org/rec/conf/cade/LiuWWS20
}}
==Evaluation of Axiom Selection Techniques==
Qinghua Liu (Southwest Jiaotong University, China), Zishi Wu, Zihao Wang, and Geoff Sutcliffe (University of Miami, USA)

Abstract: “Large theory” problems in Automated Theorem Proving (ATP) have been defined as having many functions and predicates, and many axioms of which only a few are required for the proof of a theorem. One key to solving large theory problems is selecting a subset of the axioms that is adequate for finding a proof. The main contribution of this paper is metrics for evaluating axiom selection techniques without having to run an ATP system on the problem formed from the selected axioms and the problem’s conjecture. This paper additionally presents three new axiom selection techniques. The new techniques, and the axiom selection in the Vampire ATP system, are evaluated using the metrics.

Keywords: Automated theorem proving, axiom selection

1. Introduction

“Large theory” problems in Automated Theorem Proving (ATP) have been defined [1] as having many functions and predicates, and many axioms of which only a few are required for the proof of a theorem. Large theory problems are often found in corpora that contain very many problems, e.g., the MPTP2078 corpus [2], the Mizar 40 corpus [3], and the GRUNGE corpus [4]. Large theory problems present challenges to ATP systems, mainly because of the large search space generated by the large number of axioms. Thus, one key to solving large theory problems is selecting a subset of the axioms that is adequate for finding a proof. There has been significant and successful research on this topic, e.g., [5, 6, 7, 8, 9, 10, 2, 11, 12]. Many techniques are based on the occurrences of symbols in the formulae, e.g., the SInE method [9] and its derivatives. The fact that large theory problems often occur in large corpora makes the application of machine learning techniques [13] viable, e.g., as in the MaLARea system [14].
Evaluation of axiom selection techniques is typically done by:
1. Choosing a corpus of large theory problems that are known to be theorems.
2. For each problem in the corpus, selecting a subset of the problem’s axioms.
3. Running an ATP system on the reduced problem formed from the selected axioms and the problem’s conjecture.

PAAR 2020: Seventh Workshop on Practical Aspects of Automated Reasoning, June 29–30, 2020, Paris, France (virtual). qhliu678@gmail.com (Q. Liu); ry04ert39@miami.edu (Z. Wu); zxw526@miami.edu (Z. Wang); geoff@cs.miami.edu (G. Sutcliffe). © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, pages 63–75.

In the third step of this process, a proof indicates an adequate selection (the quality of which can then be evaluated), and a countermodel indicates an inadequate selection. As the development of axiom selection techniques progresses, this approach requires repeating steps 2 and 3 to evaluate progress. Evaluation needs to be done on large corpora, making this a time-consuming process. As a consequence, to move things along faster, it is often necessary to impose a small time limit on each ATP system run. Time limits (and other practical forms of incompleteness) cause timeouts, and a timeout in the third step provides no information: the selection might be inadequate, or the selection might be adequate but the reduced problem is too hard because too many (unnecessary) axioms were selected, or the time limit is too small. The results are also influenced by the choice of ATP system.
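The trichotomy of outcomes in step 3 can be captured in a small sketch; the outcome labels and function name are ours, not tied to any particular ATP system:

```python
# Sketch of interpreting an ATP run in step 3 of the evaluation process.
# The outcome labels ("proof", "countermodel", anything else) are ours.

def interpret_run(result: str) -> str:
    """Map an ATP system's outcome on a reduced problem to what it
    tells us about the axiom selection."""
    if result == "proof":
        return "adequate"        # the selected axioms suffice for a proof
    if result == "countermodel":
        return "inadequate"      # no proof exists from the selection
    # A timeout is uninformative: the selection may be inadequate, or
    # adequate but too large, or the time limit may simply be too small.
    return "unknown"
```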
Overall, this process for evaluating axiom selection techniques is problematic. The main contribution of this paper is metrics for evaluating axiom selection techniques without having to run an ATP system on the reduced problems. While the “proof is in the pudding” and it is eventually necessary to evaluate by running an ATP system, the metrics described in this paper provide a first-pass evaluation that allows axiom selection techniques to be rapidly tested and refined. The approach has the advantage of being independent of a chosen ATP system. This paper additionally presents three new axiom selection techniques. The new techniques and the axiom selection [9] in the Vampire [15] ATP system are evaluated using the proposed metrics.

This paper is structured as follows: Section 2 describes the evaluation metrics. Section 3 describes a relatedness measure between formulae, and the three new axiom selection techniques that are based on that measure. Section 4 provides evaluation results, including an initial evaluation of the metrics themselves. Section 5 concludes.

2. Axiom Selection and the Evaluation Metrics

The axiom selection evaluation metrics measure how precisely a set of selected axioms matches a known minimally adequate set of axioms. Two types of axiom selection techniques are considered:
• Ranking techniques, which take the problem’s axioms as input, rank them according to how likely they are judged to be necessary for a proof, and select the axioms that are ranked above some cut-off criterion. This approach is used in, e.g., Isabelle’s Sledgehammer [16].
• Projection techniques, which take the problem’s axioms as input and directly return a selection of axioms. (Ranking is thus a special case of projection.) This approach is used in, e.g., Vampire.

2.1. The MPTP2078 Problem Corpus

In this work the MPTP2078 corpus¹, based on the Mizar Mathematical Library [17], was used for development.
The MPTP2078 has two versions of each of its 2078 problems: the bushy (small) versions, which contain only the Mizar axioms that a knowledgeable user selected for finding a proof of the conjecture, and the chainy versions, which contain all the axioms that precede the conjecture in the Mizar library order. The bushy problems have between 10 and 67 axioms, while the chainy versions have between 10 and 4563 axioms.

¹ https://github.com/JUrban/MPTP2078

In order to extract minimally adequate sets of axioms for each problem, Vampire and E [18] were run on the problems, using the StarExec [19] Miami² cluster with a 300s CPU time limit. The computers have an octa-core Intel(R) Xeon(R) E5-2667 3.20GHz CPU, 128GB memory, and run the CentOS Linux release 7.4.1708 operating system. This produced proofs for 1486 of the bushy problems (1474 by Vampire and 1263 by E) and 1345 of the chainy problems (1333 by Vampire and 815 by E). For each proof found, the axioms used in the proof were extracted as a minimally adequate set of axioms, and a new problem was formed from that minimally adequate set with the problem’s conjecture. These new problems were dubbed the pruney problems. Additionally, in testing the new axiom selection techniques described in Section 3, some further different minimally adequate sets were found, and further pruney problems were created. This resulted in a further 65 pruney problems for the bushy problems, and a further 24 pruney problems for the chainy problems. For some problems multiple minimally adequate sets of axioms were found, resulting in a total of 1829 pruney problems for the 1551 bushy problems, and a total of 3093 pruney problems for the 1369 chainy problems. The pruney problems provide minimally adequate sets of axioms against which selected sets of axioms can be compared.
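Such a comparison amounts to a superset test: a selection is adequate if it contains at least one known minimally adequate set. A minimal sketch, using plain Python sets with axiom names of our own invention:

```python
# Sketch of checking a selection against known minimally adequate sets,
# as the pruney problems make possible. Axiom names here are illustrative.

def is_adequate(selected: set, minimal_sets: list) -> bool:
    """A selection is adequate if it is a superset of at least one
    known minimally adequate set of axioms."""
    return any(m <= selected for m in minimal_sets)

# Two hypothetical minimally adequate sets for one problem.
minimal_sets = [{"a1", "a2"}, {"a1", "a3", "a4"}]
print(is_adequate({"a1", "a2", "a5"}, minimal_sets))  # True
print(is_adequate({"a2", "a3"}, minimal_sets))        # False
```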
The smallest fraction of axioms in a minimally adequate set ranges from 0.01 to 1.0 for the bushy problems, and from 0.0002 to 0.36 for the chainy problems. The average fractions are 0.29 and 0.01, respectively. These fractions show that precise axiom selection could significantly reduce the number of axioms that need to be used, which in turn normally significantly reduces the search space of an ATP system. As might be expected, the numbers are more extreme for the larger chainy problems.

² https://starexec.ccs.miami.edu

2.2. The Metrics

Define the following basic measures:
• The Number of Axioms in a Problem: NAxP.
• The Number of axioms Selected: NSel.
• The Number of axioms Used in a Proof, i.e., the cardinality of a minimally adequate set: NUiP. NUiP is 0 if no axioms need to be used, i.e., the conjecture is itself a theorem.
• In a ranked list of axioms, the number of axioms down to the lowest ranked axiom in a minimally adequate set, i.e., the Number of axioms Needed from the Ranking: NNRa.

The axiom selection evaluation metrics are listed below. In all cases the range is [0.0, 1.0].

Precision. If the axiom selection technique selects an adequate set of axioms, i.e., a superset of one or more of the known minimally adequate sets of axioms, then:
• If the minimum NUiP = 0 and NSel = 0, then 1.00.
• Else the maximum NUiP/NSel.
If the selection technique selects an inadequate set then the precision is 0.00. Larger values are better. The intuition here is of a probability that using the selection will result in a proof (which is 0.00 if an inadequate set is selected).

Selectivity. NSel/NAxP. This measures the fraction of axioms selected. 1.00 results from selecting all the axioms (the base case). Smaller values are better, provided an adequate set of axioms is selected.

Ranking precision. (Applicable only to ranking techniques.)
If the axiom selection technique selects an adequate set of axioms, then:
• If the minimum NUiP = 0 and NSel = 0, then 1.00.
• Else the maximum NNRa/NSel.
If the technique selects an inadequate set, then 0.00. This measures how precisely the technique chooses the “best ones” from the ranked list of axioms. Larger values are better.

Ranking density. (Applicable only to ranking techniques.) If NUiP = NNRa = 0, then 1.00, else NUiP/NNRa. This measures the quality of the ranking: axioms in a minimally adequate set should be ranked highly, which in turn allows a smaller number to be selected if the ranking precision is good. Larger values are better.

Average precision/selectivity/ranking precision/ranking density. For a set of problems, the average over the problems.

Adequacy. For a set of problems, the fraction of problems for which the axiom selection technique selects an adequate set of axioms. Larger values are better.

Adequate precision/selectivity/ranking precision/ranking density. For a set of problems, the average over the problems for which the axiom selection technique selects an adequate set of axioms.

3. Our Axiom Selection Techniques

These metrics were motivated by the need to evaluate new axiom selection techniques being developed by the first three authors. The techniques are described in this section, and their performance according to the metrics is evaluated in Section 4. It turns out that a very simple technique performs surprisingly well on the MPTP2078 corpus. All of the new techniques are based on how strongly two formulae are related (as is the case in many axiom selection techniques). In this work a novel measure of relatedness is used, which is described in Section 3.1. The individual new techniques are then described in Sections 3.2 to 3.4.

3.1.
Formula Dissimilarity and Similarity

The relatedness between two formulae is computed first as a dissimilarity between the two formulae, which is later converted to a similarity.

The dissimilarity between two terms or atoms is an extended version of the Hutchinson distance [20]. For two terms or atoms Δ_1 and Δ_2, their least general generalization Δ = lgg(Δ_1, Δ_2), if it exists, is a term or atom Δ such that there are substitutions θ_1 and θ_2 with Δθ_1 = Δ_1 and Δθ_2 = Δ_2, and there is no term or atom Δ′ and substitutions σ, σ_1, σ_2 such that σ is not just a renaming substitution, Δσ = Δ′, Δ′σ_1 = Δ_1, and Δ′σ_2 = Δ_2. If lgg(Δ_1, Δ_2) does not exist, e.g., neither Δ_1 nor Δ_2 is a variable and their principal symbols are different, then the dissimilarity dsim(Δ_1, Δ_2) = ∞. Otherwise, divide θ_i into two parts, θ_i^v and θ_i^f:

θ_i^v = {X_{i,1} ↦ Z_{i,1}, ..., X_{i,m_i} ↦ Z_{i,m_i}}
θ_i^f = {Y_{i,1} ↦ f_{i,1}, ..., Y_{i,n_i} ↦ f_{i,n_i}}

where the X_{i,j} and Y_{i,j} are the substituted variables, the Z_{i,j} are substituting variables, and the f_{i,j} are substituting functional terms. Let
• w_v be a weight for variables (currently set to 1³),
• w_f be a weight function for non-variable symbols (currently set to 2 for all symbols),
• V(X, Δ) be the set of occurrences of the variable X in Δ,
• N(X, Δ) be the depth of a particular occurrence of the variable X in Δ, e.g., N(X, p(X, f(X))) = 1 for the first occurrence and 2 for the second occurrence,
• W_v(X, Δ) be w_v × (|V(X, Δ)| + Σ_{U ∈ V(X,Δ)} N(U, Δ)), i.e., the weighted sum of the number of occurrences of X in Δ and the depths of the occurrences of X in Δ, e.g., W_v(X, p(X, f(X))) = w_v × (2 + (1 + 2)) = 5,
• W_f(Δ) be the sum of the weights (using w_f) of the non-variable symbols occurring in Δ,
• g(x) be a continuous increasing function ℝ ↦ ℝ such that g(0) = 0 and g(x_1 + x_2) ≤ g(x_1) + g(x_2) for x_1, x_2 ≥ 0 (currently set to ln(x + 1)).
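The weighted variable measure W_v can be sketched concretely; the nested-tuple term representation below is our own assumption (p(X, f(X)) becomes ("p", "X", ("f", "X"))):

```python
# Sketch of W_v(X, Δ), assuming terms are nested tuples whose first
# element is the function/predicate symbol, and variables are strings.
# w_v = 1, as in the paper.

W_V = 1

def occurrences(var, term, depth=0):
    """Yield the depth of each occurrence of var in term."""
    if isinstance(term, str):
        if term == var:
            yield depth
        return
    for arg in term[1:]:          # term[0] is the symbol; args follow
        yield from occurrences(var, arg, depth + 1)

def w_v(var, term):
    """w_v * (number of occurrences of var + sum of their depths)."""
    depths = list(occurrences(var, term))
    return W_V * (len(depths) + sum(depths))

# The paper's example: W_v(X, p(X, f(X))) = 1 * (2 + (1 + 2)) = 5
print(w_v("X", ("p", "X", ("f", "X"))))  # 5
```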
Then, S_v and S_f are functions that map θ_i^v and θ_i^f to real numbers:

S_v(θ_i^v) = Σ_{j=1}^{m_i} g( W_v(Z_{i,j}, Δ_i) − W_v(X_{i,j}, Δ) )   (1)

S_f(θ_i^f) = Σ_{j=1}^{n_i} ( |V(Y_{i,j}, Δ)| × W_f(f_{i,j}) )   (2)

S_v(θ_i^v) measures the difference between the usage of substituting variables in Δ_i and the usage of substituted variables in Δ, over the substitutions made in Δ by θ_i^v. S_f(θ_i^f) measures the total function symbol weight of the substituting terms, over the substitutions made in Δ by θ_i^f. The dissimilarity between two terms or atoms Δ_1 and Δ_2 is then:

dsim(Δ_1, Δ_2) = √( [S_v(θ_1^v) + S_v(θ_2^v)]² + [S_f(θ_1^f) + S_f(θ_2^f)]² )   (3)

The dissimilarity measures the combined substitution “effort” required to convert the least general generalization to those terms or atoms.

³ The values of 1 and 2 for w_v and w_f were adopted from their use in E, place greater emphasis on non-variable substitutions, and are small enough to avoid large dissimilarity values. The use of ln(x + 1) for g(x) was motivated by its common adoption as a continuous increasing function, to usefully dampen the impact of variable substitutions on dissimilarity values.

Let Φ_1 and Φ_2 be two formulae, containing the atoms {Δ_{1,1}, ..., Δ_{1,n}} and {Δ_{2,1}, ..., Δ_{2,m}} respectively. Let S_{≠∞} be the set of pairs (Δ_{1,i}, Δ_{2,j}) for which dsim(Δ_{1,i}, Δ_{2,j}) ≠ ∞. If S_{≠∞} = ∅, i.e., all pairs of atoms in Φ_1 and Φ_2 are infinitely dissimilar, then the dissimilarity dsim(Φ_1, Φ_2) = ∞. Otherwise:

dsim(Φ_1, Φ_2) = ( Σ_{(Δ_{1,i},Δ_{2,j}) ∈ S_{≠∞}} dsim(Δ_{1,i}, Δ_{2,j}) / |S_{≠∞}| ) × ( (n × m) / |S_{≠∞}| )   (4)

The first term is the average dissimilarity between pairs of atoms that are not infinitely dissimilar. The second term penalizes the first by the excess of atom pairs whose dissimilarity is infinite. Intuitively, the dissimilarity between two formulae measures the extent to which inference between them might not be possible by virtue of the dissimilarity of their atoms.
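The formula-level aggregation in Equation (4) can be sketched on top of any atom-level dissimilarity; the toy atom dissimilarity below is ours, standing in for Equation (3):

```python
# Sketch of the formula-level dissimilarity (Equation 4), assuming an
# atom-level dsim is supplied as a function argument.
import math

def formula_dsim(atoms1, atoms2, atom_dsim):
    """Average finite pairwise dissimilarity, penalized by the fraction
    of atom pairs that are infinitely dissimilar."""
    pairs = [(a, b) for a in atoms1 for b in atoms2]
    finite = [atom_dsim(a, b) for a, b in pairs
              if atom_dsim(a, b) != math.inf]
    if not finite:                       # all pairs infinitely dissimilar
        return math.inf
    average = sum(finite) / len(finite)
    return average * (len(pairs) / len(finite))

# Toy atom dissimilarity, standing in for Equation (3): identical atoms
# are 0 apart, atoms sharing a first letter are 1 apart, else infinite.
def toy_dsim(a, b):
    if a == b:
        return 0.0
    return 1.0 if a[0] == b[0] else math.inf

print(formula_dsim(["p1", "p2"], ["p1", "q1"], toy_dsim))  # 1.0
```

Here two of the four atom pairs are finite (average 0.5), and the penalty factor 4/2 doubles that to 1.0.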
For two formulae Φ_1 and Φ_2, their similarity sim(Φ_1, Φ_2) is defined in the context of a set of formulae ℱ with Φ_1, Φ_2 ∈ ℱ:

maxdsim(ℱ) = max { dsim(φ_i, φ_j) : φ_i, φ_j ∈ ℱ, dsim(φ_i, φ_j) ≠ ∞ }   (5)

sim(Φ_1, Φ_2, ℱ) = max(0, maxdsim(ℱ) − dsim(Φ_1, Φ_2))   (6)

If the dissimilarity is 0, then the similarity is the largest dissimilarity that is not ∞. If the dissimilarity is greater than 0 and not ∞, then the similarity is the difference between the largest dissimilarity that is not ∞ and the dissimilarity. If the dissimilarity is ∞ or is equal to the largest dissimilarity that is not ∞, then the similarity is 0.

3.2. Q∞

This axiom selection technique takes a problem consisting of a conjecture 𝒞 and axioms 𝒜, and selects all axioms Φ ∈ 𝒜 such that dsim(𝒞, Φ) ≠ ∞. This simply means that each selected axiom contains at least one atom whose predicate symbol matches that of an atom in the conjecture.

3.3. Spectral Clustering

This axiom selection technique views the formulae of a problem as nodes of a complete undirected graph, with the similarities between the formulae as the edge weights. Spectral clustering [21] is then used to cluster together similar nodes, and the axioms in the cluster containing the conjecture are selected. This process requires three steps: determining the number of clusters, choosing the initial centroids for the clusters, and applying k-means clustering.

The initial centroids for the clustering are extracted from a feature matrix consisting of the eigenvectors of the normalized graph Laplacian matrix [21], computed from the graph’s adjacency matrix with edges weighted by similarity. The rows of the feature matrix corresponding to the conjecture node and to the k − 1 (see below for how k is computed) axiom nodes with highest degree centrality are used as the initial centroids for the clustering.
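The similarity conversion (Equations 5 and 6) and the Q∞ selection of Section 3.2 can be sketched together; the function names and the table-based toy dissimilarity are our assumptions:

```python
# Sketch of maxdsim/sim (Equations 5 and 6) and of the Q∞ selection,
# over a precomputed dissimilarity function. Names here are ours.
import math

def maxdsim(formulae, dsim):
    """Largest finite pairwise dissimilarity in the formula set."""
    return max(dsim(f, g) for f in formulae for g in formulae
               if dsim(f, g) != math.inf)

def sim(f1, f2, formulae, dsim):
    """Similarity: distance below the worst finite dissimilarity."""
    d = dsim(f1, f2)
    return 0.0 if d == math.inf else max(0.0, maxdsim(formulae, dsim) - d)

def q_infinity(conjecture, axioms, dsim):
    """Q∞: select the axioms that are not infinitely dissimilar to the
    conjecture, i.e., that share a predicate symbol with it."""
    return [a for a in axioms if dsim(conjecture, a) != math.inf]

# Toy symmetric dissimilarity over a conjecture "c" and axioms "a1", "a2".
def toy_dsim(f, g):
    table = {frozenset({"c", "a1"}): 2.0, frozenset({"a1", "a2"}): 5.0}
    return 0.0 if f == g else table.get(frozenset({f, g}), math.inf)

F = ["c", "a1", "a2"]
print(sim("c", "a1", F, toy_dsim))              # 3.0  (maxdsim 5.0 - 2.0)
print(q_infinity("c", ["a1", "a2"], toy_dsim))  # ['a1']
```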
To determine the number of clusters for a problem in the MPTP2078 corpus, the three-step process was applied to each problem with the number of clusters ranging from two to half the number of formulae in the problem. The precision was computed each time, and the best number of clusters recorded. A median regression line was fitted to this data (separately for the bushy and chainy problems), as shown in Figure 1. The upper two plots are for a problem set of 325 smaller problems (see Section 4 for details of the testing problem sets), and the lower plot is for the 1551 bushy problems. While the fit of the regression lines is awful, they were used to set the number of clusters k for problems in the corpus, based on the number of formulae in each problem. For the 1551 bushy problems it turns out that two clusters is always optimal, regardless of the number of axioms. It was not possible to compute a regression line for the 1369 chainy problems with the time and computing resources available.

[Figure 1: Best number of clusters vs. number of formulae]

3.4. Greedy Tree plus Nearest Neighbours

This axiom selection technique views the formulae of a problem as nodes of a complete undirected graph, with the similarities between the formulae as the edge weights. Figure 2 provides a simple illustrative example, in which C is the conjecture and the As are the axioms (the figure omits edges that do not affect the example). Starting at the conjecture, a greedy tree is built by iteratively visiting all the axioms most similar to the current leaves of the tree, until axioms that have no similarity (or equivalently, infinite dissimilarity) to the conjecture are reached. These infinitely dissimilar nodes are ignored. In Figure 2 the thicker edges are those that are followed, and the axioms with thicker circles are those at which the tree growth stops because they have
no similarity to the conjecture. At that stage the light grey axioms are in the tree. As a final step, all the unvisited axioms most similar to the axioms in the greedy tree are added to the tree. In Figure 2, that adds the dark grey axioms to the tree. The axioms in the tree are selected; those are the non-white axioms in Figure 2.

[Figure 2: A greedy tree]

4. Evaluation Results

The new axiom selection techniques and the axiom selection in the Vampire ATP system were evaluated using the metrics. No particular effort was made to optimize the implementation of the new selection techniques; rather, the evaluation aimed to demonstrate the usefulness of the metrics, and as such no CPU times are presented below. Two problem sets from each of the bushy and chainy parts of the MPTP2078 corpus were used. The first set was selected by taking the 325 chainy problems with fewer than 500 axioms and for which minimally adequate sets of axioms (pruney problems) are known, and taking the corresponding 325 bushy problems. The second set was the 1551 bushy problems and 1369 chainy problems for which minimally adequate axiom sets (pruney problems) are known. The smaller set was useful for initial quick testing, and also necessary because some of the new techniques could not analyse the chainy problem set with the time and computing resources available.

Tables 1 and 2 show the results, including a row for the base case in which all axioms are selected. The columns are the Precision (Prcn), Selectivity (Sely), Ranking precision (RPrn), Ranking density (RDen), Adequacy (Adeq), and the Adequate precision/selectivity/ranking precision/ranking density.

For the smaller set of 325 bushy problems Q∞ performs the best, with the highest precision and lowest selectivity. It also has a 74% adequacy.
Moving up to the 325 chainy problems, Q∞ and Vampire have the highest precision, with Q∞ having slightly higher selectivity, and Vampire better adequacy. For the 1551 bushy problems Q∞ is again the top performer, now in terms of precision, selectivity, and adequacy. Spectral clustering also performs reasonably well, with a better precision than Vampire, and with the cluster containing the conjecture (recall from Section 3.3 that two clusters is optimal, and hence used here) containing 69% of the axioms on average. Finally, for the 1369 chainy problems Q∞ and Vampire again have the highest precision, with Vampire having the best selectivity and adequacy.

Table 1: Results for the 325 smaller problems

325 Bushy problems          Average                        Adequate
Technique         Prcn  Sely  RPrn  RDen  Adeq    Prcn  Sely  RPrn  RDen
Base              0.35  1.00   -     -    1.00    0.35  1.00   -     -
Vampire 4.4       0.32  0.80   -     -    0.80    0.39  0.84   -     -
Q∞                0.43  0.54  0.62  0.52  0.74    0.58  0.61  0.84  0.71
Spectral Cl.      0.24  0.57   -     -    0.66    0.36  0.79   -     -
Greedy Tree+NN    0.36  0.57   -     -    0.66    0.36  0.79   -     -

325 Chainy problems         Average                        Adequate
Technique         Prcn  Sely  RPrn  RDen  Adeq    Prcn  Sely  RPrn  RDen
Base              0.06  1.00   -     -    1.00    0.06  1.00   -     -
Vampire 4.4       0.08  0.55   -     -    0.94    0.09  0.56   -     -
Q∞                0.08  0.53  0.57  0.21  0.85    0.09  0.56  0.67  0.25
Spectral Cl.      0.05  0.48   -     -    0.65    0.08  0.63   -     -
Greedy Tree+NN    0.05  0.79   -     -    0.86    0.06  0.85   -     -

Table 2: Results for all problems with known minimally adequate axiom sets

1551 Bushy problems         Average                        Adequate
Technique         Prcn  Sely  RPrn  RDen  Adeq    Prcn  Sely  RPrn  RDen
Base              0.30  1.00   -     -    1.00    0.30  1.00   -     -
Vampire 4.4       0.21  0.69   -     -    0.58    0.35  0.76   -     -
Q∞                0.33  0.54  0.55  0.41  0.68    0.48  0.59  0.80  0.60
Spectral Cl.      0.25  0.69   -     -    0.76    0.33  0.83   -     -

1369 Chainy problems        Average                        Adequate
Technique         Prcn  Sely  RPrn  RDen  Adeq    Prcn  Sely  RPrn  RDen
Base              0.02  1.00   -     -    1.00    0.02  1.00   -     -
Vampire 4.4       0.03  0.40   -     -    0.87    0.04  0.41   -     -
Q∞                0.03  0.53  0.46  0.07  0.81    0.04  0.56  0.56  0.09
Overall, Q∞ performs the best on the bushy problems, with Vampire doing slightly better on the chainy problems. It is surprising that the very simple Q∞ technique performs as well as it does, when compared to the mature (and more complex) Vampire selection technique. The adequacy numbers have to be viewed carefully: it is easy to get an optimal adequacy simply by selecting all axioms and thus having pessimal selectivity.

It is interesting to contrast the average and adequate precision values. The average precision for the base case is better than that of some of the selection techniques because of the zero precision assigned when an inadequate set of axioms is selected. An adequate precision close to the base case, as for, e.g., Spectral Clustering and Greedy Tree+NN on both sets of 325 problems, indicates that doing the axiom selection is probably wasted effort.

There is a correlation, but not a particularly strong one, between the number of axioms in a problem and the precision of the Vampire and Q∞ axiom selections, e.g., around 40% for the chainy problems. This suggests that the axiom selection techniques might scale well to larger problems.

Overall, the precision values are disappointingly low, especially for the chainy problems. For Q∞ the ranking density is similarly low, so that even with perfect ranking precision the raw precision cannot be very good. Figure 3 shows the distributions of the precision values for the chainy problems. In each plot the left set of points is for those problems where an adequate set of axioms was selected, and the right set of points is for those problems where an inadequate set of axioms was selected. For the majority of problems the precision is less than 0.1, i.e., less than 10% of the selected axioms are used in a proof.

[Figure 3: Distribution of precision values for chainy problems]

4.1.
Evaluating the Metrics

As was noted in the introduction to this paper, while metrics such as those presented here are useful for quick evaluation of axiom selection techniques, the “proof is in the pudding”. The quality of the metrics needs to be established by comparing them with eventual ATP system performance. An initial evaluation of the precision metric has been performed by using the Vampire and Q∞ axiom selections, and looking at the performance of E on the resultant reduced problems. This was done for the 1551 bushy problems and 1369 chainy problems for which there are known minimally adequate axiom sets, so that the precision could be computed. The testing was done with a 300s CPU time limit on computers with a dual-core Intel(R) Xeon(R) E5-2609 2.50GHz CPU and 8GB memory, running the CentOS Linux release 7.6.1810 operating system.

The results are shown in Table 3. In all four cases the precision of the unsolved problems is much lower than for the solved problems: the precision metric aligns correctly with the solvability of the reduced problems. The numbers of problems with precision 0.00 and 1.00 also align with the numbers of solutions, with higher numbers of solved problems with precision 1.00, and higher numbers of unsolved problems with precision 0.00. Overall, these results indicate that the precision metric does correctly evaluate the quality of the axiom selection.

The attentive reader might have noticed that there are some solved problems with precision 0.00, which is surprising because the precision should be 0.00 only if the selected axioms are not an adequate set (Section 2.2). This indicates that these experiments revealed some new minimally adequate sets of axioms for some problems in the MPTP2078 corpus, and thus that some further
pruney problems need to be created.

Table 3: Evaluation of Vampire and Q∞ axiom selection wrt E

1551 Bushy problems    Vampire selection          Q∞ selection
                       Solved      Unsolved       Solved      Unsolved
# Problems             848 (55%)   703 (45%)      969 (62%)   582 (38%)
Avg Prcn               0.35        0.03           0.48        0.07
# Prcn 0.00            5 (1%)      646 (92%)      3 (0%)      487 (84%)
# Prcn 1.00            7 (1%)      0 (0%)         63 (7%)     0 (0%)

1369 Chainy problems   Vampire selection          Q∞ selection
                       Solved      Unsolved       Solved      Unsolved
# Problems             901 (66%)   468 (34%)      812 (59%)   557 (41%)
Avg Prcn               0.04        0.02           0.04        0.01
# Prcn 0.00            19 (2%)     159 (34%)      11 (1%)     243 (44%)
# Prcn 1.00            0 (0%)      0 (0%)         0 (0%)      0 (0%)

5. Conclusion

The main contribution of this paper is metrics for evaluating axiom selection techniques without having to run an ATP system on the reduced problems. Three new axiom selection techniques have also been presented, along with results from evaluating the new techniques and the axiom selection in the Vampire ATP system. Future work includes a more comprehensive evaluation of the metrics, to confirm that they provide a meaningful evaluation of axiom selection techniques. It will also be interesting to apply the metrics to the axiom selection techniques found in other ATP systems such as E, GKC [22], iProver [23], Leo-III [24], and MaLARea.

The results show that there is lots of room for further research and improvement in axiom selection. We are currently investigating the use of network propagation [25], adopted from the analysis of protein-protein interaction networks [26], for analysing the formula similarity graphs. One interesting possibility is pipelining axiom selection tools, so that the first in the pipeline receives the original problem, its output becomes the input to the second in the pipeline, and so on. Finally, the three new axiom selection techniques all build on the dissimilarity and similarity measures described in Section 3.1.
Further development and improvement of those measures, or their complete replacement, would have consequential impacts downstream.

References

[1] G. Sutcliffe, The CADE-27 Automated Theorem Proving System Competition - CASC-27, AI Communications 32 (2020) 373–389.
[2] J. Alama, T. Heskes, D. Kühlwein, E. Tsivtsivadze, J. Urban, Premise Selection for Mathematics by Corpus Analysis and Kernel Methods, Journal of Automated Reasoning 52 (2014) 191–213.
[3] C. Kaliszyk, J. Urban, MizAR 40 for Mizar 40, Journal of Automated Reasoning 55 (2015) 245–256.
[4] C. Brown, T. Gauthier, C. Kaliszyk, G. Sutcliffe, J. Urban, GRUNGE: A Grand Unified ATP Challenge, in: P. Fontaine (Ed.), Proceedings of the 27th International Conference on Automated Deduction, number 11716 in Lecture Notes in Computer Science, Springer-Verlag, 2019, pp. 123–141.
[5] Y. Puzis, G. Sutcliffe, Y. Zhang, Y. Gao, Using Relevance Measures for Axiom and Lemma Selection in Automated Theorem Proving, in: V. Barr, Z. Markov (Eds.), Proceedings of the 17th International FLAIRS Conference, AAAI Press, 2004.
[6] G. Sutcliffe, Y. Puzis, SRASS - a Semantic Relevance Axiom Selection System, in: F. Pfenning (Ed.), Proceedings of the 21st International Conference on Automated Deduction, number 4603 in Lecture Notes in Artificial Intelligence, Springer-Verlag, 2007, pp. 295–310.
[7] J. Meng, L. Paulson, Lightweight Relevance Filtering for Machine-generated Resolution Problems, Journal of Applied Logic 7 (2009) 41–57.
[8] D. Kühlwein, M. Cramer, P. Koepke, B. Schröder, Premise Selection in the Naproche System, in: J. Giesl, R. Haehnle (Eds.), Proceedings of the 5th International Joint Conference on Automated Reasoning, number 6173 in Lecture Notes in Artificial Intelligence, 2010, pp. 434–440.
[9] K. Hoder, A. Voronkov, Sine Qua Non for Large Theory Reasoning, in: V. Sofronie-Stokkermans, N.
Bjørner (Eds.), Proceedings of the 23rd International Conference on Automated Deduction, number 6803 in Lecture Notes in Artificial Intelligence, Springer-Verlag, 2011, pp. 299–314.
[10] D. Kühlwein, T. van Laarhoven, E. Tsivtsivadze, J. Urban, T. Heskes, Overview and Evaluation of Premise Selection Techniques for Large Theory Mathematics, in: B. Gramlich, D. Miller, U. Sattler (Eds.), Proceedings of the 6th International Joint Conference on Automated Reasoning, number 7364 in Lecture Notes in Artificial Intelligence, 2012, pp. 378–392.
[11] T. Gauthier, C. Kaliszyk, Premise Selection and External Provers for HOL4, in: X. Leroy, A. Tiu (Eds.), Proceedings of the 4th ACM SIGPLAN Conference on Certified Programs and Proofs, ACM Press, 2015, pp. 49–57.
[12] B. Piotrowski, J. Urban, ATPboost: Learning Premise Selection in Binary Setting with ATP Feedback, in: D. Galmiche, S. Schulz, R. Sebastiani (Eds.), Proceedings of the 9th International Joint Conference on Automated Reasoning, number 10900 in Lecture Notes in Computer Science, 2018, pp. 566–574.
[13] D. Kühlwein, J. Blanchette, A Survey of Axiom Selection as a Machine Learning Problem, in: S. Geschke (Ed.), Computability and Metamathematics: Festschrift Celebrating the 60th Birthdays of Peter Koepke and Philip Welch, College Publications, 2014, pp. 1–15.
[14] J. Urban, G. Sutcliffe, P. Pudlak, J. Vyskocil, MaLARea SG1: Machine Learner for Automated Reasoning with Semantic Guidance, in: P. Baumgartner, A. Armando, G. Dowek (Eds.), Proceedings of the 4th International Joint Conference on Automated Reasoning, number 5195 in Lecture Notes in Artificial Intelligence, Springer-Verlag, 2008, pp. 441–456.
[15] L. Kovacs, A. Voronkov, First-Order Theorem Proving and Vampire, in: N. Sharygina, H. Veith (Eds.), Proceedings of the 25th International Conference on Computer Aided Verification, number 8044 in Lecture Notes in Artificial Intelligence, Springer-Verlag, 2013, pp. 1–35.
[16] L. Paulson, J. Blanchette, Three Years of Experience with Sledgehammer, a Practical Link between Automatic and Interactive Theorem Provers, in: G. Sutcliffe, E. Ternovska, S. Schulz (Eds.), Proceedings of the 8th International Workshop on the Implementation of Logics, number 2 in EPiC Series in Computing, EasyChair Publications, 2010, pp. 1–11.
[17] P. Rudnicki, An Overview of the Mizar Project, in: Proceedings of the 1992 Workshop on Types for Proofs and Programs, 1992, pp. 311–332.
[18] S. Schulz, S. Cruanes, P. Vukmirovic, Faster, Higher, Stronger: E 2.3, in: P. Fontaine (Ed.), Proceedings of the 27th International Conference on Automated Deduction, number 11716 in Lecture Notes in Computer Science, Springer-Verlag, 2019, pp. 495–507.
[19] A. Stump, G. Sutcliffe, C. Tinelli, StarExec: a Cross-Community Infrastructure for Logic Solving, in: S. Demri, D. Kapur, C. Weidenbach (Eds.), Proceedings of the 7th International Joint Conference on Automated Reasoning, number 8562 in Lecture Notes in Artificial Intelligence, 2014, pp. 367–373.
[20] A. Hutchinson, Metrics on Terms and Clauses, in: M. van Someren, G. Widmer (Eds.), Proceedings of the 9th European Conference on Machine Learning, number 1224 in Lecture Notes in Artificial Intelligence, Springer-Verlag, 1997, pp. 138–145.
[21] U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing 17 (2007) 395–416.
[22] T. Tammet, GKC: a Reasoning System for Large Knowledge Bases, in: P. Fontaine (Ed.), Proceedings of the 27th International Conference on Automated Deduction, number 11716 in Lecture Notes in Computer Science, Springer-Verlag, 2019, pp. 538–549.
[23] K. Korovin, iProver - An Instantiation-Based Theorem Prover for First-order Logic (System Description), in: P. Baumgartner, A. Armando, G. Dowek (Eds.), Proceedings of the 4th International Joint Conference on Automated Reasoning, number 5195 in Lecture Notes in Artificial Intelligence, 2008, pp. 292–298.
[24] A. Steen, C. Benzmüller, The Higher-Order Prover Leo-III, in: D. Galmiche, S. Schulz, R. Sebastiani (Eds.), Proceedings of the 8th International Joint Conference on Automated Reasoning, number 10900 in Lecture Notes in Artificial Intelligence, 2018, pp. 108–116.
[25] A. Stojmirovic, Y. Yu, Information Flow in Interaction Networks, Journal of Computational Biology 14 (2007) 1115–1143.
[26] P. Devkota, M. Danzi, V. Lemmon, J. Bixby, S. Wuchty, Computational Identification of Kinases that Control Axon Growth in Mouse, SLAS Discovery (2020), to appear.