randoCoP: Randomizing the Proof Search Order in the Connection Calculus Thomas Raths Jens Otten Institut für Informatik, University of Potsdam August-Bebel-Str. 89, 14482 Potsdam-Babelsberg, Germany {traths,jeotten}@cs.uni-potsdam.de Abstract. We present randoCoP, a theorem prover for classical first- order logic, which integrates randomized search techniques into the con- nection prover leanCoP 2.0. By randomly reordering the axioms of the problem and the literals within its clausal form, the incomplete search variants of leanCoP 2.0 can be improved significantly. We introduce de- tails of the implementation and present comprehensive practical results by comparing the performance of randoCoP with leanCoP and other the- orem provers on the TPTP library and problems involving large theories. 1 Introduction Connection calculi, such as the connection calculus [1, 2], the connection tableau calculus [8], or the model elimination calculus [9], are in contrast to standard tableau or saturation-based calculi not proof confluent. Therefore a large amount of backtracking is required during the proof search. By restricting this back- tracking the performance of connection-based proof search procedures can be improved significantly [13]. leanCoP 2.0 [15, 14] is a theorem prover for classical first-order logic based on the connection calculus. A shell script consecutively runs different variants of the core prover with different options, which control the proof search. The most successful variants use restricted backtracking. The downside of restricted backtracking is the loss of completeness. Whereas proofs for some formulae can be found very quickly, it might be impossible to find proofs for other formulae anymore. Since restricted backtracking cuts off alternative connections, the benefit of this approach strongly depends on the proof search order. The proof search order, in turn, usually depends on the order of clauses and literals in the given formula. Whereas the proof search procedure quickly finds a proof for one order, another order of the same clauses might result in an incomplete proof search order, i.e. no proof is found at all. By reordering the clauses, the downside of restricted backtracking can be minimized. randoCoP extends the leanCoP 2.0 implementation by repeatedly reordering the axioms and literals of a given problem at random. This increases the chance to find a proof, in particular for the incomplete prover variants. In Section 2 we present experimental results of several reordering techniques and details of the implemented reordering strategy. In Section 3 the performance of randoCoP is compared with current state-of-the-art theorem provers. 94 2 Randomizing the Proof Search Order We first describe the basic motivation behind the randomized reordering tech- nique. Afterwards we present a detailed practical analysis, which determines the specific reordering strategy that is used within randoCoP. 2.1 Motivation leanCoP searches for connections in the order of the given input clauses. Connec- tions to clauses at the beginning of the input clauses are considered first, before (on backtracking) clauses at the end are examined. Therefore reordering clauses is a simple way of modifying the proof search order. But the effect of reordering clauses is limited for complete search strategies (see Section 2.2), since every clause is considered sooner or later anyway. leanCoP 2.0 has the option to restrict backtracking by cutting off alternative connections in branches that have already been closed before [13]. With this incomplete search technique some clauses might not be considered anymore and the reordering of clauses has a significant effect. It makes it possible to find proofs for problems that could not be solved before. Figure 1 illustrates this fact. The triangle represents the search space, the crosses mark the solutions, and the grey shaded area is the search space that can be traversed within a certain time limit and is roughly the same for all three triangles. Fig. 1. Search strategies: complete, restricted backtracking without/with reordering The complete search strategy (left hand side), does not reach the proof depth required for a solution. The search with restricted backtracking (in the middle), reaches the depth of the solutions but not the required breadth. Only the search strategy with restricted backtracking and repeatedly reordering the clauses (right hand side) is able to find a proof. Clauses can be reordered in several ways, e.g. by rotating or shifting clauses. leanCoP 2.0 already contains an option for reordering clauses that uses a simple perfect shuffle algorithm. But the effect on the proof search is small, since the generated clause orders are not sufficiently diverse. A random reordering mixes the order of clauses more thoroughly. Therefore instead of the built-in reordering technique a randomized reordering should be used. Since the outcome of the proof search needs to be reproducible, a deterministic pseudo-random reordering needs to be implemented. 95 2.2 Evaluation and Analysis Our practical experiments have shown that a reordering of the axioms of a given problem is more effective than the reordering of clauses after the clausal form translation has been applied. These experiments also showed that reordering the literals within the clauses has a positive effect on the proof search as well. The proof search order is randomly modified in the following way. Let A1 ∧ A2 ∧ . . . An ∧ ⇒ C be the given formula where A1 , A2 , . . . , An are the axioms and C is the conjecture of the problem. Let π m : {1, . . . , m} → {1, . . . , m} be a (pseudo-)random permutation. 1. Reordering of axioms: The following formula is generated: Aπn (1) ∧ Aπn (2) ∧ . . . ∧ Aπn (n) ⇒ C. Let C1 , C2 , . . . , Cm be this formula in clausal form. 2. Reordering of literals: For each clause Ci ={L1 , . . . , Lk } the reordered clause Ci0 ={Lπk (1) , . . . , Lπk (k) } is generated. C10 , . . . , Ck0 is the final set of clauses. We first want to find the options of the leanCoP 2.0 core prover that are most suited for our randomized reordering approach. A variant of the leanCoP 2.0 Prolog core prover is determined by a set of options that control the proof search [14]. The following options can be used: 1. nodef/def: The standard/definitional translation into clausal form is done. If none of these options is given the standard translation is used for the axioms whereas the definitional translation is used for the conjecture [13]. 2. conj: The conjecture clauses are used as start clauses. If this option is not given the positive clauses are used as start clauses. 3. scut: Backtracking is restricted for alternative start clauses [13]. 4. cut: Backtracking is restricted for alternative reduction/extension steps [13]. 5. comp(I): Restricted backtracking is switched off when iterative deepening exceeds the active path length I. The option conj is complete only for formulae with a provable conjecture, and the options scut and cut are only complete if used in combination with the option comp(I). The following tests were performed on all 3644 non-clausal problems (FOF division) of the TPTP library [18] version 3.3.0. The left side of Table 1 shows the number of proved problems without reordering using a time limit of 180 seconds. The right side displays the number of proved problems with repeated reordering. The allotted time limit for each order is 3 seconds with an overall time limit of 180 seconds allowing around 60 reorderings for each problem. The prover variants using restricted backtracking (options cut and scut) benefit most from the randomized reordering technique. The greatest advance is made for the variant using the default clausal form translation and both of the options cut and scut. The variants using a full definitional (def) or the standard (nodef) clausal form translation and no restricted backtracking show a lower performance. The most successful variants using reordering use the options {cut}, {cut,scut}, {conj,cut}, and {conj,cut,scut}. 96 Table 1. Results without and with repeated reordering for different prover variants options - cut cut,scut options - cut cut,scut - 1217 1346 1216 - 1342 1691 1684 def 1111 1321 1204 def 1044 1402 1428 nodef 1166 1257 1152 nodef 1077 1432 1436 conj 1208 1398 1336 conj 1422 1658 1597 conj,def 1208 1407 1349 conj,def 1095 1452 1370 conj,nodef 1227 1319 1058 conj,nodef 1152 1365 1239 We have tested the most successful variants again using different time limits for each order of the axioms/literals. Table 2 shows the results using an overall time limit of 180 seconds and a time limit of 2 seconds, 3 seconds and 4 seconds for proving each order allowing around 90, 60 and 45 reorderings, respectively. The difference in the number of proved problems is rather small with a slight advantage for a time limit of 3 seconds. Table 2. Results of different time limits per axiom/literal order options \ time 2s 3s 4s cut 1678 1691 1688 cut,scut 1670 1684 1650 conj,cut 1636 1658 1665 conj,cut,scut 1569 1597 1607 Further evaluations have shown that the most number of problems are proved by repeatedly running the two variants {cut,scut} and {def,conj,cut} each for a time limit of 3 seconds and 2 seconds on each axiom/literal order, respectively. In order to prove and refute a large number of problems within the first few seconds, the proof process is started by running the most successful complete prover variant using the options cut and comp(7) once for 5 seconds. A complete prover variant using the option def only is invoked at the end of the proof process for at least 10 seconds. 2.3 The Implementation randoCoP uses the random library of ECLiPSe Prolog, which contains the predi- cate rand_perm/2 that randomly permutes a list, and randomise/1 to seed and initialize the random generator, which allows the same sequence of permutations to be generated, and thus makes the result reproducible. randoCoP consists of a shell script that calls the leanCoP 2.0 core prover extended by a few predicates realizing the reordering of axioms and literals. This shell script is called with an argument that determines the overall time limit and invokes the prover variants according to Section 2.2. 97 Let T otalT ime be this given time limit in seconds. Then the shell script invokes the following variants of the core prover in the following order: 1. Prover variant {cut,comp(7)} for 5 seconds 2. Repeated reordering step, (T otalT ime − 15)/5 times: (a) Reordering of axioms and literals (b) Prover variant {cut,scut} for 3 seconds (c) Prover variant {def,conj,cut} for 2 seconds 3. Prover variant {def} for 10 seconds If, for example, T otalT ime is set to 600 seconds, then (600-15)/5=117 re- ordering steps are done and after each reordering the two prover variants are run for 3 seconds and 2 seconds, respectively. 3 Performance To evaluate the performance we tested randoCoP on all non-clausal problems in the TPTP library, and the problems of the MPTP challenge. All tests were done on a system with a 3 GHz Xeon processor and 4 Gbyte of memory running Linux and ECLiPSe Prolog 5.8. We compare the performance of randoCoP with the current state-of-the-art provers. 3.1 The TPTP Library For the tests all 3644 problems of version 3.3.0 of the TPTP library [18] that are in non-clausal form (FOF division) are considered. In order to solve satisfiable or unsatisfiable problems — i.e., those problems without a conjecture — these problems are negated. Equality is dealt with by adding the equality axioms. The time limit for all problems is 600 seconds. In Table 3 the performance of randoCoP is compared with the following theo- rem provers: Otter 3.3 [10], version ”20070805r009” of SNARK [19], leanCoP 2.0 [14], version ”Dec-2007” of Prover91 [11], iProver 0.2 [5], Equinox 1.2 [3], SPASS 3.0 [23], E 0.999 [17], and Vampire 9.0 [16]. The rating and the percentage of proved problems for some rating intervals are given. FNE, FEQ and PEQ are problems without, with and containing only equality, respectively. Furthermore, the number of proved problems for each domain (see [18]) that contains at least 10 problems is shown. randoCoP proves more problems than Equinox, iProver, Prover9, leanCoP 2.0, SNARK and Otter. It solves more problems in the SEU domain than any other prover. The SEU domain contains problems from set theory taken from the MPTP [20] (see also Section 3.2). Including the equality axioms, these problems contain up to 1100 axioms, of which not all are required to prove the conjecture. For the SEU domain randoCoP also shows the biggest improvement compared to leanCoP 2.0, with 491 proved problems compared to 296 problems proved by leanCoP 2.0. A notable improvement is also made for the domains NUM and SWC. 96 of all proved problems have the highest rating of 1.0. 1 The most recent version ”2008-04A” of Prover9 has a significant lower performance. 98 Table 3. Benchmark results on the TPTP v3.3.0 library Otter SNARK leanCoP Prover9 iProver Equinox randoCoP SPASS E Vampire 3.3 08/07 2.0 12/07 0.2 1.2 1.0 3.0 0.999 9.0 proved 1310 1565 1638 1677 1858 1876 1879 2127 2250 2377 [%] 36% 43% 45% 46% 51% 52% 52% 58% 62% 66% 0s to 1s 987 1259 1124 1281 1224 1376 1161 1627 1760 1530 1s to 10s 183 130 123 197 370 239 229 248 229 283 10s to 100s 106 117 193 141 179 151 375 170 192 394 100s to 600s 34 59 198 58 85 110 114 82 69 170 rating 0.0 455 435 450 464 473 500 452 500 503 488 rating >0.0 855 1130 1188 1213 1385 1376 1427 1627 1747 1889 rating 1.0 7 8 13 8 9 18 96 19 12 30 0.00...0.24 72% 72% 72% 73% 76% 76% 73% 78% 79% 78% 0.25...0.49 40% 62% 48% 70% 73% 72% 52% 84% 85% 86% 0.50...0.74 3% 20% 36% 28% 43% 40% 43% 58% 71% 80% 0.75...1.00 1% 2% 11% 3% 7% 9% 27% 15% 19% 27% FNE 476 437 492 526 562 541 496 555 561 566 FEQ 834 1128 1146 1151 1296 1335 1383 1572 1689 1811 PEQ 47 83 30 71 76 196 31 191 165 133 AGT 16 17 29 17 19 10 32 18 19 52 ALG 60 84 32 83 80 186 38 196 162 130 CSR 3 16 2 27 10 15 3 2 25 27 GEO 160 121 171 171 172 167 169 168 174 177 GRA 5 9 6 1 8 9 6 15 15 19 KRS 106 110 105 103 112 111 105 107 112 112 LCL 18 56 24 48 33 10 25 51 78 101 MED 5 2 7 1 9 5 5 9 8 9 MGT 54 58 45 62 65 67 50 56 67 67 NLP 6 11 13 11 22 22 15 22 22 22 NUM 31 45 70 49 43 45 87 48 62 105 PUZ 6 6 6 6 6 6 6 6 6 6 SET 229 262 339 276 316 245 335 290 339 369 SEU 149 229 296 259 291 305 491 353 316 347 SWC 84 131 87 101 170 175 99 256 326 297 SWV 157 181 177 183 210 221 181 235 225 236 SYN 210 214 217 267 277 265 218 279 276 281 refuted 0 113 33 0 329 0 33 362 366 0 time out 700 1596 1949 668 1455 1395 1732 1125 1023 985 gave up 1181 213 0 320 0 373 0 0 0 280 errors 453 157 24 979 2 0 0 30 5 2 99 3.2 The MPTP Challenge The MPTP challenge is a set of problems from the Mizar library translated into first-order logic [20]. There are two divisions, bushy and chainy, each containing 252 problems. Whereas the bushy division contains only the relevant axioms and lemmata required to prove the main theorem, the chainy division contains all axioms and lemmata that were available at the time of proving the main theorem of the challenge. The time limit for each problem is 300 seconds. The result of randoCoP on the bushy and chainy division is shown in Table 4 and Table 5, respectively. Again, equality axioms are added, which results in formulae with a total of up to 1700 axioms. The performance is compared with the theorem provers mentioned in Section 3.1. Table 4. Benchmark results for the bushy division of the MPTP challenge Otter Prover9 SNARK leanCoP iProver Equinox E SPASS Vampire randoCoP 3.3 12/07 08/07 2.0 0.2 1.2 0.999 3.0 9.0 1.0 proved 68 119 122 128 128 131 141 160 166 189 [%] 27% 47% 48% 51% 51% 52% 56% 64% 66% 75% 0s to 1s 56 86 91 93 71 86 120 121 104 93 1s to 10s 4 19 17 7 33 23 12 16 22 23 10s to 100s 7 9 12 20 20 17 8 17 31 60 100s to 300s 1 5 2 8 4 5 1 6 9 13 time out 46 82 130 124 124 121 111 90 86 63 gave up 137 3 0 0 0 0 0 0 0 0 errors 1 48 0 0 0 0 0 2 0 0 Table 5. Benchmark results for the chainy division of the MPTP challenge Otter Prover9 SNARK Equinox iProver Vampire SPASS leanCoP E randoCoP 3.3 12/07 08/07 1.2 0.2 9.0 3.0 2.0 0.999 1.0 proved 29 52 58 79 79 81 82 88 91 128 [%] 12% 21% 23% 31% 31% 32% 33% 35% 36% 51% 0s to 1s 17 33 22 41 29 29 45 38 47 38 1s to 10s 7 8 14 14 37 31 18 21 18 29 10s to 100s 5 6 14 17 10 14 15 23 20 34 100s to 300s 0 5 8 7 3 7 4 6 6 27 time out 150 101 194 173 173 171 170 164 161 124 gave up 73 0 0 0 0 0 0 0 0 0 errors 0 99 0 0 0 0 0 0 0 0 randoCoP shows a decent performance in both division. The time complexity is better compared to the other provers, as many problems are (still) proved after 10 seconds. We have not tested the provers MaLARea 0.1, SRASS 0.1, and Fampire 1.3, which prove 187/142, 171/127, and 191/126 of the problems in the bushy/chainy division, respectively (according to the MPTP challenge web site). 100 4 Conclusion and Related Work We have presented randoCoP, a theorem prover for classical first-order logic, which integrates a random proof search strategy into the connection prover leanCoP 2.0. Repeatedly reordering the axioms of the problem and the literals within its clausal form improves performance of leanCoP 2.0 significantly. Some incomplete strategies of leanCoP 2.0 effectively restrict backtracking and increase the depth of the search space that can be investigated within a certain amount of time. But they might cut off specific proof search orders required to find a proof. randoCoP partly compensates for this disadvantage and the loss of completeness by increasing again the breadth of the explored search space. The combination of restricted backtracking and randomized reordering is highly effective, in particular for hard problems containing many axioms. The core prover of randoCoP and leanCoP 2.0 consists only of a few lines of Prolog code. This indicates that tens to hundreds of thousands of lines in, e.g., C and low-level optimizations are not needed to succeed in automated theorem proving. Instead it shows that good heuristics for traversing the vast search are important in automated reasoning research. The RCTHEO system [4] randomly reorders clause instances. It is a an OR- parallel version of SETHEO [7], where each node executes one instance of the sequential prover SETHEO. The performance is similar to PARTHEO, a parallel version of SETHEO. The SETHEO system offers a dynamic subgoal reordering option. The re- ordering is not at random but prefers subgoals with the highest probability to fail, in order to reduce the search space. Syntactic criteria, such as the number of variables, are used to determine the specific order. The subgoal order is then determined dynamically whenever the next subgoal is selected. We have tested SETHEO without and with subgoal reordering on all non- clausal problems of the TPTP library (see Section 3.1). Without subgoal reorder- ing SETHEO proves 1192 out of the 3644 problems. With subgoal reordering (using the option -dynsgreord 2) it only solves 1185 problems, i.e. the perfor- mance does not really improve. This confirms our own testing with reordering on several variants of the leanCoP 2.0 core prover (see Section 2.2). The effect of reordering axioms (or clauses) and literals is limited when a complete search in connection calculi is done. In this case the performance can even get worse. On the other hand the incomplete variants of the leanCoP 2.0 core prover benefit significantly from the randomized reordering technique. Further research includes the adaption of randoCoP to other (non-classical) logics — such as intuitionistic logic [12, 14] or modal logic [6] — for which matrix characterisations exist (see also [21, 22]). It is also worth investigating approaches that randomize, e.g., the order of the conjecture clauses, or dynamically reorders clauses and/or literals during the proof search. And finally we plan to (slightly) extend the leanCoP 2.0 core prover so that a compact connection proof is re- turned. A readable proof is then output by a separate prover component. The source code of randoCoP can be obtained at the leanCoP website at http://www.leancop.de. 101 Acknowledgements. The authors would like to thank the referees for their useful comments, which have helped to improve this paper. References 1. W. Bibel. Matings in matrices. Communications of the ACM , 26:844–852, 1983. 2. W. Bibel. Automated Theorem Proving. Vieweg, second edition, 1987. 3. K. Classen. Equinox, a new theorem prover for full first-order logic with equality. In Dagstuhl Seminar 05431 on Deduction and Applications, 2005. 4. W. Ertel. OR-parallel theorem proving with random competition. In A. Voron- kov, Ed., LPAR’92, LNAI 624, pp. 226–237. Springer, 1992. 5. K. Korovin. Implementing an instantiation-based theorem prover for first-order logic. In C. Benzmueller, B. Fischer, G. Sutcliffe IWIL-6, 2006. 6. C. Kreitz, J. Otten. Connection-based theorem proving in classical and non- classical logics. Journal of Universal Computer Science, 5:88–112, Springer, 1999. 7. R. Letz, J. Schumann, S. Bayerl, W. Bibel. SETHEO: a high-performance theorem prover. Journal of Automated Reasoning, 8:183–212, 1992. 8. R. Letz, G. Stenz Model elimination and connection tableau procedures. Hand- book of Automated Reasoning, pp. 2015–2114, Elsevier, 2001. 9. D. Loveland. Mechanical theorem proving by model elimination. JACM , 15:236– 251, 1968. 10. W. McCune. Otter 3.0 reference manual and guide. Technical Report ANL- 94/6, Argonne National Laboratory, 1994. 11. W. McCune. Release of Prover9. Mile High Conference on Quasigroups, Loops and Nonassociative Systems, Denver, Colorado, 2005. 12. J. Otten. Clausal connection-based theorem proving in intuitionistic first-order logic. TABLEAUX 2005 , LNAI 3702, pages 245–261, Springer, 2005. 13. J. Otten. Restricting backtracking in connection calculi. Technical report, Institut für Informatik, University of Potsdam, 2008. 14. J. Otten. leanCoP 2.0 and ileanCoP 1.2: high performance lean theorem proving in classical and intuitionistic logic. IJCAR 2008. LNCS, Springer, 2008. 15. J. Otten, W. Bibel. leanCoP: lean connection-based theorem proving. Journal of Symbolic Computation, 36:139–161, 2003. 16. A. Riazanov, A. Voronkov. The design and implementation of Vampire. AI Communications 15(2-3): 91–110, 2002. 17. S. Schulz. E - a brainiac theorem prover. AI Communications, 15(2):111–126, 2002. 18. G. Sutcliffe, C. Suttner. The TPTP problem library - CNF release v1.2.1. Journal of Automated Reasoning, 21: 177–203, 1998. 19. M. Stickel, R. Waldinger, M. Lowry, T. Pressburger, I. Underwood. Deductive composition of astronomical software from subroutine libraries. In A. Bundy, Ed., CADE-12, LNCS 814, pp. 341–355, Springer, 1994. 20. J. Urban. MPTP 0.2: design, implementation, and initial experiments. Journal of Automated Reasoning, 37:21–43, 2006. 21. A. Waaler. Connections in nonclassical logics. Handbook of Automated Reason- ing, pp. 1487–1578, Elsevier, 2001. 22. L. Wallen. Automated deduction in nonclassical logic. MIT Press, 1990. 23. C. Weidenbach, R. A. Schmidt, T. Hillenbrand, R. Rusev, D. Topic. Sys- tem description: SPASS version 3.0. In F. Pfenning, Ed., CADE-21 , LNCS 4603, pp. 514–520. Springer, 2007. 102