Translating Higher-Order Problems to First-Order Clauses Jia Meng1 , Lawrence C. Paulson2 1 National ICT, Australia jiameng@nicta.com.au 2 Computer Laboratory, University of Cambridge, U.K. LP15@cam.ac.uk Abstract Proofs involving large specifications are typically carried out through interactive provers that use higher-order logic. A promising approach to improve the automa- tion of interactive provers is by integrating them with automatic provers, which are usually based on first-order logic. Consequently, it is necessary to translate higher-order logic formulae to first-order form. This translation should ideally be both sound and practical. We have implemented three higher-order to first-order translations, with particular emphasis on the translation of types. Omitting some type information improves the success rate, but can be unsound, so the interactive prover must verify the proofs. In this paper, we will describe our translations and experimental data that compares the three translations in respect of their success rates for various automatic provers. 1 Introduction Interactive theorem provers, such as HOL4 [GM93], Isabelle [NPW02] and PVS [ORR+ 96] are widely used for formal verifications and specifications. They provide expressive formalisms and tools for managing large scale proof projects. However, a ma- jor weakness of interactive provers is the lack of automation. In order to overcome the problem, we have integrated Isabelle with automatic theorem provers (ATPs) [MQP06]. ATPs combine a variety of reasoning methods and do not require users’ instructions on how to use an axiom or when to use an axiom. For example, they do not require equalities to be oriented, but use them as undirected equations. Among the many logics used by interactive provers, higher-order logic (HOL) is the most popular one because of its expressiveness. One of the most widely used logics in Isabelle is Isabelle/HOL. In contrast, most of the powerful ATPs are based on first-order logic (FOL). Therefore, it is important to translate HOL problems into FOL format. Those HOL problems that do not involve function variables, predicate variables or λ-abstractions can be translated directly to FOL format. However, we must be careful about how to translate HOL problems that are truly higher-order. In particular, we need to include the problem’s type information to preserve soundness. A sound approach is to include all types for all terms. Unfortunately, this will result in large terms and clauses, and much of the type information is redundant. A more compact type representation 70 Empirically Successful Computerized Reasoning could enhance the performance of ATPs. Omitting some type information could lead to unsoundness, which ultimately we will prevent through proof reconstruction (Sect. 2.5). We have implemented three HOL to FOL translations, and we believe two of them are new. We have also carried out extensive experiments in order to assess their effectiveness with the provers E [Sch04], SPASS [Wei01] and Vampire [RV01]. Paper outline. We first describe three HOL to FOL translations and discuss their soundness (Sect. 2). We then describe the experiments we ran (Sect. 3) and finally offer some conclusions (Sect. 4). 2 Background Higher-order logic (HOL) extends first-order logic (FOL) in several respects. The main difference is that HOL terms can denote truth values and functions. Function values can be expressed using λ-abstractions or by currying: that is, by applying a function to fewer than the maximum number of arguments. In FOL, a function must always be supplied the same number of arguments. In translating from HOL to FOL, the only way to reconcile this difference is to regard all HOL functions as constants while providing a two-argument function (called app and is abbreviated by @ below) to express function application. In addition, we need a predicate B to convert all top-level FOL terms of boolean type to predicates. These translations allow first-order provers to solve many problems that contain higher-order features, though they do not yield the full power of higher-order logic. For example, the HOL formula ∀F p(F (x)) is translated to {++B(@(p,@(F,x)))} We use the combinators I, K, C, B and S to represent λ-abstractions, asserting the combinator reduction equations as axioms. (Although K and S suffice in theory, the resulting translation is exponential in the number of abstractions.) Another axiom we assert is function extensionality: ∀f g [(∀x f (x) = g(x)) → f = g]. It has the following clause form, where e is a reserved Skolem function symbol, yielding some x such that f (x) 6= g(x). {--equal(@(F,@(@(e,F),G)),@(G,@(@(e,F),G))), ++equal(F,G)} Finally, equality in Isabelle may appear in a λ-abstraction, and thus is treated as an ordinary function symbol. We use a new function symbol fequal to represent the function version of equality. Through λ-reduction, this equality may be promoted to predicate level, becoming an ordinary equality. Promotion requires two additional ax- ioms: {++B(@(@(fequal,X),Y)),--equal(X,Y)} {--B(@(@(fequal,X),Y)),++equal(X,Y)} Empirically Successful Computerized Reasoning 71 Not all subgoals require the full power of HOL. Often the initial steps of the proof replace complicated constructions by simple ones. Of the remaining subgoals, many are purely first-order. Others are higher-order but use no λ-abstractions. These special cases admit more efficient translations into first-order clauses, though naturally we must also provide a translation that accommodates the general case. 2.1 Types in Isabelle/HOL We have just seen how to represent HOL formulae in FOL form. A more important issue is to embed type information of HOL formulae in FOL clauses. Let us review how this works for problems that are already first-order [MP04, MQP06]. Before that, we give a brief overview of Isabelle/HOL’s polymorphically sorted type system. We refer readers to the two papers above for more information. Isabelle/HOL supports axiomatic type classes, where a type class is a set of types. For example, the type for real numbers real is a member of type class linorder. A type class is axiomatic because it may have a set of properties—specified by axioms—that all its member types should satisfy. A type may belong to several type classes and an intersection of type classes is a sort. Moreover, each type constructor has one or more arities, which describe how the result type class depends upon the arguments’ type classes. For example, type constructor list has an arity that says if its argument is a member of class linorder then the resulting list’s type is also a member of linorder. Constants can be overloaded and types can be polymorphic, allowing instantiation to more specific types. For example, the ≤ operator has the polymorphic type α → α → bool; when it has type nat → nat → bool it denotes the usual ordering of the natural numbers, and when it has type α set → α set → bool it denotes the subset relation. The latter type is still polymorphic in the type of the set’s elements. Isabelle’s overloading cannot be eliminated by preprocessing because polymorphic theorems about ≤ are applicable to all instances of this function, despite their different meanings. When we translate Isabelle formulae to FOL clauses, we need to formalize types, especially in view of Isabelle’s heavy use of overloading. We need to ensure that Isabelle theorems involving polymorphic functions are only used for appropriate types. To ac- complish this, polymorphic functions carry type information as additional arguments and we translate Isabelle types to FOL terms. For example, we translate x ≤ y where x and y are α set to le(x, y, set(α)). We also translate Isabelle’s axiomatic type classes into first-order clauses. For this, we translate type classes to FOL predicates and types to FOL terms. This translation is reasonably compact, and it enforces overloading (≤ on sets is not confused with ≤ on integers), but it does not capture other aspects of types. For example, if we declare a two-element datatype two, then we obtain the theorem ∀x [x = a ∨ x = b]. The corresponding clause does not mention type two: {++equal(X,a), ++equal(X,b)} It therefore asserts that the universe consists of two elements; given our other axioms, ATPs easily detect the inconsistency. We simply live with this risk for the moment, 72 Empirically Successful Computerized Reasoning pending the implementation of proof reconstruction. A simple way of detecting such issues is to check whether a proof refers to at least one conjecture clause: if not, then the axiom clauses by themselves are inconsistent. This method detects some invalid proofs, but not all of them. Since HOL problems require currying, we need a different type embedding method from first-order ones. We have implemented three type translations, namely fully-typed, partial-typed and constant-typed. 2.2 The Fully-Typed Translation The fully-typed translation, which resembles Hurd’s translation [Hur02], is sound. The special function typeinfo, which we abbreviate to T, pairs each term with its type. For instance, the formula P < Q is translated to {++B(T(@(T(@(T(<, a=>a=>bool), T(P,a)), a=>bool), T(Q,a)), bool))} This translation is sound because it includes types for all terms and subterms, right down to the variables. When two terms are unified during a resolution step, their types are unified as well. This instantiation of types guarantees that terms created in the course of a proof continue to carry correct types. Isabelle unifies polymorphic terms similarly. In fact, the resolution steps performed by an ATP could in principle be reconstructed in Isabelle. Each FOL axiom clause corresponds to an Isabelle theorem. If two FOL clauses are resolved, then the resolvant FOL clause will correspond to the Isabelle theorem produced by Isabelle’s own resolution rule. The fully-typed translation introduces much redundancy. Every part of a function application is typed: the function’s type includes its argument and result types, which are repeated in the translation of the function’s argument and by including the type of the returned result. Through experiments (Sect. 3), we have found that these large terms degrade the ATPs’ performance. A more compact HOL translation should improve the success rate. Hurd [Hur03] uses an untyped translation for the same reason. No term or predicate has any type information. Because this translation can produce unsound proofs, Hurd relies on proof reconstruction to verify them. If reconstruction fails, Hurd calls the ATP again, using a typed translation. Hurd says that this happens less than one percent of the time. This combination of an efficient but unsound translation with a soundness check achieves both efficiency and soundness. We intend to take the same approach. If we are to achieve a compact HOL translation, we will have to omit some types, potentially admitting some unsound proofs. We cannot use a completely untyped trans- lation because our requirements differ from Hurd’s. His tactic sends to ATPs a few theorems that are chosen by users. In contrast, we send ATPs hundreds of theorems, many involving overloading. Omitting the types from this large collection would result in many absurd proofs, where for example, the operator ≤ simultaneously denoted “less than” on integers and the subset relation. We have designed and experimented with two compact HOL translations: the partial-typed and constant-typed translations. These translations attach the most important type information (such as type instantiations of polymorphic constants) that can block some incorrect resolutions. Empirically Successful Computerized Reasoning 73 2.3 The Partial-Typed Translation The partial-typed translation only includes the types of functions in function calls. The type is translated to a FOL term and is inserted as a third argument of the application operator (@). Taking the previous formula P < Q as an example, we translate it to {++B(@(@(<, P, a=>a=>bool), Q, a=>bool))}. Here, the type of < is a=>a=>bool, and we include this type as an additional argument of function application @. In a HOL formula, a function may be passed to another function as an argument. If a function appears without arguments, we do not include its type. The FOL clauses are derived from Isabelle formulae, which we know to be well-formed and type correct. The partial-typed translation avoids the redundancy of the fully-typed translation. Most of the time, this type encoding also ensures correct treatments of Isabelle overloading: Isabelle overloaded constants are most likely to appear as operators (functions and predicates) in formulae, whose types are inserted by the partial-typed encoding. However, the partial-typed translation can still yield unsound proofs. It is vulnerable to the example involving datatype two, described in Sect. 2.1 above. 2.4 The Constant-Typed Translation In the constant-typed translation, we include types of polymorphic constants only. Fur- thermore, we do not include a constant’s full type but only the instantiated values of its type variables. Monomorphic constants do not need to carry types because their names alone determine the types of their arguments. A polymorphic constant is translated to a first-order function symbol. Its arguments, which represent types, are obtained by matching its actual type against its declared type. This treatment of types is similar to the one we use for problems that are already first-order. Again considering our standard example, if P and Q are natural numbers (type nat), we translate the formula P < Q to {++B(@(@(<(nat), P), Q))}. Similarly, if P and Q are sets (type α set), it becomes {++B(@(@(<(set(a)), P), Q))}. As for equality, if it appears as a predicate, then we do not insert its type. However, if it appears as a constant in a combinator term, then we include its argument’s type as its argument. Similarly, we translated the equality axiom above to the two clauses {++B(@(@(fequal(T),X),Y)),--equal(X,Y)} {--B(@(@(fequal(T),X),Y)),++equal(X,Y)} This translation can reduce the size of terms significantly. However, like the partial- typed one, it can be unsound. 74 Empirically Successful Computerized Reasoning 2.5 Which Translation to Use and Soundness Issues Of the three HOL to FOL translations above, the fully-typed one is sound but pro- duces excessively large terms. The partial-typed and constant-typed translations are more compact, but may introduce unsound proofs. If we use either of the compact translations, then we must verify proofs in Isabelle to ensure soundness. There are several factors that affect the decision about which translation should be used as the default. • Can we verify the proofs for partial-typed and constant-typed translations? The answer is yes. Although the FOL clauses carry insufficient type information, the clauses still correspond to Isabelle lemmas and goals. Our approach to proof reconstruction, which is currently being implemented, involves following the low- level resolution steps. If two clauses cannot be resolved due to incompatible types, Isabelle will detect this. • What benefit do we obtain from using the compact translations? Our experi- mental results (Sect. 3) show that the compact translations can boost the success rate significantly. Therefore, it is worthwhile to use a compact translation, even if occasional unsound proofs require retrying the problem using the fully-typed translation. Moreover, we aim to reconstruct proofs in Isabelle even if we use fully-typed trans- lation. This is so that proofs can go through Isabelle kernel. Since proof reconstruction is needed regardless of which translation is used, the only potential extra cost involved in using a compact translation is that of occasional retries. 3 Experiments It is obvious that the constant-typed translation is the most compact, while the fully- typed one is the least compact. We can therefore predict that the constant-typed trans- lation will deliver the best results with ATPs, while the fully-typed one will turn out to be the worst. However, such claims need to be backed up by observations, especially given that the fully-typed translation is the best for soundness. For our experiments, we took 79 problems generated by Isabelle, most of which are higher-order. Since our HOL translation can also be used for purely FOL problems and our experiments were aimed at testing efficiency of the translation methods, we translated all problems (both HOL and FOL) using the three translation methods we mentioned in the previous section. We used our relevance filter [MP06] to reduce the sizes of the problem. We ran these tests on a bank of Dual AMD Opteron processors running at 2400MHz, using Condor1 to manage our batch jobs. Each graph compares the success rates of the three translations, for some prover, as the runtime per problem increases from 10 to 300 seconds. These short runtimes are appropriate for our application of ATPs to support interactive proofs, We tested three provers: E (Fig. 1), SPASS 2.2 (Fig. 2) and Vampire 8 (Fig. 3). We used E version 1 http://www.cs.wisc.edu/condor/ Empirically Successful Computerized Reasoning 75 80% full partial 70% constant 60% 50% 40% 0s 50s 100s 150s 200s 250s 300s Figure 1: E, Version 0.91dev001 0.91dev001, a development version that surpasses E 0.9 “Soom”. SPASS ran in auto- matic mode and, in a second run, with SOS enabled and splitting disabled.2 Vampire ran in its default mode and with its CASC option. On the whole, the constant-typed translation did indeed yield the highest success rate, while the fully-typed translation yielded the lowest, especially when runtime is increased. This is as we would expect, but a glance at the graphs shows that the situation is more complex than this. Against expectations, the partial-typed translation frequently outperforms the constant-typed one. For runtimes below about 200 seconds, the partial-typed translation gives the best results with SPASS (default settings) and Vampire. As runtime increases to 300 seconds, the constant-typed translation comes out top (or nearly) in all five graphs. At its default settings, SPASS does not perform well with any translation. It is safe to conjecture that only trivial problems are being proved, where the translation makes little difference. By enabling SOS, which makes the proof search more goal-directed, SPASS delivers excellent results with the constant-typed translation. Readers may ask whether the fully-typed translation has a lower success rate because the other translations are finding unsound proofs for our problems. We have found that three of the 79 problems have unsound proofs. For one of the problems, incorrect axioms cause all three of its translations to be unsound, so this error does not bias the results. For the other two problems, the compact translations are indeed to blame, giving a bias of 2/79 or 2.5% against the fully-typed translation. The advantage given by the compact translations is much greater than this. We do regard this unsoundness rate as too high, and we are considering a number of options for reducing it. To obtain a quantitative picture of the differences between the three translations, we chose one of the problems and used tptp2X [SS04] to summarize its syntactic features. This problem is of median size in our problem set. It has 1150 clauses after relevance filtering, of which 105 are non-trivial: the remainder constitute a monadic Horn theory that describes Isabelle’s type class system. Table 1 shows the figures common to all three translations. Table 2 shows variations among the translations. A major difference is the maximal term depth, where fully-typed appears to have the greatest maximal term depth while constant-typed has the least. Shallower terms 2 The precise option string is -Splits=0 -FullRed=0 -SOS=1. 76 Empirically Successful Computerized Reasoning 80% full partial 70% constant 60% 50% 40% 0s 50s 100s 150s 200s 250s 300s 80% full partial 70% constant 60% 50% 40% 0s 50s 100s 150s 200s 250s 300s Figure 2: SPASS, Default Settings and with SOS 80% full partial 70% constant 60% 50% 40% 0s 50s 100s 150s 200s 250s 300s 80% full partial 70% constant 60% 50% 40% 0s 50s 100s 150s 200s 250s 300s Figure 3: Vampire, Default Settings and in CASC Mode Empirically Successful Computerized Reasoning 77 Number of clauses 1150 (6 non-Horn; 198 unit) Number of literals 2105 (152 equality) Maximal clause size 3 (1 average) Number of predicates 74 (0 propositional; 1–2 arity) Table 1: Common to All Translations No. of functors No. of variables Max. term depth File size full 51 (42 constant) 1292 (69 singleton) 19 (1 average) 223460 partial 50 (42 constant) 1265 (66 singleton) 13 (1 average) 196663 constant 49 (10 constant) 1292 (106 singleton) 8 (1 average) 171724 Table 2: Differences between Three Translations may be easier to process by ATPs, which may be a reason why constant-typed performs better overall. In addition, constant-typed produces the smallest problem file. Finally, although the fully-typed and constant-typed translations have the same number of vari- ables, the latter has more singleton variables, which are variables that occur in a clause only once. These statistics again suggest that the constant-typed translation should give the best results, so we have no explanation for the many situations in which the partial- typed translation performs best. The dips in the graphs, where the success rate drops as the runtime goes up, are also puzzling. A notable one is with Vampire, default settings, with the fully-typed translation (Fig. 3). Vampire’s limited resource strategy, which discards clauses that cannot be used within the time limit, may explain its dips. For E and SPASS we have no explanation, but we have no doubt that the dips are real. Similar dips appear in our other experiments [MP06], and all of these measurements are made by automatic procedures. 4 Conclusions We have described three HOL to FOL translations, which differ in their treatment of types. We have carried out extensive experiments to evaluate the effectiveness of the three translations. We have also obtained some statistics concerning how compact our translations are. Of the three translations—fully-typed, partial-typed and constant- typed—the constant-typed translation produces the most compact output. Naturally, we would expect a provers’ success rate to increase with a more compact clause form. This is what we have seen with E and with SPASS (provided SOS is enabled). However, we are surprised to see the simplest and most compact format does not always yield the best results with Vampire. Given that Vampire gives the best overall results, the partial-typed translation is worth considering. Because the more compact translations can be unsound, proofs found using them must be validated in some way, such as by proof reconstruction. The higher-order logic we have investigated is Isabelle/HOL. However, our transla- tions should be equally applicable to the similar logic implemented in the HOL4 system. 78 Empirically Successful Computerized Reasoning Any translations for PVS would have to take account of predicate subtyping, but their treatment of basic types might be based on our techniques. Acknowledgements The research was funded by the epsrc grant GR/S57198/01 Automation for Interactive Proof and by the L4.verified project of National ICT Australia. Joe Hurd has given much helpful advice on how to translate from HOL to FOL. The referees made many useful suggestions for improving this paper. References [BR04] David Basin and Michaël Rusinowitch, editors. Automated Reasoning — Second International Joint Conference, IJCAR 2004, LNAI 3097. Springer, 2004. [GM93] Michael J. C. Gordon and Thomas F. Melham. Introduction to HOL: A Theorem Proving Environment for Higher Order Logic. Cambridge Univ. Press, 1993. [Hur02] Joe Hurd. An LCF-style interface between HOL and first-order logic. In Andrei Voronkov, editor, Automated Deduction — CADE-18 International Conference, LNAI 2392, pages 134–138. Springer, 2002. [Hur03] Joe Hurd. First-order proof tactics in higher-order logic theorem provers. In Myla Archer, Ben Di Vito, and César Muñoz, editors, Design and Applica- tion of Strategies/Tactics in Higher Order Logics, number NASA/CP-2003- 212448 in NASA Technical Reports, pages 56–68, September 2003. [MP04] Jia Meng and Lawrence C. Paulson. Experiments on supporting interactive proof using resolution. In Basin and Rusinowitch [BR04], pages 372–384. [MP06] Jia Meng and Lawrence C. Paulson. Lightweight relevance filtering for machine-generated resolution problems. These proceedings, 2006. [MQP06] Jia Meng, Claire Quigley, and Lawrence C. Paulson. Automation for in- teractive proof: First prototype. Information and Computation, 2006. in press. [NPW02] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Springer, 2002. LNCS Tutorial 2283. [ORR+ 96] S. Owre, S. Rajan, J.M. Rushby, N. Shankar, and M.K. Srivas. PVS: Com- bining specification, proof checking, and model checking. In Rajeev Alur and Thomas A. Henzinger, editors, Computer Aided Verification: 8th Inter- national Conference, CAV ’96, LNCS 1102, pages 411–414. Springer, 1996. Empirically Successful Computerized Reasoning 79 [RV01] Alexander Riazanov and Andrei Voronkov. Vampire 1.1 (system descrip- tion). In Rajeev Goré, Alexander Leitsch, and Tobias Nipkow, editors, Au- tomated Reasoning — First International Joint Conference, IJCAR 2001, LNAI 2083, pages 376–380. Springer, 2001. [Sch04] Stephan Schulz. System description: E 0.81. In Basin and Rusinowitch [BR04], pages 223–228. [SS04] Geoff Sutcliffe and Christian Suttner. The TPTP problem library for au- tomated theorem proving. On the Internet at http://www.cs.miami.edu/ ∼tptp/, 2004. [Wei01] Christoph Weidenbach. Combining superposition, sorts and splitting. In Alan Robinson and Andrei Voronkov, editors, Handbook of Automated Rea- soning, volume II, chapter 27, pages 1965–2013. Elsevier Science, 2001. 80 Empirically Successful Computerized Reasoning