Translating Higher-Order Problems to First-Order Clauses
                              Jia Meng1 , Lawrence C. Paulson2
                                  1
                                    National ICT, Australia
                                   jiameng@nicta.com.au
                    2
                      Computer Laboratory, University of Cambridge, U.K.
                                      LP15@cam.ac.uk


                                                 Abstract
               Proofs involving large specifications are typically carried out through interactive
           provers that use higher-order logic. A promising approach to improve the automa-
           tion of interactive provers is by integrating them with automatic provers, which
           are usually based on first-order logic. Consequently, it is necessary to translate
           higher-order logic formulae to first-order form. This translation should ideally be
           both sound and practical. We have implemented three higher-order to first-order
           translations, with particular emphasis on the translation of types. Omitting some
           type information improves the success rate, but can be unsound, so the interactive
           prover must verify the proofs. In this paper, we will describe our translations and
           experimental data that compares the three translations in respect of their success
           rates for various automatic provers.


     1    Introduction
     Interactive theorem provers, such as HOL4 [GM93], Isabelle [NPW02] and
     PVS [ORR+ 96] are widely used for formal verifications and specifications. They provide
     expressive formalisms and tools for managing large scale proof projects. However, a ma-
     jor weakness of interactive provers is the lack of automation. In order to overcome the
     problem, we have integrated Isabelle with automatic theorem provers (ATPs) [MQP06].
     ATPs combine a variety of reasoning methods and do not require users’ instructions
     on how to use an axiom or when to use an axiom. For example, they do not require
     equalities to be oriented, but use them as undirected equations.
         Among the many logics used by interactive provers, higher-order logic (HOL) is the
     most popular one because of its expressiveness. One of the most widely used logics in
     Isabelle is Isabelle/HOL. In contrast, most of the powerful ATPs are based on first-order
     logic (FOL). Therefore, it is important to translate HOL problems into FOL format.
         Those HOL problems that do not involve function variables, predicate variables or
     λ-abstractions can be translated directly to FOL format. However, we must be careful
     about how to translate HOL problems that are truly higher-order. In particular, we need
     to include the problem’s type information to preserve soundness. A sound approach is to
     include all types for all terms. Unfortunately, this will result in large terms and clauses,
     and much of the type information is redundant. A more compact type representation


70                                                                     Empirically Successful Computerized Reasoning
     could enhance the performance of ATPs. Omitting some type information could lead to
     unsoundness, which ultimately we will prevent through proof reconstruction (Sect. 2.5).
        We have implemented three HOL to FOL translations, and we believe two of them are
     new. We have also carried out extensive experiments in order to assess their effectiveness
     with the provers E [Sch04], SPASS [Wei01] and Vampire [RV01].
        Paper outline. We first describe three HOL to FOL translations and discuss their
     soundness (Sect. 2). We then describe the experiments we ran (Sect. 3) and finally offer
     some conclusions (Sect. 4).


     2      Background
     Higher-order logic (HOL) extends first-order logic (FOL) in several respects. The main
     difference is that HOL terms can denote truth values and functions. Function values
     can be expressed using λ-abstractions or by currying: that is, by applying a function
     to fewer than the maximum number of arguments. In FOL, a function must always be
     supplied the same number of arguments. In translating from HOL to FOL, the only way
     to reconcile this difference is to regard all HOL functions as constants while providing
     a two-argument function (called app and is abbreviated by @ below) to express function
     application. In addition, we need a predicate B to convert all top-level FOL terms of
     boolean type to predicates. These translations allow first-order provers to solve many
     problems that contain higher-order features, though they do not yield the full power of
     higher-order logic.
         For example, the HOL formula ∀F p(F (x)) is translated to

     {++B(@(p,@(F,x)))}

     We use the combinators I, K, C, B and S to represent λ-abstractions, asserting the
     combinator reduction equations as axioms. (Although K and S suffice in theory, the
     resulting translation is exponential in the number of abstractions.) Another axiom we
     assert is function extensionality:

                                          ∀f g [(∀x f (x) = g(x)) → f = g].

     It has the following clause form, where e is a reserved Skolem function symbol, yielding
     some x such that f (x) 6= g(x).

     {--equal(@(F,@(@(e,F),G)),@(G,@(@(e,F),G))),
      ++equal(F,G)}

        Finally, equality in Isabelle may appear in a λ-abstraction, and thus is treated as
     an ordinary function symbol. We use a new function symbol fequal to represent the
     function version of equality. Through λ-reduction, this equality may be promoted to
     predicate level, becoming an ordinary equality. Promotion requires two additional ax-
     ioms:

     {++B(@(@(fequal,X),Y)),--equal(X,Y)}
     {--B(@(@(fequal,X),Y)),++equal(X,Y)}


Empirically Successful Computerized Reasoning                                                     71
         Not all subgoals require the full power of HOL. Often the initial steps of the proof
     replace complicated constructions by simple ones. Of the remaining subgoals, many
     are purely first-order. Others are higher-order but use no λ-abstractions. These special
     cases admit more efficient translations into first-order clauses, though naturally we must
     also provide a translation that accommodates the general case.

     2.1   Types in Isabelle/HOL
     We have just seen how to represent HOL formulae in FOL form. A more important
     issue is to embed type information of HOL formulae in FOL clauses. Let us review how
     this works for problems that are already first-order [MP04, MQP06]. Before that, we
     give a brief overview of Isabelle/HOL’s polymorphically sorted type system. We refer
     readers to the two papers above for more information.
         Isabelle/HOL supports axiomatic type classes, where a type class is a set of types.
     For example, the type for real numbers real is a member of type class linorder. A type
     class is axiomatic because it may have a set of properties—specified by axioms—that
     all its member types should satisfy. A type may belong to several type classes and an
     intersection of type classes is a sort. Moreover, each type constructor has one or more
     arities, which describe how the result type class depends upon the arguments’ type
     classes. For example, type constructor list has an arity that says if its argument is a
     member of class linorder then the resulting list’s type is also a member of linorder.
         Constants can be overloaded and types can be polymorphic, allowing instantiation
     to more specific types. For example, the ≤ operator has the polymorphic type α →
     α → bool; when it has type nat → nat → bool it denotes the usual ordering of the
     natural numbers, and when it has type α set → α set → bool it denotes the subset
     relation. The latter type is still polymorphic in the type of the set’s elements. Isabelle’s
     overloading cannot be eliminated by preprocessing because polymorphic theorems about
     ≤ are applicable to all instances of this function, despite their different meanings.
         When we translate Isabelle formulae to FOL clauses, we need to formalize types,
     especially in view of Isabelle’s heavy use of overloading. We need to ensure that Isabelle
     theorems involving polymorphic functions are only used for appropriate types. To ac-
     complish this, polymorphic functions carry type information as additional arguments
     and we translate Isabelle types to FOL terms. For example, we translate x ≤ y where
     x and y are α set to le(x, y, set(α)). We also translate Isabelle’s axiomatic type classes
     into first-order clauses. For this, we translate type classes to FOL predicates and types
     to FOL terms.
         This translation is reasonably compact, and it enforces overloading (≤ on sets is
     not confused with ≤ on integers), but it does not capture other aspects of types. For
     example, if we declare a two-element datatype two, then we obtain the theorem

                                         ∀x [x = a ∨ x = b].

     The corresponding clause does not mention type two:
     {++equal(X,a), ++equal(X,b)}
     It therefore asserts that the universe consists of two elements; given our other axioms,
     ATPs easily detect the inconsistency. We simply live with this risk for the moment,


72                                                                Empirically Successful Computerized Reasoning
     pending the implementation of proof reconstruction. A simple way of detecting such
     issues is to check whether a proof refers to at least one conjecture clause: if not, then the
     axiom clauses by themselves are inconsistent. This method detects some invalid proofs,
     but not all of them.
         Since HOL problems require currying, we need a different type embedding method
     from first-order ones. We have implemented three type translations, namely fully-typed,
     partial-typed and constant-typed.

     2.2      The Fully-Typed Translation
     The fully-typed translation, which resembles Hurd’s translation [Hur02], is sound. The
     special function typeinfo, which we abbreviate to T, pairs each term with its type. For
     instance, the formula P < Q is translated to

     {++B(T(@(T(@(T(<, a=>a=>bool), T(P,a)), a=>bool), T(Q,a)), bool))}

     This translation is sound because it includes types for all terms and subterms, right
     down to the variables. When two terms are unified during a resolution step, their types
     are unified as well. This instantiation of types guarantees that terms created in the
     course of a proof continue to carry correct types. Isabelle unifies polymorphic terms
     similarly. In fact, the resolution steps performed by an ATP could in principle be
     reconstructed in Isabelle. Each FOL axiom clause corresponds to an Isabelle theorem.
     If two FOL clauses are resolved, then the resolvant FOL clause will correspond to the
     Isabelle theorem produced by Isabelle’s own resolution rule.
         The fully-typed translation introduces much redundancy. Every part of a function
     application is typed: the function’s type includes its argument and result types, which
     are repeated in the translation of the function’s argument and by including the type of
     the returned result. Through experiments (Sect. 3), we have found that these large terms
     degrade the ATPs’ performance. A more compact HOL translation should improve the
     success rate.
         Hurd [Hur03] uses an untyped translation for the same reason. No term or predicate
     has any type information. Because this translation can produce unsound proofs, Hurd
     relies on proof reconstruction to verify them. If reconstruction fails, Hurd calls the ATP
     again, using a typed translation. Hurd says that this happens less than one percent of
     the time. This combination of an efficient but unsound translation with a soundness
     check achieves both efficiency and soundness. We intend to take the same approach.
         If we are to achieve a compact HOL translation, we will have to omit some types,
     potentially admitting some unsound proofs. We cannot use a completely untyped trans-
     lation because our requirements differ from Hurd’s. His tactic sends to ATPs a few
     theorems that are chosen by users. In contrast, we send ATPs hundreds of theorems,
     many involving overloading. Omitting the types from this large collection would result
     in many absurd proofs, where for example, the operator ≤ simultaneously denoted “less
     than” on integers and the subset relation. We have designed and experimented with two
     compact HOL translations: the partial-typed and constant-typed translations. These
     translations attach the most important type information (such as type instantiations of
     polymorphic constants) that can block some incorrect resolutions.


Empirically Successful Computerized Reasoning                                                        73
     2.3   The Partial-Typed Translation
     The partial-typed translation only includes the types of functions in function calls. The
     type is translated to a FOL term and is inserted as a third argument of the application
     operator (@). Taking the previous formula P < Q as an example, we translate it to

     {++B(@(@(<, P, a=>a=>bool), Q, a=>bool))}.

     Here, the type of < is a=>a=>bool, and we include this type as an additional argument
     of function application @.
         In a HOL formula, a function may be passed to another function as an argument. If
     a function appears without arguments, we do not include its type. The FOL clauses are
     derived from Isabelle formulae, which we know to be well-formed and type correct. The
     partial-typed translation avoids the redundancy of the fully-typed translation. Most
     of the time, this type encoding also ensures correct treatments of Isabelle overloading:
     Isabelle overloaded constants are most likely to appear as operators (functions and
     predicates) in formulae, whose types are inserted by the partial-typed encoding.
         However, the partial-typed translation can still yield unsound proofs. It is vulnerable
     to the example involving datatype two, described in Sect. 2.1 above.

     2.4   The Constant-Typed Translation
     In the constant-typed translation, we include types of polymorphic constants only. Fur-
     thermore, we do not include a constant’s full type but only the instantiated values of its
     type variables. Monomorphic constants do not need to carry types because their names
     alone determine the types of their arguments. A polymorphic constant is translated to
     a first-order function symbol. Its arguments, which represent types, are obtained by
     matching its actual type against its declared type. This treatment of types is similar to
     the one we use for problems that are already first-order.
         Again considering our standard example, if P and Q are natural numbers (type nat),
     we translate the formula P < Q to

     {++B(@(@(<(nat), P), Q))}.

     Similarly, if P and Q are sets (type α set), it becomes

     {++B(@(@(<(set(a)), P), Q))}.

          As for equality, if it appears as a predicate, then we do not insert its type. However,
     if it appears as a constant in a combinator term, then we include its argument’s type as
     its argument. Similarly, we translated the equality axiom above to the two clauses

     {++B(@(@(fequal(T),X),Y)),--equal(X,Y)}
     {--B(@(@(fequal(T),X),Y)),++equal(X,Y)}

        This translation can reduce the size of terms significantly. However, like the partial-
     typed one, it can be unsound.


74                                                                Empirically Successful Computerized Reasoning
     2.5        Which Translation to Use and Soundness Issues
     Of the three HOL to FOL translations above, the fully-typed one is sound but pro-
     duces excessively large terms. The partial-typed and constant-typed translations are
     more compact, but may introduce unsound proofs. If we use either of the compact
     translations, then we must verify proofs in Isabelle to ensure soundness.
        There are several factors that affect the decision about which translation should be
     used as the default.

             • Can we verify the proofs for partial-typed and constant-typed translations? The
               answer is yes. Although the FOL clauses carry insufficient type information, the
               clauses still correspond to Isabelle lemmas and goals. Our approach to proof
               reconstruction, which is currently being implemented, involves following the low-
               level resolution steps. If two clauses cannot be resolved due to incompatible types,
               Isabelle will detect this.

             • What benefit do we obtain from using the compact translations? Our experi-
               mental results (Sect. 3) show that the compact translations can boost the success
               rate significantly. Therefore, it is worthwhile to use a compact translation, even
               if occasional unsound proofs require retrying the problem using the fully-typed
               translation.

         Moreover, we aim to reconstruct proofs in Isabelle even if we use fully-typed trans-
     lation. This is so that proofs can go through Isabelle kernel. Since proof reconstruction
     is needed regardless of which translation is used, the only potential extra cost involved
     in using a compact translation is that of occasional retries.


     3        Experiments
     It is obvious that the constant-typed translation is the most compact, while the fully-
     typed one is the least compact. We can therefore predict that the constant-typed trans-
     lation will deliver the best results with ATPs, while the fully-typed one will turn out
     to be the worst. However, such claims need to be backed up by observations, especially
     given that the fully-typed translation is the best for soundness.
         For our experiments, we took 79 problems generated by Isabelle, most of which are
     higher-order. Since our HOL translation can also be used for purely FOL problems
     and our experiments were aimed at testing efficiency of the translation methods, we
     translated all problems (both HOL and FOL) using the three translation methods we
     mentioned in the previous section. We used our relevance filter [MP06] to reduce the
     sizes of the problem. We ran these tests on a bank of Dual AMD Opteron processors
     running at 2400MHz, using Condor1 to manage our batch jobs.
         Each graph compares the success rates of the three translations, for some prover,
     as the runtime per problem increases from 10 to 300 seconds. These short runtimes
     are appropriate for our application of ATPs to support interactive proofs, We tested
     three provers: E (Fig. 1), SPASS 2.2 (Fig. 2) and Vampire 8 (Fig. 3). We used E version
         1
             http://www.cs.wisc.edu/condor/


Empirically Successful Computerized Reasoning                                                         75
            80%
                                                                                               full
                                                                                               partial
            70%
                                                                                               constant
            60%

            50%

            40%
               0s         50s      100s      150s      200s      250s       300s

                                     Figure 1: E, Version 0.91dev001


     0.91dev001, a development version that surpasses E 0.9 “Soom”. SPASS ran in auto-
     matic mode and, in a second run, with SOS enabled and splitting disabled.2 Vampire
     ran in its default mode and with its CASC option.
          On the whole, the constant-typed translation did indeed yield the highest success
     rate, while the fully-typed translation yielded the lowest, especially when runtime is
     increased. This is as we would expect, but a glance at the graphs shows that the
     situation is more complex than this. Against expectations, the partial-typed translation
     frequently outperforms the constant-typed one. For runtimes below about 200 seconds,
     the partial-typed translation gives the best results with SPASS (default settings) and
     Vampire. As runtime increases to 300 seconds, the constant-typed translation comes
     out top (or nearly) in all five graphs.
          At its default settings, SPASS does not perform well with any translation. It is safe
     to conjecture that only trivial problems are being proved, where the translation makes
     little difference. By enabling SOS, which makes the proof search more goal-directed,
     SPASS delivers excellent results with the constant-typed translation.
          Readers may ask whether the fully-typed translation has a lower success rate because
     the other translations are finding unsound proofs for our problems. We have found that
     three of the 79 problems have unsound proofs. For one of the problems, incorrect axioms
     cause all three of its translations to be unsound, so this error does not bias the results.
     For the other two problems, the compact translations are indeed to blame, giving a
     bias of 2/79 or 2.5% against the fully-typed translation. The advantage given by the
     compact translations is much greater than this. We do regard this unsoundness rate as
     too high, and we are considering a number of options for reducing it.
          To obtain a quantitative picture of the differences between the three translations, we
     chose one of the problems and used tptp2X [SS04] to summarize its syntactic features.
     This problem is of median size in our problem set. It has 1150 clauses after relevance
     filtering, of which 105 are non-trivial: the remainder constitute a monadic Horn theory
     that describes Isabelle’s type class system. Table 1 shows the figures common to all
     three translations. Table 2 shows variations among the translations.
          A major difference is the maximal term depth, where fully-typed appears to have
     the greatest maximal term depth while constant-typed has the least. Shallower terms
       2
           The precise option string is -Splits=0 -FullRed=0 -SOS=1.


76                                                                      Empirically Successful Computerized Reasoning
           80%
                                                                                 full
                                                                                 partial
           70%
                                                                                 constant
           60%

           50%

           40%
              0s            50s         100s    150s   200s   250s    300s

           80%
                                                                                 full
                                                                                 partial
           70%
                                                                                 constant
           60%

           50%

           40%
              0s            50s         100s    150s   200s   250s    300s

                               Figure 2: SPASS, Default Settings and with SOS

           80%
                                                                                 full
                                                                                 partial
           70%
                                                                                 constant
           60%

           50%

           40%
              0s            50s         100s    150s   200s   250s    300s

           80%
                                                                                 full
                                                                                 partial
           70%
                                                                                 constant
           60%

           50%

           40%
              0s            50s         100s    150s   200s   250s    300s

                          Figure 3: Vampire, Default Settings and in CASC Mode


Empirically Successful Computerized Reasoning                                               77
      Number of clauses       1150 (6 non-Horn; 198 unit)
      Number of literals      2105 (152 equality)
      Maximal clause size     3 (1 average)
      Number of predicates    74 (0 propositional; 1–2 arity)

                              Table 1: Common to All Translations

                 No. of functors     No. of variables       Max. term depth          File size
      full       51 (42 constant)    1292 (69 singleton)    19 (1 average)           223460
      partial    50 (42 constant)    1265 (66 singleton)    13 (1 average)           196663
      constant   49 (10 constant)    1292 (106 singleton)   8 (1 average)            171724

                        Table 2: Differences between Three Translations

     may be easier to process by ATPs, which may be a reason why constant-typed performs
     better overall. In addition, constant-typed produces the smallest problem file. Finally,
     although the fully-typed and constant-typed translations have the same number of vari-
     ables, the latter has more singleton variables, which are variables that occur in a clause
     only once.
         These statistics again suggest that the constant-typed translation should give the
     best results, so we have no explanation for the many situations in which the partial-
     typed translation performs best. The dips in the graphs, where the success rate drops
     as the runtime goes up, are also puzzling. A notable one is with Vampire, default
     settings, with the fully-typed translation (Fig. 3). Vampire’s limited resource strategy,
     which discards clauses that cannot be used within the time limit, may explain its dips.
     For E and SPASS we have no explanation, but we have no doubt that the dips are real.
     Similar dips appear in our other experiments [MP06], and all of these measurements are
     made by automatic procedures.


     4    Conclusions
     We have described three HOL to FOL translations, which differ in their treatment of
     types. We have carried out extensive experiments to evaluate the effectiveness of the
     three translations. We have also obtained some statistics concerning how compact our
     translations are. Of the three translations—fully-typed, partial-typed and constant-
     typed—the constant-typed translation produces the most compact output.
         Naturally, we would expect a provers’ success rate to increase with a more compact
     clause form. This is what we have seen with E and with SPASS (provided SOS is
     enabled). However, we are surprised to see the simplest and most compact format does
     not always yield the best results with Vampire. Given that Vampire gives the best
     overall results, the partial-typed translation is worth considering. Because the more
     compact translations can be unsound, proofs found using them must be validated in
     some way, such as by proof reconstruction.
         The higher-order logic we have investigated is Isabelle/HOL. However, our transla-
     tions should be equally applicable to the similar logic implemented in the HOL4 system.


78                                                              Empirically Successful Computerized Reasoning
     Any translations for PVS would have to take account of predicate subtyping, but their
     treatment of basic types might be based on our techniques.


     Acknowledgements
     The research was funded by the epsrc grant GR/S57198/01 Automation for Interactive
     Proof and by the L4.verified project of National ICT Australia. Joe Hurd has given
     much helpful advice on how to translate from HOL to FOL. The referees made many
     useful suggestions for improving this paper.


     References
     [BR04]         David Basin and Michaël Rusinowitch, editors. Automated Reasoning —
                    Second International Joint Conference, IJCAR 2004, LNAI 3097. Springer,
                    2004.

     [GM93]         Michael J. C. Gordon and Thomas F. Melham. Introduction to HOL: A
                    Theorem Proving Environment for Higher Order Logic. Cambridge Univ.
                    Press, 1993.

     [Hur02]        Joe Hurd. An LCF-style interface between HOL and first-order logic. In
                    Andrei Voronkov, editor, Automated Deduction — CADE-18 International
                    Conference, LNAI 2392, pages 134–138. Springer, 2002.

     [Hur03]        Joe Hurd. First-order proof tactics in higher-order logic theorem provers. In
                    Myla Archer, Ben Di Vito, and César Muñoz, editors, Design and Applica-
                    tion of Strategies/Tactics in Higher Order Logics, number NASA/CP-2003-
                    212448 in NASA Technical Reports, pages 56–68, September 2003.

     [MP04]         Jia Meng and Lawrence C. Paulson. Experiments on supporting interactive
                    proof using resolution. In Basin and Rusinowitch [BR04], pages 372–384.

     [MP06]         Jia Meng and Lawrence C. Paulson. Lightweight relevance filtering for
                    machine-generated resolution problems. These proceedings, 2006.

     [MQP06]        Jia Meng, Claire Quigley, and Lawrence C. Paulson. Automation for in-
                    teractive proof: First prototype. Information and Computation, 2006. in
                    press.

     [NPW02] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL:
             A Proof Assistant for Higher-Order Logic. Springer, 2002. LNCS Tutorial
             2283.

     [ORR+ 96] S. Owre, S. Rajan, J.M. Rushby, N. Shankar, and M.K. Srivas. PVS: Com-
               bining specification, proof checking, and model checking. In Rajeev Alur
               and Thomas A. Henzinger, editors, Computer Aided Verification: 8th Inter-
               national Conference, CAV ’96, LNCS 1102, pages 411–414. Springer, 1996.


Empirically Successful Computerized Reasoning                                                       79
     [RV01]    Alexander Riazanov and Andrei Voronkov. Vampire 1.1 (system descrip-
               tion). In Rajeev Goré, Alexander Leitsch, and Tobias Nipkow, editors, Au-
               tomated Reasoning — First International Joint Conference, IJCAR 2001,
               LNAI 2083, pages 376–380. Springer, 2001.

     [Sch04]   Stephan Schulz. System description: E 0.81. In Basin and Rusinowitch
               [BR04], pages 223–228.

     [SS04]    Geoff Sutcliffe and Christian Suttner. The TPTP problem library for au-
               tomated theorem proving. On the Internet at http://www.cs.miami.edu/
               ∼tptp/, 2004.


     [Wei01]   Christoph Weidenbach. Combining superposition, sorts and splitting. In
               Alan Robinson and Andrei Voronkov, editors, Handbook of Automated Rea-
               soning, volume II, chapter 27, pages 1965–2013. Elsevier Science, 2001.


80                                                          Empirically Successful Computerized Reasoning