Efficiently Checking Propositional Resolution Proofs in Isabelle/HOL Tjark Weber1 1 Institut für Informatik, Technische Universität München Boltzmannstr. 3, D-85748 Garching b. München, Germany webertj@in.tum.de Abstract This paper describes the integration of zChaff and MiniSat, currently two leading SAT solvers, with Isabelle/HOL. Both SAT solvers generate resolution-style proofs for (instances of) propositional tautologies. These proofs are verified by the theorem prover. The presented approach significantly improves Isabelle’s performance on propositional problems, and exhibits counterexamples for unprovable conjectures. It is shown that an LCF-style theorem prover can serve as a viable proof checker even for large SAT problems. An efficient representation of the propositional problem in the theorem prover turns out to be crucial; several possible solutions are discussed. 1 Introduction Interactive theorem provers like PVS [ORS92], HOL [GM93] or Isabelle [Pau94] tradi- tionally support rich specification logics. Proof search and automation for these logics however is difficult, and proving a non-trivial theorem usually requires manual guidance by an expert user. Automated theorem provers on the other hand, while often designed for simpler logics, have become increasingly powerful over the past few years. New algo- rithms, improved heuristics and faster hardware allow interesting theorems to be proved with little or no human interaction, sometimes within seconds. By integrating automated provers with interactive systems, we can preserve the richness of our specification logic and at the same time increase the degree of automa- tion [Sha01]. This is an idea that goes back at least to the early nineties [KKS91]. However, to ensure that a potential bug in the automated prover does not render the whole system unsound, theorems in Isabelle, like in other LCF-style [Gor00] provers, can be derived only through a fixed set of core inference rules. Therefore it is not suffi- cient for the automated prover to return whether a formula is provable, but it must also generate the actual proof, expressed (or expressable) in terms of the interactive system’s inference rules. Formal verification is an important application area of interactive theorem prov- ing. Problems in verification can often be reduced to Boolean satisfiability (SAT), and recent SAT solver advances have made this approach feasible in practice. Hence the performance of an interactive prover on propositional problems may be of significant practical importance. In this paper we describe the integration of zChaff [MMZ+ 01] and MiniSat [ES04], two leading SAT solvers, with the Isabelle/HOL [NPW02] prover. We have shown earlier [Web05a, Web05b] that using a SAT solver to prove theo- rems of propositional logic dramatically improves Isabelle’s performance on this class of formulae, even when a rather naive (and unfortunately, as we will see in Section 3, inef- ficient) representation of propositional problems is used. Furthermore, while Isabelle’s previous decision procedures simply fail on unprovable conjectures, SAT solvers are able to produce concrete counterexamples. In this paper we focus on recent improvements of the proof reconstruction algorithm in Isabelle/HOL, which cause a speedup by sev- eral orders of magnitude. In particular the representation of the propositional problem turns out to be crucial for performance. While the implementation in [Web05a] was still limited to relatively small SAT problems, the recent improvements now allow to check proofs with millions of resolution steps in reasonable time. This shows that, somewhat contrary to common belief, efficient proof checking in an LCF-style system is feasible. The next section describes the integration of zChaff and MiniSat with Isabelle/HOL in more detail. In Section 3 we evaluate the performance of our approach, and report on experimental results. Related work is discussed in Section 4. Section 5 concludes this paper with some final remarks and points out directions for future research. 2 System Description To prove a propositional tautology φ in the Isabelle/HOL system with the help of zChaff or MiniSat, we proceed in several steps. First φ is negated, and the negation is converted into an equivalent formula φ∗ in conjunctive normal form. φ∗ is then written to a file in DIMACS CNF format [DIM93], the standard input format supported by most SAT solvers. zChaff and MiniSat, when run on this file, return either “unsatisfiable”, or a satisfying assignment for φ∗ . In the latter case, the satisfying assignment is displayed to the user. The assignment constitutes a counterexample to the original (unnegated) conjecture. When the solver returns “unsatisfiable” however, things are more complicated. If we have confidence in the SAT solver, we can simply trust its result and accept φ as a theorem in Isabelle. The theorem is tagged with an “oracle” flag to indicate that it was proved not through Isabelle’s own inference rules, but by an external tool. In this scenario, a bug in the SAT solver could potentially allow us to derive inconsistent theorems in Isabelle/HOL. The LCF-approach instead demands that we verify the solver’s claim of unsatisfia- bility within Isabelle/HOL. While this is not as simple as the validation of a satisfying assignment, the increasing complexity of SAT solvers has before raised the question of support for independent verification of their results, and in 2003 zChaff has been ex- tended by L. Zhang and S. Malik [ZM03] to generate resolution-style proofs that can be verified by an independent checker. (This issue has also been acknowledged by the an- nual SAT Competition, which has introduced a special track on certified unsat answers in 2005.) More recently, a proof-logging version of MiniSat has been released [ES06], and John Matthews has extended this version to produce human-readable proofs that are easy to parse [Mat06], similar to those produced by zChaff. Hence our main task boils down to using Isabelle/HOL as an independent checker for the resolution proofs Isabelle SAT Solver DIMACS CNF Input Preprocessing satisfiable? formula yes no Model Counterexample Proof Trace Proof Theorem reconstruction Figure 1: Isabelle – SAT System Architecture found by zChaff and MiniSat. Both solvers store their proof in a text file that is read in by Isabelle, and the individual resolution steps are replayed in Isabelle/HOL. Section 2.1 briefly describes the necessary preprocessing of the input formula, and details of the proof reconstruction are explained in Section 2.2. The overall system architecture is shown in Figure 1. 2.1 Preprocessing Isabelle/HOL offers higher-order logic (on top of Isabelle’s meta logic), whereas most SAT solvers only support formulae of propositional logic in conjunctive normal form (CNF). Therefore the (negated) input formula φ must be preprocessed before it can be passed to the solver. Two different CNF conversions are currently implemented: a naive encoding that may cause an exponential blowup of the formula, and a Tseitin-style encoding [Tse83] that may introduce (existentially quantified) auxiliary Boolean variables, cf. [Gor01]. The technical details can be found in [Web05a]. More sophisticated CNF conversions, e.g. from [NRW98], could be implemented as well. The main focus of our work however is on efficient proof reconstruction, less on transformations of the input formula: the benchmark problems used for evaluation in Section 3 are already given in CNF anyway. Note that it is not sufficient to convert φ into an equivalent formula φ∗ in CNF. Rather, we have to prove this equivalence inside Isabelle/HOL. The result is not a sin- gle formula, but a theorem of the form ! φ = φ∗ . The fact that our CNF transformation must be proof-producing leaves some potential for optimization. One could implement a non proof-producing (and therefore much faster) version of the same CNF transfor- mation, and use it for preprocessing instead. Application of the proof-producing version would then be necessary only if the SAT solver has in fact shown a formula to be un- satisfiable. The total runtime on provable formulae would increase slightly, as the CNF transformation needed to be done twice – first without, later with proofs. Preprocessing times for unprovable formulae however should improve. φ∗ is written to a file in DIMACS CNF format, and the SAT solver is invoked on this input file. 2.2 Proof Reconstruction When zChaff and MiniSat return “unsatisfiable”, they also generate a resolution-style proof of unsatisfiability and store the proof in a text file [ZM03, Mat06]. While the precise format of this file differs between the solvers, the essential proof technique is the same. Both SAT solvers use propositional resolution to derive new clauses from existing ones: P ∨x Q ∨ ¬x P ∨Q It is well-known that this single inference rule is sound and complete for propositional logic. A set of clauses is unsatisfiable iff the empty clause is derivable via resolution. (For the purpose of proof reconstruction, we are only interested in the proof returned by the SAT solver, not in the techniques and heuristics that the solver uses internally to find this proof. Therefore the integration of zChaff and MiniSat is quite similar – minor differences in their proof trace format aside –, and further SAT solvers capable of generating resolution-style proofs could be integrated with Isabelle in the exact same manner.) We assign a unique identifier – simply a non-negative integer, starting with 0 – to each clause of the original CNF formula. Further clauses derived by resolution are assigned identifiers by the solver. Often we are not interested in the clause obtained by resolving just two existing clauses, but only in the result of a whole resolution chain, where two clauses are resolved, the result is resolved with yet another clause, and so on. Consequently, we define an ML [MTHM97] type of propositional resolution proofs as a pair whose first component is a table mapping integers (to be interpreted as the identifiers of clauses derived by resolution) to lists of integers (to be interpreted as the identifiers of previously derived clauses that are part of the defining resolution chain). The second component of the proof is just the identifier of the empty clause. type proof = int list Inttab.table * int This type is merely intended as an internal format to store the information contained in a resolution proof. There are many restrictions on valid proofs that are not enforced by this type. For example, it does not ensure that its second component indeed denotes the empty clause, that every resolution step is legal, or that there are no circular depen- dencies between derived clauses. It is only important that every resolution proof can be represented as a value of type proof, not conversely. The proof returned by zChaff or MiniSat is translated into this internal format, and passed to the actual proof recon- struction algorithm. This algorithm will either generate an Isabelle/HOL theorem, or fail in case the proof is invalid (which should not happen unless the SAT solver contains a bug). 2.2.1 zChaff ’s Proof Trace Format The format of the proof trace generated by zChaff has not been documented before. Therefore a detailed description of the format and its interpretation, although not the main focus of this paper, seems in order. The proof file generated by zChaff consists of three sections, the first two of which are optional (but present in any non-trivial proof). The first section defines clauses derived from the original problem by resolution. A typical line would be “CL: 7 <= 2 3 0”, meaning that a new clause was derived by resolving clauses 2 and 3, and resolving the result with clause 0. In this example, the new clause is assigned the identifier 7, which may then be used in further lines of the proof file. Clauses of the original CNF formula are implicitly assigned identifiers starting from 0, in the order they are given in the DIMACS file. When converting zChaff’s proof into our internal format, the clause identifiers in a CL line can immediately be added to the table which constitutes the proof’s first component, with the new identifier as the key, and the list of resolvents as the associated value. The second section of the proof file records variable assignments that are implied by the first section, and by other variable assignments. As an example, consider “VAR: 3 L: 2 V: 0 A: 1 Lits: 4 7”. This line states that variable 3 must be false (i.e. its value must be 0; zChaff uses “V: 1” for variables that must be true) at decision level 2, the antecedent being clause 1. The antecedent is a clause in which every literal except for the one containing the assigned variable must evaluate to false because of variable assignments at lower decision levels (or because the antecedent is already a unit clause). The antecedent’s literals are given explicitly by zChaff, using an encoding that multiplies each variable by 2 and adds 1 for negative literals. Hence “Lits: 4 7” corresponds to the clause x2 ∨ ¬x3 . Our internal proof format does not allow to record variable assignments directly, but we can translate them by observing that they correspond to unit clauses. For each variable assignment in zChaff’s trace, a new clause identifier is generated (using the number of clauses derived in the first section as a basis, and the variable itself as offset) and added as a key to the proof’s table. The associated list of resolvents contains the antecedent, and is otherwise obtained from the explicitly given literals: for each literal’s variable (except for the one that is being assigned), a similar unit clause must have been added to the table before; its identifier computed according to the same formula. We ignore both the value and the level information in zChaff’s trace. The former is implicit in the derived unit clause (which contains the variable either positively or negatively), and the latter is implicit in the overall proof structure. The last section of the proof file consists of just one line which specifies the conflict clause, a clause which has only false literals: e.g. “CONF: 3 == 4 6”. Literals are encoded in the same way as in the second section, so clause 3 would be x2 ∨ x3 in this case. We translate this line into our internal proof format by generating a new clause identifier i which is added to the proof’s table, with the conflict clause itself and the unit clause for each of the conflict clause’s variables as associated resolvents. Finally, we set the proof’s second component to i. 2.2.2 MiniSat’s Proof Trace Format The proof-logging version of MiniSat originally generated proof traces in a rather com- pact (and again undocumented) binary format, for which we have not implemented a parser. John Matthews [Mat06] however has extended this version with the ability to produce readable proof traces in ASCII format, similar to those produced by zChaff. We describe the precise proof trace format, and its translation into our internal proof format. MiniSat’s proof traces, unlike zChaff’s, are not divided into sections. They contain four different types of statements: “R” to reference original clauses, “C” for clauses derived via resolution, “D” to delete clauses that are not needed anymore, and “X” to indicate the end of proof. Aside from “X”, which must appear exactly once and at the end of the proof trace, the other statements may appear in any number and (almost) any order. MiniSat does not implicitly assign identifiers to clauses in the original CNF formula. Instead, “R” statements, e.g. “R 0 <= -1 3 4”, are used to establish clause identifiers. This particular line introduces a clause identifier 0 for the clause ¬x1 ∨x3 ∨x4 , which must have been one of the original clauses in this example. (Note that MiniSat, unlike zChaff, uses the DIMACS encoding of literals in its proof trace.) Since our internal proof format uses different identifiers for the original clauses, the translation of MiniSat’s proof trace into the internal format becomes parameterized by a renaming R of clause identifiers. An “R” statement does not affect the proof itself, but it extends the renaming. The given literals are used to look up the identifier of the corresponding original clause, and the clause identifier introduced by the “R” statement is mapped to the clause’s original (internal) identifier. New clauses are derived from existing clauses via resolution chains. A typical line would be “C 7 <= 2 5 3 4 0”, meaning that a new clause with identifier 7 was derived by resolving clauses 2 and 3 (with x5 as the pivot variable), and resolving the result with clause 0 (with x4 as the pivot variable). In zChaff’s notation, this would correspond to “CL: 7 <= 2 3 0”. We add this line to the proof’s table just like for zChaff, but with one difference: MiniSat’s clause identifiers cannot be used directly. Instead, we generate a new internal clause identifier for this line, extend the renaming R by mapping MiniSat’s clause identifier (7 in this example) to the newly generated identifier, and apply R to the identifiers of resolvents as well. Clauses that are not needed anymore can be indicated by a “D” statement, followed by a clause identifier. Currently such statements are ignored. Making beneficial use of them would require not only a modified proof format, but also a different algorithm for proof reconstruction. Finally a line like “X 0 42” indicates the end of proof. The numbers are the mini- mum and maximum, respectively, identifiers of clauses used in the proof. We ignore the first identifier (which is usually 0 anyway), and use the second identifier, mapped from MiniSat’s identifier scheme to our internal one by applying R, as the identifier of the empty clause, i.e. as the proof’s second component. There is one significant difference between MiniSat’s and zChaff’s proof traces that should have become apparent from the foregoing description. MiniSat, unlike zChaff, records the pivot variable for each resolution step in its trace, i.e. the variable that occurs positively in one clause partaking in the resolution, and negatively in the other. This information is redundant, as the pivot variable can always be determined from those two clauses: If two clauses containing more than one variable both positively and negatively were to be resolved, the resulting clause would be tautological, i.e. contain a variable and its negation. Both zChaff and MiniSat are smart enough not to derive such tautological clauses in the first place. We have decided to ignore the pivot information in MiniSat’s traces, since proof reconstruction for zChaff requires the pivot variable to be determined anyway, and using MiniSat’s pivot data would need a modified internal proof format. This however leaves some potential for optimization wrt. replaying MiniSat proofs. 2.2.3 Proof Reconstruction We now come to the core of this paper. The task of proof reconstruction is to derive False from the original clauses, using information from a value of type proof (which represents a resolution proof found by a SAT solver). This can be done in various ways. In particular the precise representation of the problem as an Isabelle/HOL theorem (or a collection of Isabelle/HOL theorems) turns out to be crucial for performance. Naive HOL Representation In an early implementation [Web05a], the whole prob- lem was represented as a single theorem ! (φ∗ =⇒ False) =⇒ (φ∗ =⇒ False), where φ∗ was completely encoded in HOL as a conjunction of disjunctions. Step by step, this theorem was then modified to reduce the antecedent φ∗ =⇒ False to True, which would eventually prove ! φ∗ =⇒ False. This was extremely inefficient for two reasons. First, every resolution step required manipulation of the whole (possibly huge) problem at once. Second, and just as im- portant, SAT solvers treat clauses as sets of literals, making implicit use of associa- tivity, commutativity and idempotence of disjunction. Likewise, CNF formulae are treated as sets of clauses, making implicit use of the same properties for conjunction. The encoding in HOL however required numerous explicit rewrites (with theorems like ! (P ∨ Q) = (Q ∨ P )) to reorder clauses and literals before each resolution step. Separate Clauses Representation A better representation of the CNF formula was discussed in [FMM+ 06]. In order to understand it, we need to look at the ML datatype of theorems that Isabelle uses internally. Every theorem encodes a sequent Γ ! φ, where φ is a single formula, and Γ is a finite set of formulae (implemented as an ordered list of terms, although this detail doesn’t matter to us). The intended interpretation is that φ holds when every formula in Γ is assumed as a hypothesis. So far we have only considered theorems where Γ = ∅, written ! φ for short. This was motivated by the normal user-level view on theorems in Isabelle, where assumptions are encoded using implications =⇒ , rather than hypotheses. Isabelle’s inference kernel however provides rules that let us convert between hypotheses and implications as we like: Γ!ψ Γ ! φ =⇒ ψ Γ" ! φ Assume impI impE {φ} ! φ Γ \ φ ! φ =⇒ ψ Γ ∪ Γ" ! ψ Let us use [[A1 ; . . . ; An ]] =⇒ B as a short hand for A1 =⇒ . . . =⇒ An =⇒ B (with implication associating to the right). In [FMM+ 06], each clause p1 ∨ . . . ∨ pn is encoded as an implication p1 =⇒ . . . =⇒ pn =⇒ False (where pi denotes the negation normal form of ¬pi , for 1 ≤ i ≤ n), and turned into a separate theorem {p1 ∨ . . . ∨ pn } ! [[p1 ; . . . ; pn ]] =⇒ False. This allows resolution to operate on comparatively small objects, and resolving two clauses Γ ! [[p1 ; . . . ; pn ]] =⇒ False and Γ" ! [[q1 ; . . . ; qm ]] =⇒ False, where ¬pi = qj for some i and j, essentially becomes an application of the cut rule. The first clause is rewritten to Γ ! [[p1 ; . . . ; pi−1 ; pi+1 ; . . . ; pn ]] =⇒ ¬pi . A derived Isabelle tactic then performs the cut to obtain Γ ∪ Γ" ! [[q1 ; . . . ; qj−1 ; p1 ; . . . ; pi−1 ; pi+1 ; . . . ; pn ; qj+1 ; . . . ; qm ]] =⇒ False from the two clauses. Note that this representation, while breaking apart the given clauses into separate theorems allows us to view the CNF formula as a set of clauses, still does not allow us to view each individual clause as a set of literals. Some reordering of literals is necessary before cuts can be performed, and after each cut, duplicate literals have to be removed from the result. Sequent Representation We can further exploit the fact that Isabelle’s inference kernel treats a theorem’s hypotheses as a set of formulae, by encoding each clause using hypotheses only. Consider the following representation of a clause p1 ∨ . . . ∨ pn as an Isabelle/HOL theorem: {p1 ∨ . . . ∨ pn , p1 , . . . , pn } ! False. Resolving two clauses p1 ∨ . . . ∨ pn and q1 ∨ . . . ∨ qm , where ¬pi = qj , now starts with two applications of the impI rule to obtain theorems {p1 ∨ . . . ∨ pn , p1 , . . . , pi−1 , pi+1 , . . . , pn } ! ¬pi =⇒ False and {q1 ∨ . . . ∨ qm , q1 , . . . , qj−1 , qj+1 , . . . , qm } ! pi =⇒ False. We then instantiate a previously proven theorem ! (P =⇒ False) =⇒ (¬P =⇒ False) =⇒ False (where P is an arbitrary proposition) with pi for P . Instantiation is another basic operation provided by Isabelle’s inference kernel. Finally two applications of impE yield {p1 ∨. . .∨pn , p1 , . . . , pi−1 , pi+1 , . . . , pn }∪{q1 ∨. . .∨qm , q1 , . . . , qj−1 , qj+1 , . . . , qm } ! False. This approach requires no explicit reordering of literals anymore. Furthermore, duplicate literals do not need to be eliminated after resolution. This is all handled by the inference kernel now; the sequent representation is as close to a SAT solver’s view of clauses as sets of literals as possible in Isabelle. With this representation, we do not rely on derived tactics anymore to perform resolution, but we can give a precise description of the implementation in terms of (five, as we see above) applications of core inference rules. CNF Sequent Representation The sequent representation has the disadvantage that each clause contains itself as a hypothesis. Since hypotheses are accumulated during resolution, this leads to larger and larger sets of hypotheses, which will eventually contain every clause used in the resolution proof. Forming the union of these sets takes the kernel a significant amount of time. It is therefore faster to use a slightly different clause representation, !k where each clause contains the whole CNF formula φ ∗ as a hypothesis. Let φ ≡ i=1 Ci , where k is the number of clauses. Using the Assume ∗ ! ! rule, we obtain a theorem { ki=1 Ci } ! ki=1 Ci . Repeated elimination of conjunction (with the help of ! two theorems, namely!! P ∧ Q =⇒ P and ! P ∧ Q =⇒ Q) yields a list of theorems { ki=1 Ci } ! C1 , . . . , { ki=1 Ci } ! Ck . Each of these theorems is then converted into the sequent form described above, with literals as hypotheses and False as the theorem’s conclusion. This representation increases preprocessing times slightly, but !k throughout the entire proof, the set of hypotheses for each clause now consists of i=1 Ci and the clause’s literals only. It is therefore much smaller than before, which speeds !k up resolution. Furthermore, memory requirements do not increase: the term i=1 Ci needs to be kept in memory only once, and can be shared between different clauses. This can also be exploited when the union of hypotheses is formed (assuming that the inference kernel and the underlying ML system support ! it): a simple pointer comparison is sufficient to determine that both theorems contain ki=1 Ci as a hypothesis (and hence that the resulting theorem needs to contain it only once); no lengthy term traversal is required. We should mention that this representation of clauses, despite its superior practical performance, has a small downside. The resulting theorem always has every given clause as a premise, while the theorem produced by the sequent representation only has those clauses as premises that were actually used in the proof. If the logically stronger theorem is needed, it can be obtained by analyzing the resolution proof to identify the used clauses beforehand, and filtering out the unused ones before proof reconstruction. We still need to determine the pivot literal (i.e. pi and ¬pi in the above example) before resolving two clauses. This could be done by directly comparing the hypotheses of the two clauses, and searching for a term that occurs both positively and negatively. It turns out to be slightly faster however (and also more robust, since we make fewer assumptions about the actual implementation of hypotheses in Isabelle) to use our own data structure. With each clause, we associate a table that maps integers – one for each literal in the clause – to the Isabelle term representation of a literal. The table is an inverse of the mapping from literals to integers that was constructed for translation into DIMACS format, but restricted to the literals that actually occur in a clause. Positive integers are mapped to positive literals (atoms), and negative integers are mapped to negative literals (negated atoms). This way term negation simply corresponds to integer negation. The table associated with the result of a resolution step is the union of the two tables that were associated with the resolvents, but with the entry for pi (¬pi , respectively) removed from the table associated with the first (second, respectively) clause. Another optimization, related not to the representation of individual clauses, but to the overall proof structure, is perhaps more obvious and has been present in our implementations since the beginning. zChaff and MiniSat, during proof search, may generate many clauses that are ultimately not needed to derive the empty clause. Instead of replaying the whole proof trace in chronological order, we perform “backwards” proof reconstruction, starting with the identifier of the empty clause, and recursively proving the required resolvents using depth-first search. While some clauses may not be needed at all, others may be used multiple times in the resolution proof. It would be highly inefficient to prove these clauses over and over again. Therefore all clauses proved are stored in an array, which is allocated at the beginning of proof reconstruction (with a size big enough to possibly hold all clauses derived during the proof). Initially, this array only contains clauses present in the original CNF formula, still in their original format as a disjunction of literals. Whenever an original clause is used as a resolvent, it is converted into the sequent format described above. (Note that this avoids converting original clauses that are not used in the proof at all.) The converted clause, along with its literal table, is stored in the array instead of the original (unconverted) clause. Each clause obtained as the result of a resolution chain is stored as well. Reusing a previously proved clause merely causes an array lookup. For this reason, it could be beneficial to analyze the resolution chains in more detail: sometimes very similar chains occur in a proof, differing only in a clause or two. Common parts of resolution chains could be stored as additional lemmas (which only need to be derived once), thereby reducing the total number of resolution steps. A detailed evaluation of this idea is beyond the scope of this paper. 2.3 A Simple Example In this section we illustrate the proof reconstruction using a small example. Consider the following input formula φ ≡ (¬x1 ∨ x2 ) ∧ (¬x2 ∨ ¬x3 ) ∧ (x1 ∨ x2 ) ∧ (¬x2 ∨ x3 ). Since φ is already in conjunctive normal form, preprocessing simply yields the theorem ! φ = φ. The corresponding DIMACS CNF file, aside from its header, contains one line for each clause in φ: -1 2 0 -2 -3 0 1 2 0 -2 3 0 zChaff and MiniSat easily detect that this problem is unsatisfiable. zChaff creates a text file with the following data: CL: 4 <= 2 0 VAR: 2 L: 0 V: 1 A: 4 Lits: 4 VAR: 3 L: 1 V: 0 A: 1 Lits: 5 7 CONF: 3 == 5 6 We see that first a new clause, with identifier 4, is derived by resolving clause 2, x1 ∨ x2 , with clause 0, ¬x1 ∨ x2 . The pivot variable which occurs both positively (in clause 2) and negatively (in clause 0) is x1 ; this variable is eliminated by resolution. x1 ∨ x2 ¬x1 ∨ x2 x1 ∨ x2 ¬x1 ∨ x2 ¬x2 ∨ x3 x2 ¬x2 ∨ ¬x3 x2 x3 ¬x3 ⊥ Figure 2: Resolution Proof found by zChaff Now the value of x2 (VAR: 2) can be deduced from clause 4 (A: 4). x2 must be true (V: 1). Clause 4 contains only one literal (Lits: 4), namely x2 (since 4 ÷ 2 = 2), occuring positively (since 4 mod 2 = 0). This decision is made at level 0 (L: 0), before any decision at higher levels. Likewise, the value of x3 can then be deduced from clause 1, ¬x2 ∨ ¬x3 . x3 must be false (V: 0). Finally clause 3 is our conflict clause. It contains two literals, ¬x2 (since 5 ÷ 2 = 2, 5 mod 2 = 1) and x3 (since 6 ÷ 2 = 3, 6 mod 2 = 0). But we already know that both literals must be false, so this clause is not satisfiable. In Isabelle, the resolution proof corresponding to zChaff’s proof trace is constructed backwards from the conflict clause. A tree-like representation of the proof is shown in Figure 2. Note that information concerning the level of decisions, the actual value of variables, or the literals that occur in a clause is redundant in the sense that it is not needed by Isabelle to validate zChaff’s proof. The clause x2 , although used twice in the proof, is derived only once during resolution (and reused the second time), saving one resolution step in this little example. The proof trace produced by MiniSat for the same problem happens to encode a different resolution proof: R 0 <= -1 2 R 1 <= -2 -3 R 2 <= 1 2 R 3 <= -2 3 C 4 <= 3 3 1 C 5 <= 0 2 4 C 6 <= 2 2 4 C 7 <= 5 1 6 X 0 7 The first four lines introduce clause identifiers for all four clauses in the original problem, in their original order as well (effectively making the renaming R from Mini- Sat’s clause identifiers to internal clause identifiers the identity in this case). The next four lines define four new clauses (one clause per line), derived by resolution. Clause 4 is the result of resolving clause 3 (¬x2 ∨ x3 ) with clause 1 (¬x2 ∨ ¬x3 ), where x3 is used as pivot literal. Hence clause 4 is equal to ¬x2 . Likewise, clause 5 is the result of resolving clauses 0 and 4, and clause 6 is obtained by resolving clauses 2 and 4. Finally resolving clauses 5 and 6 yields the empty clause, which is assigned clause identifier 7. The proof is shown in Figure 3. Again one resolution step is saved in the Isabelle implementation because clause ¬x2 is proved only once. ¬x2 ∨ x3 ¬x2 ∨ ¬x3 ¬x2 ∨ x3 ¬x2 ∨ ¬x3 ¬x1 ∨ x2 ¬x2 x1 ∨ x2 ¬x2 ¬x1 x1 ⊥ Figure 3: Resolution Proof found by MiniSat Problem Representation Proof Reconstruction (zChaff) Naive HOL 726.5 Separate Clauses 7.8 Sequent 1.2 CNF Sequent 0.5 Table 1: Runtimes (in seconds) for MSC007-1.008 3 Evaluation In [Web05a], we compared the performance of our approach, using the naive HOL problem representation, to that of Isabelle’s existing automatic proof procedures on all 42 problems contained in version 2.6.0 of the TPTP library [SS98] that have a representation in propositional logic. The problems were negated, so that unsatisfiable problems became provable. The benchmarks were run on a machine with a 3 GHz Intel Xeon CPU and 1 GB of main memory. 19 of these 42 problems are rather easy, and were solved in less than a second each by both the existing procedures and the SAT solver approach. On the remaining 23 prob- lems, zChaff proved to be clearly superior to Isabelle’s built-in proof procedures. zChaff solved all problems in less than a second, and proof reconstruction in Isabelle/HOL took a few seconds at most for all but one problem: with the naive HOL representation, the proof for problem MSC007-1.008 was reconstructed in just over 12 minutes. To give an impression of the effect that the different clause representations discussed in Section 2.2.3 have on performance, Table 1 shows the different times required to prove problem MSC007-1.008. The proof found by zChaff for this problem has 8,705 resolution steps. MiniSat finds a proof with 40,790 resolution steps for the same problem, which is reconstructed in about 3.8 seconds total with the sequent representation, and in 1.1 seconds total with the CNF sequent representation. The times to prove the other problems from the TPTP library have decreased in a similar fashion and are well below one second each now. This enables us to evaluate the performance on some significantly larger problems, taken from the SATLIB library [HS00]. These problems do not only push Isabelle’s inference kernel to its limits, but also other parts of the prover. While the smaller TPTP problems were converted to Isabelle’s input syntax by a Perl script, this approach turns out to be infeasible for the larger SATLIB problems. The Perl script still works fine, but Isabelle’s parser (which was mainly intended for small, hand-crafted terms) is unable to parse the resulting theory files, which are several megabytes large, in reasonable time. Also, the prover’s user interface is unable to display the resulting formulae. We have therefore implemented our own little parser, which builds ML terms directly from Problem Variables Clauses zChaff (s) Proof (s) Resolutions Total (s) c7552mul.miter 11282 69529 73 70 252200 145 6pipe 15800 394739 167 321 268808 512 6pipe 6 ooo 17064 545612 308 2575 870345 3179 7pipe 23910 751118 495 1132 357136 1768 Table 2: Runtimes (in seconds) for SATLIB problems, zChaff Problem Variables Clauses MiniSat (s) Proof (s) Resolutions Total (s) c7552mul.miter 11282 69529 25 49 908231 106 6pipe 15800 394739 x — — — 6pipe 6 ooo 17064 545612 x — — — 7pipe 23910 751118 x — — — Table 3: Runtimes (in seconds) for SATLIB problems, MiniSat DIMACS files, and we work entirely at the system’s ML level, avoiding the usual user interface, to prove unsatisfiability. Statistics for four unsatisfiable SATLIB problems (chosen among those that were used to evaluate zChaff’s performance in [ZM03]) are shown in Tables 2 and 3, for zChaff and MiniSat respectively. The first column shows the time in seconds that it takes the SAT solver to find a proof of unsatisfiability. The second column, “Proof”, shows the time in seconds required to replay the proof’s resolutions steps in Isabelle/HOL, using the CNF sequent representation of clauses. The third column shows the number of resolution steps performed during proof replay. The last column, “Total”, finally shows the total time to prove the problem unsatisfiable in Isabelle, including SAT solving time, proof replay, parsing of input and output files, and any other intermediate pre- and postprocessing. These timings were obtained on an AMD Athlon 64 X2 Dual Core Processor 3800+ with 4 GB of main memory. An x indicates that the solver ran out of memory, or that the proof trace file exceeded a size of 2 GB. Needless to say that none of these problems can be solved automatically by Isabelle’s built-in proof procedures. It seems that proof checking in Isabelle/HOL, despite all optimizations that we have implemented, is sometimes about an order of magnitude slower than proof verification with an external checker written in C++ [ZM03]. From Table 2 we conclude that proving unsatisfiability in Isabelle/HOL is by a factor of roughly 2 to 10 slower than using zChaff alone. This additional overhead was to be expected: it is the price that we have to pay for using Isabelle’s LCF-style kernel, which is not geared towards propositional logic. However, we also see that proof reconstruction in Isabelle scales quite well with our latest implementation, and that it remains feasible even for large SAT problems. Comparing the runtimes for problem c7552mul.miter on the proofs found by zChaff and MiniSat, we see that the time taken to reconstruct a proof does not solely depend on the number of resolutions steps. In particular our algorithm for resolving two clauses, as described in Section 2.2.3, is linear in the length (i.e. number of literals) of those clauses. The average length of a clause is about 31.0 for the MiniSat proof, and about 98.6 for the proof found by zChaff. This explains why the zChaff proof, despite its smaller number of resolution steps, takes longer to reconstruct. 4 Related Work Michael Gordon has implemented HolSatLib [Gor01], a library which is now part of the HOL 4 theorem prover. This library provides functions to convert HOL 4 terms into CNF, and to analyze them using a SAT solver. In the case of unsatisfiability however, the user only has the option to trust the external solver. No proof reconstruction takes place, “since there is no efficient way to check for unsatisfiability using pure Hol98 theorem proving” [Gor01]. A bug in the SAT solver could ultimately lead to an inconsistency in HOL 4. Perhaps closer related to our work is the integration of automated first-order provers, in the context of Isabelle recently further explored by Joe Hurd [Hur99, Hur02], Jia Meng [Men03], and Lawrence Paulson [MP04, MP06]. Proofs found by the automated system are either verified by the interactive prover immediately [Hur99], or translated into a proof script that can be executed later [MP04]. Also Andreas Meier’s TRAMP sys- tem [Mei00] transforms the output of various automated first-order provers into natural deduction proofs. The main focus of their work however is on the necessary translation from the interactive prover’s specification language to first-order logic. In contrast our approach is so far restricted to instances of propositional tautologies, but we have fo- cused on performance (rather than on difficult translation issues), and we use a SAT solver, rather than a first-order prover. Other work on combining proof and model search includes [dNM06]. A custom-built SAT solver has been integrated with the CVC Lite system [BB04] by Clark Barrett et al. [BBD03]. While this solver produces proofs that can be checked independently, our work shows that it is possible to integrate existing, highly efficient solvers with an LCF-style prover: the information provided by recent versions of zChaff and MiniSat is sufficient to produce a proof object in a theorem prover, no custom-built solver is necessary. An earlier version of this work was presented in [Web05a], and improved by Alwen Tiu et al. [FMM+ 06]. Furthermore Hasan Amjad [Amj06b] has recently integrated a proof-generating version of the MiniSat solver with HOL 4 in a similar fashion. In this paper we have discussed our most recent implementation, which is based on a novel clause representation and constitutes a significant performance improvement when compared to earlier work. 5 Conclusions and Future Work The SAT solver approach dramatically outperforms the automatic proof procedures that were previously available in Isabelle/HOL. With the help of zChaff or MiniSat, many formulae that were previously out of the scope of Isabelle’s built-in tactics can now be proved – or refuted – automatically, often within seconds. Isabelle’s applicability as a tool for formal verification, where large propositional problems occur in practice, has thereby improved considerably. Furthermore, using the data structures and optimizations described in this paper, proof reconstruction for propositional logic scales quite well even to large SAT problems and proofs with several hundred thousand resolution steps. The additional confidence gained by using an LCF-style prover to check the proof obviously comes at a price (in terms of runtime), but it’s not nearly as expensive as one might have expected after earlier implementations. While improving the performance of our implementation, we confirmed an almost self-evident truth: use profiling to see which functions take a lot of time, and focus on improving them – this is were the greatest benefits lie. This was an iterative process. A better implementation would allow us to tackle larger SAT problems, which in turn would uncover new performance bottlenecks. More importantly, we discovered some inefficiencies in the implementation of the Isabelle kernel. (Instantiating a theorem with a term, for example, was linear in the size of the term, rather than in constant time.) These inefficiencies played no important role as long as the kernel only had to deal with relatively small terms, but in our context, where formulae sometimes consist of millions of literals, they turned out to have a negative impact on performance. Subsequently the kernel implementation was modified, and these inefficiencies were removed. Tuning an implementation to the extend presented here requires a great deal of familiarity with the underlying theorem prover. Nevertheless our results are applica- ble beyond Isabelle/HOL. Other interactive provers for higher-order logic, e.g. HOL 4 and HOL-Light, use very similar data structures to represent their theorems. Hasan Amjad has confirmed that the CNF sequent representation works equally well in these provers [Amj06b]. We have already mentioned some possible directions for future work. There is prob- ably not very much potential left to optimize the implementation of resolution itself at this point. However, to further improve the performance of proof reconstruction, it could be beneficial to analyze the resolution proof found by the SAT solver in more detail. Merging similar resolution chains may reduce the overall number of resolutions required, and re-sorting resolutions may help to derive shorter clauses during the proof, which should improve the performance of individual resolution steps. Some preliminary results along these lines are reported in [Amj06a]. The approach presented in this paper has applications beyond propositional reason- ing. The decision problem for richer logics (or fragments thereof) can be reduced to SAT [ABC+ 02, Str02, MS05, RH06]. Consequently, proof reconstruction for proposi- tional logic can serve as a foundation for proof reconstruction for other logics. Based on our work, only a proof-generating implementation of the reduction is needed to inte- grate the more powerful, yet SAT-based decision procedure with an LCF-style theorem prover. This has already been used to integrate haRVey, a Satisfiability Modulo Theo- ries (SMT) prover, with Isabelle [Hur06]. haRVey, like other SMT systems, uses various decision procedures (e.g. congruence closure for uninterpreted functions) on top of a SAT solver. Acknowledgments. The author would like to thank Markus Wenzel for several good ideas and his extensive help with tuning the implementation, and Hasan Amjad for more good ideas and the challenge that he posed. The clause representations used in this paper were suggested to me by various people, including Hasan Amjad, John Harrison, John Matthews, and Markus Wenzel. References [ABC+ 02] G. Audemard, P. Bertoli, A. Cimatti, A. Kornilowicz, and R. Sebastiani. A SAT based approach for solving formulas over Boolean and linear math- ematical propositions. In Andrei Voronkov, editor, Proceedings of the 18th International Conference on Automated Deduction (CADE-18), vol- ume 2392 of Lecture Notes in Artificial Intelligence, pages 195–210, Copen- hagen, Denmark, July 2002. Springer. [Amj06a] Hasan Amjad. Compressing propositional refutations. In Stephan Merz and Tobias Nipkow, editors, Sixth International Workshop on Automated Verification of Critical Systems (AVOCS ’06) – Preliminary Proceedings, pages 7–18, 2006. [Amj06b] Hasan Amjad. A HOL/MiniSat interface. Personal communication, Septem- ber 2006. [BB04] Clark Barrett and Sergey Berezin. CVC Lite: A new implementation of the cooperating validity checker. In Proceedings of the 16th International Conference on Computer Aided Verification (CAV 2004), Boston, Mas- sachusetts, USA, July 2004. [BBD03] Clark Barrett, Sergey Berezin, and David L. Dill. A proof-producing Boolean search engine. In Proceedings of the Workshop on Pragmatics of Decision Procedures in Automated Reasoning (PDPAR 2003), Miami, Florida, USA, July 2003. [DIM93] DIMACS satisfiability suggested format, 1993. Available online at ftp: //dimacs.rutgers.edu/pub/challenge/satisfiability/doc. [dNM06] Hans de Nivelle and Jia Meng. Geometric resolution: A proof procedure based on finite model search. In Ulrich Furbach and Natarajan Shankar, editors, Automated Reasoning – Third International Joint Conference, IJ- CAR 2006, Seattle, WA, USA, August 2006, Proceedings, volume 4130 of Lecture Notes in Artificial Intelligence, pages 303–317, 2006. [ES04] Niklas Eén and Niklas Sörensson. An extensible SAT-solver. In En- rico Giunchiglia and Armando Tacchella, editors, Theory and Applications of Satisfiability Testing, 6th International Conference, SAT 2003. Santa Margherita Ligure, Italy, May 5-8, 2003 Selected Revised Papers, volume 2919 of Lecture Notes in Computer Science, pages 502–518. Springer, 2004. [ES06] Niklas Eén and Niklas Sörensson. MiniSat-p-v1.14 – A proof-logging ver- sion of MiniSat, September 2006. Available online at http://www.cs. chalmers.se/Cs/Research/FormalMethods/MiniSat/. [FMM+ 06] Pascal Fontaine, Jean-Yves Marion, Stephan Merz, Leonor Prensa Nieto, and Alwen Tiu. Expressiveness + automation + soundness: Towards com- bining SMT solvers and interactive proof assistants. In Holger Hermanns and Jens Palsberg, editors, Tools and Algorithms for the Construction and Analysis of Systems, 12th International Conference, TACAS 2006 Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2006, Vienna, Austria, March 25 - April 2, 2006, Proceedings, vol- ume 3920 of Lecture Notes in Computer Science, pages 167–181. Springer, 2006. [GM93] M. J. C. Gordon and T. F. Melham, editors. Introduction to HOL: A theorem proving environment for higher order logic. Cambridge University Press, 1993. [Gor00] M. J. C. Gordon. From LCF to HOL: A short history. In G. Plotkin, Colin P. Stirling, and Mads Tofte, editors, Proof, Language, and Interaction. MIT Press, 2000. [Gor01] M. J. C. Gordon. HolSatLib documentation, version 1.0b, June 2001. Avail- able online at http://www.cl.cam.ac.uk/∼mjcg/HolSatLib/HolSatLib. html. [HS00] Holger H. Hoos and Thomas Stützle. SATLIB: An online resource for re- search on SAT. In Ian Gent, Hans van Maaren, and Toby Walsh, ed- itors, SAT 2000, pages 283–292. IOS Press, 2000. Available online at http://www.satlib.org/. [Hur99] Joe Hurd. Integrating Gandalf and HOL. In Yves Bertot, Gilles Dowek, André Hirschowitz, Christine Paulin, and Laurent Théry, editors, Theorem Proving in Higher Order Logics, 12th International Conference, TPHOLs ’99, volume 1690 of Lecture Notes in Computer Science, pages 311–321, Nice, France, September 1999. Springer. [Hur02] Joe Hurd. An LCF-style interface between HOL and first-order logic. In Andrei Voronkov, editor, Proceedings of the 18th International Conference on Automated Deduction (CADE-18), volume 2392 of Lecture Notes in Artificial Intelligence, pages 134–138, Copenhagen, Denmark, July 2002. Springer. [Hur06] Clément Hurlin. Proof reconstruction for first-order logic and set-theoretical constructions. In Stephan Merz and Tobias Nipkow, editors, Sixth Inter- national Workshop on Automated Verification of Critical Systems (AVOCS ’06) – Preliminary Proceedings, pages 157–162, 2006. [KKS91] R. Kumar, T. Kropf, and K. Schneider. Integrating a first-order automatic prover in the HOL environment. In M. Archer, J. J. Joyce, K. N. Levitt, and P. J. Windley, editors, Proceedings of the 1991 International Workshop on the HOL Theorem Proving System and its Applications, pages 170–176, Davis, California, USA, August 1991. IEEE Computer Society Press, 1992. [Mat06] John Matthews. ASCII proof traces for MiniSat. Personal communication, August 2006. [Mei00] Andreas Meier. TRAMP: Transformation of machine-found proofs into natural deduction proofs at the assertion level. In David A. McAllester, ed- itor, Automated Deduction – CADE-17, 17th International Conference on Automated Deduction, Pittsburgh, PA, USA, June 17-20, 2000, Proceed- ings, volume 1831 of Lecture Notes in Artificial Intelligence, pages 460–464. Springer, 2000. [Men03] Jia Meng. Integration of interactive and automatic provers. In Manuel Carro and Jesus Correas, editors, Second CologNet Workshop on Implementation Technology for Computational Logic Systems, FME 2003, September 2003. [MMZ+ 01] M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff: Engi- neering an efficient SAT solver. In Proceedings of the 38th Design Automa- tion Conference, Las Vegas, June 2001. [MP04] Jia Meng and Lawrence C. Paulson. Experiments on supporting interac- tive proof using resolution. In David Basin and Michaël Rusinowitch, edi- tors, Automated Reasoning: Second International Joint Conference, IJCAR 2004, Cork, Ireland, July 4-8, 2004, Proceedings, volume 3097 of Lecture Notes in Artificial Intelligence, pages 372–384. Springer, 2004. [MP06] Jia Meng and Lawrence C. Paulson. Translating higher-order problems to first-order clauses. In Geoff Sutcliffe, Renate Schmidt, and Stephan Schulz, editors, ESCoR: Empirically Successful Computerized Reasoning, volume 192 of CEUR Workshop Proceedings, pages 70–80, 2006. [MS05] Andreas Meier and Volker Sorge. Applying SAT solving in classification of finite algebras. Journal of Automated Reasoning, 35(1–3):201–235, October 2005. [MTHM97] Robin Milner, Mads Tofte, Robert Harper, and David MacQueen. The Definition of Standard ML - Revised. MIT Press, May 1997. [NPW02] Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. Isabelle/HOL – A Proof Assistant for Higher-Order Logic, volume 2283 of Lecture Notes in Computer Science. Springer, 2002. [NRW98] Andreas Nonnengart, Georg Rock, and Christoph Weidenbach. On generat- ing small clause normal forms. In Claude Kirchner and Hélène Kirchner, ed- itors, Automated Deduction – CADE-15, 15th International Conference on Automated Deduction, Lindau, Germany, July 5-10, 1998, Proceedings, vol- ume 1421 of Lecture Notes in Computer Science, pages 397–411. Springer, 1998. [ORS92] S. Owre, J. M. Rushby, and N. Shankar. PVS: A prototype verification sys- tem. In Deepak Kapur, editor, 11th International Conference on Automated Deduction (CADE), volume 607 of Lecture Notes in Artificial Intelligence, pages 748–752, Saratoga, NY, June 1992. Springer. [Pau94] Lawrence C. Paulson. Isabelle: A Generic Theorem Prover, volume 828 of Lecture Notes in Computer Science. Springer, 1994. [RH06] Erik Reeber and Warren A. Hunt, Jr. A SAT-based decision procedure for the subclass of unrollable list formulas in ACL2 (SULFA). In Ulrich Furbach and Natarajan Shankar, editors, Automated Reasoning – Third International Joint Conference, IJCAR 2006, Seattle, WA, USA, August 2006, Proceedings, volume 4130 of Lecture Notes in Artificial Intelligence, pages 453–467, 2006. [Sha01] Natarajan Shankar. Using decision procedures with a higher-order logic. In Richard J. Boulton and Paul B. Jackson, editors, Theorem Proving in Higher Order Logics, 14th International Conference, TPHOLs 2001, Ed- inburgh, Scotland, UK, September 3-6, 2001, Proceedings, volume 2152 of Lecture Notes in Computer Science, pages 5–26. Springer, 2001. [SS98] Geoff Sutcliffe and Christian Suttner. The TPTP problem library: CNF release v1.2.1. Journal of Automated Reasoning, 21(2):177–203, 1998. Avail- able online at http://www.cs.miami.edu/∼tptp/. [Str02] Ofer Strichman. On solving Presburger and linear arithmetic with SAT. In M. D. Aagaard and J. W. O’Leary, editors, Formal Methods in Computer- Aided Design: 4th International Conference, FMCAD 2002, Portland, OR, USA, November 6-8, 2002, Proceedings, volume 2517 of Lecture Notes in Computer Science, pages 160–169. Springer, 2002. [Tse83] G. S. Tseitin. On the complexity of derivation in propositional calculus. In J. Siekmann and G. Wrightson, editors, Automation Of Reasoning: Clas- sical Papers On Computational Logic, Vol. II, 1967-1970, pages 466–483. Springer, 1983. Also in Structures in Constructive Mathematics and Math- ematical Logic Part II, ed. A. O. Slisenko, 1968, pp. 115–125. [Web05a] Tjark Weber. Integrating a SAT solver with an LCF-style theorem prover. In A. Armando and A. Cimatti, editors, Proceedings of PDPAR’05 – Third International Workshop on Pragmatical Aspects of Decision Procedures in Automated Reasoning, Edinburgh, UK, July 2005. [Web05b] Tjark Weber. Using a SAT solver as a fast decision procedure for proposi- tional logic in an LCF-style theorem prover. In Joe Hurd, Edward Smith, and Ashish Darbari, editors, Theorem Proving in Higher Order Logics – 18th International Conference, TPHOLs 2005, Oxford, UK, August 2005, Emerging Trends Proceedings, pages 180–189, Oxford, UK, August 2005. Oxford University Computing Laboratory, Programming Research Group. Research Report PRG-RR-05-02. [ZM03] Lintao Zhang and Sharad Malik. Validating SAT solvers using an indepen- dent resolution-based checker: Practical implementations and other appli- cations. In Design, Automation and Test in Europe (DATE 2003), pages 10880–10885. IEEE Computer Society, 2003.