=Paper=
{{Paper
|id=Vol-192/paper-4
|storemode=property
|title=Lightweight Relevance Filtering for Machine-Generated Resolution Problems
|pdfUrl=https://ceur-ws.org/Vol-192/paper04.pdf
|volume=Vol-192
}}
==Lightweight Relevance Filtering for Machine-Generated Resolution Problems==
Jia Meng¹, Lawrence C. Paulson²
¹ National ICT Australia, jiameng@nicta.com.au
² Computer Laboratory, University of Cambridge, U.K., LP15@cam.ac.uk
Abstract
Irrelevant clauses in resolution problems increase the search space, making it
hard to find proofs in a reasonable time. Simple relevance filtering methods, based
on counting function symbols in clauses, improve the success rate for a variety of
automatic theorem provers and with various initial settings. We have designed
these techniques as part of a project to link automatic theorem provers to the
interactive theorem prover Isabelle. They should be applicable to other situations
where the resolution problems are produced mechanically and where completeness
is less important than achieving a high success rate with limited processor time.
1 Introduction
Resolution-based automatic theorem provers (ATPs) are complete for first-order logic:
given unlimited resources, they will find a proof if one exists. Unfortunately, we seldom
have unlimited resources. Removing unnecessary axioms from the problem reduces the
search space and thus the effort needed to obtain proofs. Identifying unnecessary axioms
while retaining completeness appears to be as difficult as proving the theorem in the first
place. In general, it is hard to see how we can know that an axiom is redundant except
by finding a proof without using it. Syntactic criteria may be helpful when the amount
of redundancy is extreme, especially if we can relax the requirement for completeness.
The relevance problem dates back to the earliest days of resolution. As first defined
by Robinson [Rob65], a literal is pure if it is not unifiable with a complementary literal
in any other clause. Clauses containing pure literals can be removed without affecting
consistency. This process is a form of relevance test, since pure literals often indicate
that the axioms describe predicates not mentioned in the (negated) conjecture. It can
be effective, but it is not a full solution.
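As a concrete illustration, here is a minimal Python sketch of pure-literal elimination. It approximates Robinson's unifiability test by comparing only predicate symbols and signs, and the clause representation (a clause as a frozenset of signed predicate symbols) is our own assumption, made purely for illustration.

from typing import FrozenSet, Set, Tuple

Literal = Tuple[bool, str]        # (sign, predicate symbol)
Clause = FrozenSet[Literal]       # a clause is a set of literals

def purge_pure_literal_clauses(clauses: Set[Clause]) -> Set[Clause]:
    # A literal is treated as pure when no clause contains the same predicate
    # symbol with the opposite sign; this is coarser than unifiability but
    # suffices as an illustration.  Each deletion may enable further deletions,
    # so iterate to a fixpoint.
    clauses = set(clauses)
    changed = True
    while changed:
        changed = False
        polarities = {lit for clause in clauses for lit in clause}
        for clause in list(clauses):
            if any((not sign, pred) not in polarities for (sign, pred) in clause):
                clauses.remove(clause)        # the clause contains a pure literal
                changed = True
    return clauses

# The axiom {r, s} mentions predicates that never occur negatively elsewhere,
# so it is deleted; the other three clauses survive.
example = {
    frozenset({(True, "p")}),
    frozenset({(False, "p"), (True, "q")}),
    frozenset({(False, "q")}),
    frozenset({(True, "r"), (True, "s")}),
}
assert len(purge_pure_literal_clauses(example)) == 3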
If a resolution theorem prover is invoked by another reasoning tool, then the problems
it receives will have been produced mechanically. Machine-generated problems may
contain thousands of clauses, each containing large terms. Many ATPs are not designed
to cope with such problems. Traditionally, the ATP user prepares a mathematical
problem with the greatest of care, and is willing to spend weeks running proof attempts,
adjusting weights and refining settings until the problem is solved. Machine-generated
problems are not merely huge but may be presented to the automatic prover by the
dozen, with its heuristics set to their defaults and with a small time limit.
Our integration of Isabelle with ATPs [MQP06] generates small or large problems,
depending on whether or not rewrite rules are included. A small problem occupies 200
kilobytes and comprises over 1300 clauses; a large problem occupies about 450 kilobytes
and comprises approximately 2500 clauses. Even our small problems look rather large
to a resolution prover. We have run extensive tests with a set of 285 such problems (153
small, 132 large). Vampire does well on our problem set, if it is given 300 seconds per
problem: it can prove 86 percent of them. Unfortunately, given 30 seconds per problem,
which is more realistic for user interaction, Vampire’s success rate drops to 53 percent.
(See Fig. 7 in Sect. 5.) Can a cheap, simple relevance filter improve the success rate for
short runtimes? Completeness does not matter to us: we are happy to forgo solutions
to some problems provided we obtain a high success rate.
In the course of our investigations, we found that many obvious ideas were incorrect.
For example, ATPs generate hundreds of thousands of clauses during their operation,
so an extra fifty clauses at the start should not do any harm. However, they do.
Paper outline. We begin with background material on Isabelle, ATPs, and our
link-up between the two, mentioning related work on other such link-ups (Sect. 2). We
describe our initial attempts to improve the success rate of our link-up (Sect. 3), and then
describe a family of related relevance filters (Sect. 4). We proceed to our experimental
results, illustrating the benefits of filtering through a series of graphs (Sect. 5). Finally,
we present brief conclusions (Sect. 6).
2 Background
Resolution theorem provers work by deriving a contradiction from a supplied set of
clauses [BG01]. Each clause is a disjunction of literals (atomic formulae and their nega-
tions) and the set of clauses is interpreted as a conjunction. Clause form can be difficult
to read, and the proofs that are found tend to be unintuitive, but there is no denying that
these provers are powerful. In the sequel we refer to them as automatic theorem provers
or ATPs. (This term includes clausal tableau provers, but not SAT solvers, decision
procedures etc.) Our experiments mainly use E versions 0.9 and 0.91dev001 [Sch04],
SPASS V2.2 [Wei01] and Vampire 8 [RV01a].¹
Interactive theorem provers allow proofs to be constructed by natural chains of rea-
soning, generally in a rich formalism such as higher-order logic, but their automation
tends to be limited to rewriting and arithmetic. Quantifier reasoning tends to be weak:
many interactive systems cannot even prove a simple theorem like ∃x ∀y P(x, y) →
∀y ∃x P(x, y) automatically. Developers of interactive tools would naturally like to har-
ness the power of ATPs, while preserving the intuitive reasoning style that their users
expect.
We are adding automated support to the interactive prover Isabelle. There are many
differences between our project and other related work [BHdN02, SBF+ 03]. The chief
difference is that Isabelle’s built-in automated support gives the ATPs tough competi-
¹ We downloaded Vampire from the CASC-20 website, http://www.cs.miami.edu/~tptp/CASC/20/;
it calls itself version 7.45.
tion. By typing auto, the Isabelle user causes approximately 2000 lemmas to be used as
rewriting rules and for forward and backward chaining. A related tool, blast, performs
deep searches in a fashion inspired by tableau provers. Even ten years ago, using early
predecessors of these tools, Isabelle users could automatically prove theorems like this
set equality [Pau97]:
⋃i∈I (Ai ∪ Bi) = (⋃i∈I Ai) ∪ (⋃i∈I Bi)
Set theory problems, and the combinatory logic examples presented in that paper,
remain difficult for automatic theorem provers. When we first got our link-up working,
we were disappointed to find that ATPs were seldom better than auto. We have devoted
much effort to improving its success rate. We found that bugs in our link-up were
partly to blame for the poor results. Much of our effort went to improving the problem
presentation; for example, we found a more compact representation of types. We devoted
some time to identifying prover settings to help ATPs cope with huge problems. Above
all, we have struggled to find ways to filter out irrelevant axioms.
Of previous work, the most pertinent is the integration between the Karlsruhe Inter-
active Verifier (KIV) and the tableau prover 3TAP, by Ahrendt and others [ABH+ 98].
Reif and Schellhorn [RS98] present a component of that integration: an algorithm for
removing redundant axioms. It relies on analysing the structure of the theory in which
the conjecture is posed. Specifically, their method is based on four criteria for reduction,
which they call the minimality, structure, specification and recursion criteria. We did
not feel that this method would work with Isabelle theories, which tend to be large: a
typical theory introduces many functions and definitions, and proves many theorems.
For instance, the theory of the natural numbers proves nearly 200 theorems. Also, a
typical Isabelle problem lies at the top of a huge theory hierarchy that includes the full
Isabelle/HOL development, any included library theories, and theories for the user’s
problem domain. We decided to try other methods, which are described in the sequel.
The Isabelle-ATP linkup generates problems that contain conjecture clauses along
with clauses from four other sources:
• Classical clauses arise from the theorems Isabelle uses for forward and backward
chaining.
• Simplification clauses arise from the theorems Isabelle uses as rewrite rules. They
contain equalities.
• Arity and class inclusion clauses do not correspond to Isabelle theorems. Instead,
they express aspects of Isabelle’s type system. Isabelle’s axiomatic type classes
are sets of types that meet a given specification. For instance, the type nat of
natural numbers is a member of order, the class of partial orderings; we express
its membership as the unit clause order(nat). An arity relates the type classes of
the arguments and results of type constructors. For example, an arity clause
∀τ [type(τ) → order(list(τ))]
says that if the argument of list is a member of class type, then the resulting type of
lists belongs to class order. For more information, we refer readers to our previous
papers [MP04, MQP06].
Although the arity and class inclusion clauses typically number over one thousand,
they probably pose no difficulties for modern ATPs. They are Horn clauses that contain
only monadic (unary) predicates, which are not related by any equations. Most arity
clauses have one literal, while class inclusion clauses consist of two literals. Pure literal
elimination probably suffices to remove redundant arity and class inclusion clauses, so
ATPs may delete most of them immediately.
3 Initial Experiments
We have tuned the Isabelle-ATP link-up with the help of a set of problems in clause
form. We obtained these by modifying our link-up to save the clause form problems
it was producing. They simulate attempted calls to our system at various points in
several Isabelle proofs. The original Isabelle proofs for some of these problems require
multiple steps, explicit quantifier instantiations, or other detailed guidance. The set
now numbers 285 problems. Many have trivial proofs and the only difficulty lies in the
problem size. Other problems take a few minutes to prove, and a few remain unsolved.
Our first task was to verify that the problems were solvable. If a problem could
not be proved by any ATP, we could remove irrelevant clauses manually, using our
knowledge of the problem domain, in the hope of rendering it provable. This process
was too laborious to carry out for every failing problem. Over time, with the help of
the filtering techniques described below, we were able to obtain proofs for all but five of
our problems. We also uncovered problems that were incorrectly posed, lacking crucial
lemmas, and found several bugs. These ranged from the trivial (failing to notice that
the original problem already contained the empty clause) to the subtle (Skolemization
failing to take account of polymorphism).
The laborious hand-minimization can be automated. A simple idea is to note which
axioms take part in any successful proofs—call them referenced axioms—and to remove
all other axioms from the unsolved problems. We have automated this idea for the
provers Vampire and E. Both clearly identify references to axiom clauses, which they
designate by positive integers. Simple Perl scripts read the entire clause set into an array;
referenced axioms are found by subscripting and written to a new file. We thus obtain a
reduced version of the problem, containing only the clauses referenced. Repeating this
process over a directory of problems yields a new directory containing reduced versions
of each solved problem. If both Vampire and E prove a theorem, then the smaller file
is chosen. We then concatenate the solutions, removing conjecture clauses. The result
is a file containing all referenced axioms. Another Perl script intersects this file with
each member of the problem set, yielding a reduced problem set where each problem
contains only referenced axioms.
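The sketch below shows only the reduction step, in Python rather than Perl. It assumes that the referenced axiom numbers have already been parsed from the prover's output and that a problem file holds one clause per line; both assumptions are ours, made for illustration.

from pathlib import Path
from typing import Iterable, List, Set

def reduce_problem(problem: Path, referenced: Iterable[int], out: Path) -> None:
    # `referenced` holds the positive integers by which Vampire or E designate
    # the axiom clauses used in the proof; parsing them is prover-specific and
    # omitted here.  Clause numbering is assumed to match the line order.
    clauses: List[str] = problem.read_text().splitlines()
    keep: Set[int] = set(referenced)
    reduced = [c for i, c in enumerate(clauses, start=1) if i in keep]
    out.write_text("\n".join(reduced) + "\n")

def filter_by_referenced_pool(problem: Path, pool: Set[str], out: Path) -> None:
    # `pool` is the concatenation of all reduced problems (conjecture clauses
    # removed); only the clauses of `problem` that occur in the pool are kept.
    kept = [c for c in problem.read_text().splitlines() if c in pool]
    out.write_text("\n".join(kept) + "\n")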
Auto-reduction by using only referenced axioms has an obvious drawback: some
unsolved problems are likely to need some axioms that have not been referenced before.
Even so, this idea improved our success rate from about 60 percent to 80 percent. It is
not clear how to incorporate this idea into an interactive prover, since then its success on
certain problems would depend upon the previous history of proof attempts, making the
system’s behaviour hard to predict. Auto-reduction’s immediate benefit is that it yields
evidence that the original problems have proofs: a reduced problem is easily checked to
be a subset of the original problem, and with few exceptions it is easy to prove. Fewer
suspect problems require hand examination.
Using only referenced axioms does not guarantee that problems will be small. At
present, there are 405 referenced clauses. Most of these are specific to various problem
domains; approximately 150 of these correspond to standard Isabelle/HOL theorems,
and are common to all problems. A proof about protocol verification could have another
90 clauses, for a problem size of about 240. Two points are noteworthy:
1. the problems are smaller, if not exactly small, and
2. referenced clauses may be somehow more suitable for automated proof than other
clauses.
This second point suggests the concept of pathological clauses. Is it possible that
there are a few particularly bad clauses whose removal would improve the success rate?
It is difficult to test the hypothesis that the success rate is lowered by a few patho-
logical clauses. There is no reason to believe that the same clauses will turn out to be
pathological for all ATPs. Identifying pathological clauses seems to require much guess-
ing and manual inspection. Surely a pathological clause would contain highly general
literals such as X = Y, X < Y, X ∈ Y, or their negations.
We eventually blacklisted 148 theorems from the standard Isabelle/HOL library.
A theorem could be blacklisted for various reasons, such as having too big a clause
form, being too similar to other theorems, or dealing with too obscure a property. This
effort yielded only a small improvement to the success rate, probably because Isabelle’s
sets of classical and simplification rules already exclude obviously prolific facts such as
transitivity. The main benefit of this exercise was our discovery that the generated
problems included large numbers of functional reflexivity axioms: that is, axioms such
as X = Y → f(X) = f(Y). They are redundant in the presence of paramodulation;
since we only use ATPs that use that inference rule for equality reasoning, we now omit
such clauses in order to save ATPs the effort of discarding them. This aspect of our
project was in part a response to SPASS developer Thomas Hillenbrand’s insistence—in
an e-mail dated 23 July 2005—on “engineering your clause base once and forever”.
4 Signature-Based Relevance Filters
Automatic relevance filtering is clearly more attractive than any method requiring man-
ual inspection of clauses. We decided not to adopt Reif and Schellhorn’s approach
[RS98], which required analysis of the theory in which each problem was set, and in-
stead to define relevance with respect to the provided conjecture clauses. The simplest
way of doing this is to enable the Set of Support option, if it is available. Wos’s SOS
heuristic [WRC65], which dates from 1965, ensures that all inference rule applications
involve at least one clause derived from a conjecture clause. It prevents inferences among
the axioms and makes the search goal-directed. It is incomplete in the presence of the
ordering heuristics used by modern ATPs, but SPASS still offers SOS and it greatly
improves the success rate, as the graphs presented below will demonstrate.
We tried many simple relevance filters based on the conjecture clauses. We already
knew that simple methods could be effective: in an earlier experiment, we modified the
link-up to block all axiom clauses except those from a few key theories. That improved
the success rate enough to yield proofs for eight hitherto unsolved problems. Clearly if
such a crude filter could be beneficial, then something based on the conjecture clauses
could be better still. Having automatically-reduced versions of most of our problems
allows us to test the relevance filter without actually running proofs: yet more Perl
scripts compare the new problems with the reduced ones, reporting any missing axioms.
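In Python, that comparison amounts to a set difference; the sketch below assumes that clauses can be compared as whole text lines and that the directory layout is as shown, both simplifications of what our Perl scripts actually do.

from pathlib import Path
from typing import List

def missing_axioms(filtered: Path, reduced: Path) -> List[str]:
    # Clauses of the automatically-reduced problem that the relevance filter
    # discarded; an empty result means the filter kept every axiom used by
    # some known proof of this problem.
    kept = set(filtered.read_text().splitlines())
    return [c for c in reduced.read_text().splitlines() if c.strip() and c not in kept]

# Example usage over a directory of problems (layout assumed):
# for reduced in sorted(Path("reduced").glob("*.p")):
#     misses = missing_axioms(Path("filtered") / reduced.name, reduced)
#     if misses:
#         print(f"{reduced.name}: {len(misses)} referenced axioms missing")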
The abstraction-based relevancy testing approach of Fuchs and Fuchs [FF99] is
specifically designed for model elimination (or connection tableau) provers. It is not
clear how to modify this approach for use with saturation provers, which are the type
we use almost exclusively. Their approach has some curious features. Though it is
based upon very general notions, the specific abstraction they implement is a symbol
abstraction, which involves “identifying some predicate or function symbols” and form-
ing equivalence classes of clauses. We confess that we were not able to derive any ideas
from this highly mathematical paper.
4.1 Plaisted and Yahya’s Strategy
Plaisted and Yahya’s relevance restriction strategy [PY03] introduces the concept of
relevance distance between two clauses, reflecting how closely two clauses are related.
Simply put, the idea is to start with the conjecture clauses and to identify a set R1 of
clauses that contain complementary literals to at least one of the conjecture clauses.
Each clause in R1 has distance 1 to the conjecture clauses. The next round of iteration
produces another set R2 of clauses, where each of its clauses resolves with one or more
clauses in R1 ; thus clauses in R2 have distance 2 to the conjecture clauses. The iteration
repeats until all clauses that have distances less than or equal to some upper limit are
included. This is an all-or-nothing approach: a clause is either included, because it can
resolve with some already-included clause, or excluded entirely.
We found this method easy to implement (in Prolog), but unfortunately too many
clauses are included after two or three iterations. This method does not take into
account the ordering restrictions that ATPs would respect, thereby including clauses on
the basis of literals that would not be selected for resolution. Also, this strategy does
not handle equality.
Plaisted and Yahya’s strategy suggests a simple approach based on signatures. Start-
ing with the conjecture clauses, repeatedly add all other clauses that share any function
symbols with them. This method handles equality, but it again includes too many
clauses. Therefore, we have refined Plaisted and Yahya’s strategy and designed several
new algorithms that work well (Sect. 5).
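For comparison with the refined filters of the following sections, here is a minimal Python sketch of that naive signature closure; clauses are assumed to be given with a precomputed set of function symbols, a representation we adopt only for illustration.

from typing import Dict, Set

def naive_signature_closure(conjecture_symbols: Set[str],
                            axioms: Dict[str, Set[str]]) -> Set[str]:
    # Repeatedly include every axiom that shares at least one function symbol
    # with the symbols collected so far.  This is the all-or-nothing variant
    # described above; in practice it includes far too many clauses.
    relevant_symbols = set(conjecture_symbols)
    included: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for name, symbols in axioms.items():
            if name not in included and symbols & relevant_symbols:
                included.add(name)
                relevant_symbols |= symbols
                changed = True
    return included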
4.2 A Passmark-Based Algorithm
Our filtering strategies abandon the all-or-nothing approach. Instead, we use a measure
of relevance, and a clause is added to the pool of relevant clauses provided it is “suffi-
ciently close” to an existing relevant clause. If a clause mentions n functions, of which
m are relevant, then the clause receives a score (relevance mark) of m/n. The clause is
rejected unless its score exceeds a given pass mark, a real number between 0 and 1. If
a clause is accepted, all of its functions become relevant. Iterate this process until no
new clauses are accepted. To prevent too many clauses from being accepted, the test
must somehow become stricter with each iteration.
In the first filtering strategy, we attach a relevance mark to each clause and this
mark may be increased during the filtering process. The pseudo-code for our algorithm
is shown in Figure 1.
The pseudo-code should be self-explanatory; we give only a few additional comments
below.
• When the function relevant_clauses is first called, the working relevant clauses
set W contains the goal clauses, while T contains all the axiom clauses and U is
empty.
• In function update_clause_mark, |R| is the number of elements in the set R.
• Isabelle allows overloading of functions, so that for example <= can denote
the usual ordering on integers as well as subset inclusion. Therefore function
update_clause_mark regards two functions as matching only if their types match as
well.
• The multiplication by P_M in function update_clause_mark makes the relevance test
increasingly strict as the distance from the conjecture clauses increases, which
keeps the process focussed around the conjecture clauses and prevents too many
clauses from being taken as relevant.
4.3 Using the Set of Relevant Functions
We have refined the strategy above, removing the requirement that a clause be close to
one single relevant clause. It instead accumulates a pool of relevant functions, which is
used when calculating scores. This strategy is slightly simpler to implement, because
scores no longer have to be stored, and it potentially handles situations where a clause
is related to another via multiple equalities. To make the relevance test stricter on
successive iterations, we increase the pass mark after each one by the formula
p′ = p + (1 − p)/c, where c is an arbitrary convergence parameter. If c = 2,
for example, then each iteration halves the difference 1 − p. The algorithm appears in
Figure 2.
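For example, with the values eventually adopted below (p = 0.6 and c = 2.4), the recurrence drives the pass mark towards 1 quite quickly; a two-line check of p′ = p + (1 − p)/c:

p, c = 0.6, 2.4
for i in range(4):
    p = p + (1 - p) / c
    print(f"pass mark after iteration {i + 1}: {p:.3f}")
# prints approximately 0.767, 0.864, 0.921, 0.954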
Since the value of c is used to modify that of p, the optimal values of these parameters
need to be found simultaneously. We ran extensive empirical tests. It became clear that
large values of c performed poorly, so we concentrated on 1.6, 2.4 and 3.2 with a range
of pass marks. We obtained the best results with p = 0.6 and c = 2.4. These values give
a strict test that rapidly gets stricter, indicating that our problems require a drastic
reduction in size.
To illustrate these points, Figure 3 presents two graphs. They plot success rates
and problem sizes as the pass mark increases from 0.0 (all clauses accepted) to 0.9 (few
clauses accepted). Success rates are for Vampire in its default mode, allowing 40 seconds
per problem. Problem sizes refer to the average number of clauses per problem, ignoring
conjecture clauses and the clauses that formalize Isabelle’s type system. Vampire’s
success rate peaks sharply at 0.6, by which time the average problem size has decreased
from 909 to 142 clauses.
function relevant_clauses:
input: W, working relevant clauses set
T, working irrelevant clauses set
P, pass mark
U, used relevant clauses set
(Each clause is attached with a relevance mark)
output: relevant clauses to be input to ATPs
begin
while W not empty do {
for each clause-relevance-mark pair (C,M) in T do
{ update_clause_mark (W, (C,M)) }
# partition the (C,M) pairs in T into two sets:
Rel := set of pairs (C,M) in T with P ≤ M;
Irrel := T - Rel;
U := W ∪ U;
W := Rel;
T := Irrel;
}
return U;
end
function update_clause_mark:
input: W, relevant clauses set
(C,M), a clause-mark pair
effect: updates the relevance mark of C
begin
for each clause-mark (P,P_M) in W do {
PC := functions_of P;
CF := functions_of C;
R := CF ∩ PC; #where names and types agree
IR := CF − R; #remaining functions of clause C
M := max(M, P_M * |R| / (|R| + |IR|));
}
end
Figure 1: A Passmark-Based Filtering Strategy
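A compact Python rendering of the pseudo-code in Figure 1 may be easier to follow. It abstracts each clause to a frozenset of (name, type) function-symbol pairs; this representation is our own, chosen for illustration, and is not the one used in the Isabelle implementation.

from typing import Dict, FrozenSet, List, Set, Tuple

Symbol = Tuple[str, str]          # (function name, type): overloaded names must also agree on type
Clause = FrozenSet[Symbol]        # a clause abstracted to its set of function symbols

def relevant_clauses(goals: List[Clause], axioms: List[Clause],
                     pass_mark: float) -> Set[Clause]:
    # Each candidate clause carries a relevance mark that can only increase;
    # a clause is promoted once its mark reaches the pass mark.  Multiplying by
    # the mark of the promoting clause makes the test stricter with distance.
    marks: Dict[Clause, float] = {c: 0.0 for c in axioms}
    working: List[Tuple[Clause, float]] = [(g, 1.0) for g in goals]   # W
    pool: List[Clause] = list(axioms)                                 # T
    used: Set[Clause] = set()                                         # U
    while working:
        for clause in pool:
            for rel_clause, rel_mark in working:
                shared = clause & rel_clause
                score = rel_mark * len(shared) / len(clause) if clause else 0.0
                marks[clause] = max(marks[clause], score)
        promoted = [c for c in pool if marks[c] >= pass_mark]
        used.update(c for c, _ in working)
        working = [(c, marks[c]) for c in promoted]
        pool = [c for c in pool if c not in set(promoted)]
    return used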
function relevant_clauses:
input: RF, set of relevant functions
T, working irrelevant clauses set
A, relevant clauses accumulator
P, pass mark
output: relevant clauses to be input to ATPs
begin
repeat {
for each clause Ci in T do
{ Mi := clause_mark (RF,Ci ) }
Rel := set of all clauses Ci such that P ≤ Mi
T := T - Rel;
A := A ∪ Rel;
P := P + (1 - P) / c;
RF := (functions_of Rel) ∪ RF;
} until Rel is empty
return A;
end
function clause_mark:
input: RF, a set of relevant functions
C, a clause
output: the relevance mark of C
begin
CF := functions_of C;
R := CF ∩ RF; #where names and types agree
IR := CF − R; #remaining functions of clause C
return |R| / (|R| + |IR|);
end
Figure 2: An Improved Filtering Strategy Using a Set of Relevant Functions
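The same caveats apply to this Python sketch of Figure 2: clauses are abstracted to sets of (name, type) function-symbol pairs, and the initial pass mark and convergence parameter default to the empirically chosen values reported below.

from typing import FrozenSet, Iterable, List, Set, Tuple

Symbol = Tuple[str, str]
Clause = FrozenSet[Symbol]

def clause_mark(relevant_functions: Set[Symbol], clause: Clause) -> float:
    # Fraction of the clause's function symbols that are already relevant.
    return len(clause & relevant_functions) / len(clause) if clause else 0.0

def relevant_clauses(goals: Iterable[Clause], axioms: Iterable[Clause],
                     pass_mark: float = 0.6, c: float = 2.4) -> List[Clause]:
    # Accumulate a pool of relevant functions instead of per-clause marks, and
    # raise the pass mark after each round by p' = p + (1 - p)/c.
    goal_clauses = list(goals)
    relevant_functions: Set[Symbol] = set().union(*goal_clauses) if goal_clauses else set()
    pool = list(axioms)
    accepted: List[Clause] = []
    while True:
        newly = [cl for cl in pool if clause_mark(relevant_functions, cl) >= pass_mark]
        if not newly:
            return accepted
        accepted.extend(newly)
        pool = [cl for cl in pool if cl not in set(newly)]
        pass_mark += (1 - pass_mark) / c
        relevant_functions |= set().union(*newly)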
Figure 3: Success Rates and Problem Sizes Against Pass Marks (success rate, plotted from 50 to 80 percent, and average problem size, from 0 to 1000 clauses, against pass marks from 0.0 to 0.9)
Since the aim of these experiments was to determine the best values for p and c, the
choice of ATP probably makes little difference. In particular, if
the filter has removed essential axioms, then no ATP will find a proof; this is why the
success rate drops when p > 0.6. We repeat that these graphs are for illustration only;
our parameter settings are based on tests that took hundreds of hours of processor time.
4.4 Taking Rarity into Account
Another refinement takes account of the relative frequencies of the functions in the
clause set. Some functions are common while others are rare. An occurrence of a rare
function would seem to be a strong indicator of relevance, while its very rarity would
ensure that not too many new clauses are included. In the relevance quotient m/n,
we boost m to take rarity into account, while leaving n to denote the total number of
functions. Taking rarity into account in n would prevent the inclusion of clauses that
involve another rare function, such as a Skolem function (each Skolem function is rare,
since it only occurs in a few clauses).
This strategy requires a suitable frequency function, which calculates the weight of
a symbol according to the number of its occurrences in the entire set of axiom clauses.
It is similar to our strategy mentioned in Sect. 4.3, except for the function clause_mark
shown in Figure 4.
A few explanations may be helpful here.
• The relevance mark of a clause C calculated by clause_mark is no longer the fraction
of relevant functions in C. Instead, we use the function func_weight to
compute the sum of the relevant functions' marks, weighted by their frequencies.
• A suitable frequency function (frequency_function) is needed.
function clause_mark:
input: RF, a set of relevant functions
C, a clause
ftab, a table of the number of occurrences of each function in the clause set
output: the relevance mark of C
begin
CF := functions_of C;
R := CF ∩ RF; #where names and types agree
IR := CF − R; #remaining functions of C
M := 0;
for each function F in R do { M := M + func_weight (ftab, F) }
return (M / (M + |IR|));
end
function func_weight:
input: ftab, a table of the number of occurrences of each function in the clause set
F, a function symbol
output: the frequency weighted mark for F
begin
freq := number_of_occurrences (ftab, F);
return (frequency_function freq);
end
Figure 4: A Filtering Strategy for Rarely-Occurring Functions
After much testing we found that its value should decrease gently as the frequency
increases. If a function occurs n times in the problem, then we increase its contribution
from 1 to 1 + 2/log(n + 1). An even gentler decrease, such as 1 + 1.4/log(log(n + 2)),
can work well, but something like 1 + 1/√n penalizes frequently-occurring functions
too much. Hence, our definition of frequency_function is
frequency_function(n) = 1 + 2/log(n + 1)
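The frequency-weighted clause mark of Figure 4 can be sketched in Python as follows; it is intended as a drop-in replacement for the clause_mark function in the sketch given after Figure 2. The base of the logarithm, which the text leaves unspecified, is assumed here to be natural, and the symbol and table representations are again our own.

import math
from typing import Dict, FrozenSet, Set, Tuple

Symbol = Tuple[str, str]
Clause = FrozenSet[Symbol]

def frequency_function(n: int) -> float:
    # Weight of a function occurring n times (n >= 1) in the axiom clauses;
    # natural logarithm assumed.
    return 1 + 2 / math.log(n + 1)

def clause_mark(relevant_functions: Set[Symbol], clause: Clause,
                occurrences: Dict[Symbol, int]) -> float:
    # Relevant symbols contribute their frequency weight; irrelevant symbols
    # still count 1 each, so a rare irrelevant symbol (such as a Skolem
    # function) does not block inclusion.  Symbols missing from the table are
    # treated as occurring once.
    relevant = clause & relevant_functions
    irrelevant = clause - relevant_functions
    m = sum(frequency_function(occurrences.get(f, 1)) for f in relevant)
    return m / (m + len(irrelevant)) if clause else 0.0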
4.5 Other Refinements
Hoping that unit clauses did not excessively increase resolution’s search space, we exper-
imented with adding all “sufficiently simple” unit clauses at the end of the procedure. A
unit clause was simple provided it was not an equation; an equation was simple provided
its left- or right-hand side was ground (that is, variable-free). We discovered that over
100 unit clauses were often being added, and that they could indeed increase the search
space. By improving the relevance filter in other respects, we found that we could do
without a special treatment of unit clauses. A number of attempts to favour shorter
clauses failed to produce any increase in the success rate.
Definition expansion is another refinement. If a function f is relevant, and a unit
clause such as f(X) = t is available, then that clause can be regarded as relevant. To avoid
including “definitions” like 0 = N × 0, we check that the variables of the right-hand
side are a subset of those of the left-hand side. Definition expansion is beneficial, but
its effect is small.
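The variable condition is easy to express in code. Below is a small Python sketch in which terms are represented as nested tuples and variables as capitalised strings, a convention of ours chosen purely for illustration; the relevance check on f itself is omitted.

from typing import Set, Tuple, Union

Term = Union[str, Tuple]     # "X" is a variable, "0" a constant, ("f", "X") a compound term

def term_vars(t: Term) -> Set[str]:
    # Collect the variables of a term (capitalised atoms).
    if isinstance(t, str):
        return {t} if t[:1].isupper() else set()
    return set().union(*(term_vars(arg) for arg in t[1:]))

def is_expandable_definition(lhs: Term, rhs: Term) -> bool:
    # Accept a unit equation lhs = rhs as a definition only if every variable
    # of the right-hand side also occurs on the left; this excludes
    # "definitions" such as 0 = N * 0.
    return term_vars(rhs) <= term_vars(lhs)

assert is_expandable_definition(("f", "X"), ("g", "X", "0"))       # f(X) = g(X, 0): accepted
assert not is_expandable_definition("0", ("times", "N", "0"))      # 0 = N * 0: rejected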
As of this writing, our system still contains a manually produced blacklist of 117
(down from 148) HOL theorems. We also have a whitelist of theorems whose inclusion is
forced; it contains one single theorem (concerning the subset relation), which we found
that the filter was frequently rejecting. We can probably reduce the blacklist drastically
by checking which of those theorems would survive relevance filtering. However, having
a high success rate is more important to us than having an elegant implementation.
5 Empirical Results
Extensive testing helped us determine which methods worked best and to find the best
settings of the various parameters. The graphs compare the success rates of filtered
problems against raw ones. The runtime per problem increases from 10 to 300 seconds.
Success rates are plotted on a scale ranging from 40 to 90 percent. We tested the three
provers we have found to be best—E, SPASS and Vampire—in their default mode and
with alternative settings. E version 0.9 “Soom” is surpassed by version 0.91dev001
(Fig. 5), a development version of “Kanyam” that includes heuristics designed for our
problem set. (Version 0.9 surpasses the official Kanyam release, E version 0.91, on our
problems.) SPASS (Fig. 6) seems to perform better if SOS is enabled and splitting is
disabled.² Vampire does extremely well in its CASC mode (Fig. 7).
We ran these tests on a bank of dual AMD Opteron processors running at 2400 MHz.
The Condor system³ managed our batch jobs.
² The precise option string is -Splits=0 -FullRed=0 -SOS=1.
³ http://www.cs.wisc.edu/condor/
Figure 5: E, Versions 0.9 and 0.91dev001 (two plots of percent solved, filtered versus raw, from 40 to 90 percent, against runtime per problem from 0 to 300 seconds)
Figure 6: SPASS, Default Settings and with SOS (two plots of percent solved, filtered versus raw, from 40 to 90 percent, against runtime per problem from 0 to 300 seconds)
Figure 7: Vampire, Default Settings and in CASC Mode (two plots of percent solved, filtered versus raw, from 40 to 90 percent, against runtime per problem from 0 to 300 seconds)
The graphs offer compelling evidence that relevance filtering is beneficial in our
application. The success rate for the filtered problems exceeds that for the raw ones in
virtually every case. This improvement is particularly striking given that eight percent
of the filtered problems appear to be missing essential clauses, compared with their
automatically-reduced counterparts. Perhaps gains are made elsewhere, or these eight
percent are too difficult to prove anyway. A further surprise is that, with the exception of
Vampire running in CASC mode, increasing processor time to 300 seconds does not give
an advantage to the raw problems. The success rate for raw problems should eventually
exceed that of the filtered ones, but the crossover point appears to be rather distant.
Exceptions are E version 0.91dev001, where crossover appears to be taking place at 280
seconds, and Vampire in CASC mode, where by 300 seconds the two lines are close.
6 Conclusions
We wish to refute large, machine-generated sets of clauses. Experiments with the notion
of “referenced axioms” demonstrate that reducing the problem size greatly improves the
success rate. However, this technique introduces a dependence on past proof attempts,
so we have sought methods of reducing the problem size through a simple analysis of
the problem alone.
We have presented simple ideas for relevance filtering along with empirical evidence
to demonstrate that they improve the success rate in a great variety of situations.
The simplicity of our methods is in stark contrast to the tremendous sophistication
of automated theorem provers. It is surprising that such simple methods can yield
benefits. We believe that the secret is our willingness to sacrifice completeness in order
to improve the overall success rate. Our method may be useful in other applications
where processor time is limited and completeness is not essential. It is signature based,
so it works for any problem for which the conjecture clauses have been identified. Our
version of the filters operates on Isabelle theorems and assumes Isabelle’s type system,
but versions for standard first-order logic should be easy to devise.
Our filtering gave the least benefit with the specially-modified version of the E prover.
Developer Stephan Schulz, in an e-mail dated 14 April 2006, had an explanation:
E has a number of goal-directed search heuristics. The new version always
selects a fairly extreme goal-directed one for your problems . . . [which] will
give a 10 times lower weight to symbols from a conjecture than to other
symbols (all else being equal). I suspect that this more or less simulates
your relevance filtering.
Such a setting could probably be added to other ATPs, and in a fashion that preserves
completeness.
The weighting mechanisms of other ATPs may be able to simulate our filter. How-
ever, finding good configurations requires expert knowledge of each ATP being used.
We were never able to improve upon the default configuration of E, despite its good
documentation and helpful developer. (Many ATPs do not have these benefits.) In
contrast, our filter provides a uniform solution for all ATPs.
We do not feel that our methods can be significantly improved; their technological
basis is too simple. More sophisticated techniques, perhaps based on machine learning,
may yield more effective relevance filtering.
As an offshoot of the work reported above, we have submitted 565 problems to the
TPTP library. These are 285 raw problems plus 280 solutions: versions that have been
automatically reduced as described in Sect. 3 above.
Acknowledgements
The research was funded by the EPSRC grant GR/S57198/01 Automation for Interactive
Proof and by the L4.verified project of National ICT Australia. Stephan Schulz provided
a new version of the E prover, optimized for our problem set.
References
[ABH+ 98] Wolfgang Ahrendt, Bernhard Beckert, Reiner Hähnle, Wolfram Menzel,
Wolfgang Reif, Gerhard Schellhorn, and Peter H. Schmitt. Integrating au-
tomated and interactive theorem proving. In Wolfgang Bibel and Peter H.
Schmitt, editors, Automated Deduction— A Basis for Applications, volume
II. Systems and Implementation Techniques, pages 97–116. Kluwer Academic
Publishers, 1998.
[BG01] Leo Bachmair and Harald Ganzinger. Resolution theorem proving. In Robin-
son and Voronkov [RV01b], chapter 2, pages 19–99.
[BHdN02] Marc Bezem, Dimitri Hendriks, and Hans de Nivelle. Automatic proof con-
struction in type theory using resolution. Journal of Automated Reasoning,
29(3-4):253–275, 2002.
[BR04] David Basin and Michaël Rusinowitch, editors. Automated Reasoning —
Second International Joint Conference, IJCAR 2004, LNAI 3097. Springer,
2004.
[FF99] Marc Fuchs and Dirk Fuchs. Abstraction-based relevancy testing for model
elimination. In Harald Ganzinger, editor, Automated Deduction — CADE-16
International Conference, LNAI 1632, pages 344–358. Springer, 1999.
[MP04] Jia Meng and Lawrence C. Paulson. Experiments on supporting interactive
proof using resolution. In Basin and Rusinowitch [BR04], pages 372–384.
[MQP06] Jia Meng, Claire Quigley, and Lawrence C. Paulson. Automation for in-
teractive proof: First prototype. Information and Computation, 2006. In
press.
[Pau97] Lawrence C. Paulson. Generic automatic proof tools. In Robert Veroff,
editor, Automated Reasoning and its Applications: Essays in Honor of Larry
Wos, chapter 3. MIT Press, 1997.
[PY03] David A. Plaisted and Adnan Yahya. A relevance restriction strategy for
automated deduction. Artificial Intelligence, 144(1-2):59–93, March 2003.
[Rob65] J. A. Robinson. A machine-oriented logic based on the resolution principle.
Journal of the ACM, 12:23–41, 1965.
[RS98] Wolfgang Reif and Gerhard Schellhorn. Theorem proving in large theories.
In Wolfgang Bibel and Peter H. Schmitt, editors, Automated Deduction—
A Basis for Applications, volume III. Applications, pages 225–240. Kluwer
Academic Publishers, 1998.
[RV01a] Alexander Riazanov and Andrei Voronkov. Vampire 1.1 (system description).
In Rajeev Goré, Alexander Leitsch, and Tobias Nipkow, editors, Automated
Reasoning — First International Joint Conference, IJCAR 2001, LNAI 2083,
pages 376–380. Springer, 2001.
[RV01b] Alan Robinson and Andrei Voronkov, editors. Handbook of Automated Rea-
soning. Elsevier Science, 2001.
[SBF+ 03] Jörg Siekmann, Christoph Benzmüller, Armin Fiedler, Andreas Meier, Im-
manuel Normann, and Martin Pollet. Proof development with Ωmega: The
irrationality of √2. In Fairouz Kamareddine, editor, Thirty Five Years of Au-
tomating Mathematics, pages 271–314. Kluwer Academic Publishers, 2003.
[Sch04] Stephan Schulz. System description: E 0.81. In Basin and Rusinowitch
[BR04], pages 223–228.
[Wei01] Christoph Weidenbach. Combining superposition, sorts and splitting. In
Robinson and Voronkov [RV01b], chapter 27, pages 1965–2013.
[WRC65] Lawrence Wos, George A. Robinson, and Daniel F. Carson. Efficiency and
completeness of the set of support strategy in theorem proving. Journal of
the ACM, 12(4):536–541, 1965.