Semantic Selection of Premisses for Automated Theorem Proving

Petr Pudlák
Charles University, Prague
petr.pudlak@mff.cuni.cz

Abstract

We develop and implement a novel algorithm for discovering optimal sets of premisses for proving and disproving conjectures in first-order logic. The algorithm uses interpretations to semantically analyze the conjectures and the set of premisses of the given theory and to find optimal subsets of the premisses. For each given conjecture the algorithm repeatedly constructs interpretations using an automated model finder, uses the interpretations to compute the optimal subset of premisses (based on the knowledge it has at that point) and tries to prove the conjecture using an automated theorem prover.

1 Importance of selecting appropriate premisses in automated theorem proving

A proper set of premisses¹ can be essential for proving a conjecture by an automated theorem prover. Clearly, the larger the number of initial premisses, the larger the number of inferred formulae. And since for most proving techniques the number of inferred formulae is in general super-exponential in the number of input formulae, the impact on the performance of an automated theorem prover is quite significant. Removing even a single superfluous premiss can make the difference between being and not being able to prove the conjecture.

This problem is even more serious when reasoning within large mathematical theories, which can contain hundreds of premisses and thousands of conjectures. In such cases, selecting proper premisses for proving the conjectures becomes a necessity.

In our previous work [Pud06a, Pud06b] we described a method for compacting proofs of conjectures. By constructing lemmas from the proofs and by syntactically analyzing both the lemmas and the proofs, we constructed sets of premisses that produced shorter proofs of the given conjectures, or allowed us to construct proofs much faster.
However, the assumption was that for each conjecture we already knew some set of premisses from which we were able to prove the conjecture (although possibly a redundant one, inefficient for constructing a proof). If we were not able to prove the conjecture at all, the situation was more complicated.

¹ Using the notion axioms for the formulae the prover uses as assumptions sometimes causes confusion; therefore we shall prefer the notion premisses instead. See section 2.1.

One possibility is to syntactically analyze the formulae and/or use an AI algorithm for guessing the proper premisses. Successful examples of such procedures are Josef Urban's tools for the Mizar Project [Urb06a, Urb06b], the reduction of axiom sets in software verification [RS98], and the filtering of axioms for machine-generated problems [MP06]. Although even simple syntactic heuristics can be very effective, the syntactic approach is generally restricted to the cases when the (syntactic) structure of the formulae reflects their semantics well. Clearly, this is not always the case. Moreover, syntactic analysis is usually only a heuristic procedure that tries to learn which premisses could be needed, but is likely to fail on a new kind of problem. Syntactic filters are also often incomplete in the sense that they can remove too many premisses.

The procedure we describe in this article uses semantic analysis. By observing which formulae are true in which interpretations, we can get a deeper insight into the nature of a conjecture and compute a set of premisses suitable for proving it. As far as we are aware, this is a novel approach, which has not been researched before. We shall focus on two interconnected goals:

• Determine which sets of premisses are sufficient for proving the conjecture;
• among those sets, choose one that contains no redundant premisses and that is minimal with respect to some criterion.
The criterion can be a simple one – to minimize the number of premisses – or a more complex one, for example, to avoid premisses of a certain kind that complicate the proving process.

2 Semantic analysis of problems using interpretations

2.1 Notation

Throughout the whole article we work within the domain of first-order logic. We always assume that a language is given and that all formulae we work with are formulated in this language and are closed.

We denote formulae by regular letters (e.g. A, F), interpretations by calligraphic letters (e.g. M), sets of formulae by boldface letters (e.g. B, AC) and sets of sets or sets of interpretations by script letters (e.g. F, N). We also use the following symbols:

M |= B    M is a model of the formula B    (1)
M |= B    M is a model of all formulae in the set B (i.e. a model of the conjunction of all formulae in B)    (2)
B |= G    for every interpretation M, if M |= B then M |= G (i.e. G is a consequence of B)    (3)
B ⊢ G    G can be proved from B    (4)

For brevity we shall often say that "a formula B avoids an interpretation M" (resp. B avoids a set of interpretations M) if and only if M ⊭ B (resp. M ⊭ B for all M ∈ M).

Formulae given to the prover as a basis for (dis)proving conjectures are often simply called axioms. However, this sometimes causes confusion, because the given formulae need not actually be the axioms of a particular theory. They can also be propositions that have already been proved, lemmas, etc. Therefore, we shall instead call such assumptions premisses.

2.2 Basic principle

Let a conjecture C and a set of premisses A (finite, but presumably large) be given. We would like to prove (or to refute) C from A using an automated theorem prover. But since A is large, the prover is overloaded by the high number of premisses and we are not able to prove C directly from A.
We expect that only a small subset of A is necessary to prove or to refute C, and we would like to find such a subset.

The idea of the semantic analysis is based on two principal properties of first-order logic:

1. If B ⊢ C, then for any interpretation M such that M ⊭ C also M ⊭ B. That is, there is B ∈ B such that M ⊭ B.
2. Given a set of premisses B,
• either C is provable from B, hence B ⊢ C, which is equivalent to B |= C by the completeness theorem,
• or there is an interpretation M such that M |= B ∪ {¬C}.

Using these two properties we construct a sequence of interpretations Mi and a sequence of premisses Fi (for convenience let us set Bi = {F0, ..., Fi−1}). At each step we select Fi from the set Ai = {F ∈ A | Mi ⊭ F}, and eventually some set Bk will be sufficient for proving C.

Let us start with an empty set of premisses B0 = ∅. If C is a tautology, then B0 ⊢ C and we are finished. Otherwise, there is some interpretation M0 such that M0 |= ¬C (see Fig. 1a). Let

A0 = {F ∈ A | M0 ⊭ F}.

Let us select any F0 ∈ A0 and set B1 = {F0}. Because M0 ⊭ F0, also M0 ⊭ B1. Now, either B1 |= C, and thus B1 ⊢ C, or there is an interpretation M1 such that M1 |= B1 ∪ {¬C} (see Fig. 1b). We set A1 = {F ∈ A | M1 ⊭ F} and continue in a similar fashion.

In step i, either already Bi |= C (as in Fig. 1d, where i = 3), or there is an interpretation Mi such that Mi |= Bi ∪ {¬C} (as in Fig. 1c, where i = 2). In that case we set

Ai = {F ∈ A | Mi ⊭ F}.

If Ai is empty, then Mi |= A ∪ {¬C}, hence C is not a theorem of A. Otherwise, we set Bi+1 = Bi ∪ {Fi}, where Fi is an arbitrary formula chosen from Ai.

Each point of the plane of the diagram represents a single possible interpretation of the language. A formula is represented as a shape that contains precisely the interpretations in which the formula is true. The conjecture C is illustrated by the gray rectangle. First, we construct a model M0 of ¬C (a).
We pick some premiss F0 that avoids the model. Next, we construct a model M1 of {F0, ¬C} (b) and pick some other premiss F1 which avoids this model. Then, because we still cannot prove C from {F0, F1}, we construct a model M2 of {F0, F1, ¬C} (c) and pick yet another premiss F2 that avoids it. Now, there is no model of {F0, F1, F2, ¬C} (d) and hence we can prove C from {F0, F1, F2}.

Figure 1: Illustration of the semantic selection of premisses.

Observe that at each step Bi avoids all constructed interpretations M0, ..., Mi−1. Also note that Bi ∩ Ai = ∅, since Mi |= Bi, but for each F ∈ Ai we know that Mi ⊭ F by the definition of Ai. Therefore, at each step, a different formula is moved from Ai into Bi (and possibly other formulae are removed from Ai), and |Bi| = i. And since A is finite, either Bk |= C for some k, or Al = ∅ for some l and thus A ⊭ C.

2.3 Adaptation for automated theorem provers

Automated theorem provers, model finders and other similar tools are always limited in available resources such as time or computer memory. Hence, if we are given a set of premisses B and a conjecture C, the result of a computation can be that we either

• find a proof of C from B, or
• find an interpretation M such that M |= B ∪ {¬C}, or
• run out of resources.

We have to take the last possible outcome into account as well and modify the procedure described in the previous section to deal with such a result. The simplest possible solution is that if at step n we are neither able to prove C from Bn nor to find an interpretation Mn such that Mn |= Bn ∪ {¬C}, we backtrack and at some previous step k < n try to pick a different formula Fk from Ak.

Nevertheless, we shall now describe a more general solution that allows us to proceed even in those cases when we are able neither to prove nor to disprove C.
The procedure remembers all failed attempts (sets of premisses) in a system of sets F, to avoid trying the same attempt again and thereby entering an infinite loop. All interpretations constructed during the computation are stored in a set M and are then reused for selecting the premisses. Let us give a schema of the procedure:

1. Initialize F = ∅ and M = ∅.
2. Repeat:
(a) Construct a subset of premisses B ⊆ A such that B ∉ F and such that for each M ∈ M: M ⊭ B. If no such B can be found, report failure and exit.
(b) Try to prove C from B. If successful, print the proof and exit.
(c) Otherwise, try to construct an interpretation M such that M |= B ∪ {¬C}. In our case, this step is performed by an automated model finder like Paradox or Darwin.
(d) If such a model M is found, then
i. if M |= A, then report that A ⊭ C and exit;
ii. otherwise add M to M, and
iii. loop to 2.
(e) If no such model is found, add B to F and loop to 2.

We only need to remember the failed attempts, because if we successfully construct a model M |= B, the condition M ⊭ B in 2a prevents choosing B again.

The procedure described in section 2.2 was a special case of this generalized procedure. As there were no failed attempts, the system F remained empty, and in the i-th cycle the set B constructed in 2a was the set Bi.

As mentioned in step 2c, we use an external tool to construct the interpretations. There is no restriction on the constructed interpretations except that we must be able to decide whether a formula is true in them. However, as far as we are aware, all current model finders are limited to constructing only finite interpretations.

The important part, which we have omitted until now, is the construction of the set of premisses B in 2a. We shall examine this problem in the next section and give a specification of the complete algorithm.
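As a concrete toy illustration, the schema above can be sketched in a propositional setting, where an interpretation is a set of true atoms and a formula is represented extensionally by its set of models. The names (prove, find_model, naive_choose) and the exhaustive subset chooser are our own illustrative stand-ins, not the paper's implementation; step 2a is implemented here by brute force, which section 3 replaces with a greedy set-cover construction.

```python
from itertools import combinations

# Toy setting: an interpretation is a frozenset of true atoms; a formula
# is represented extensionally as a frozenset of satisfying interpretations.
ATOMS = ("p", "q", "r")
ALL_INTERPS = [frozenset(s) for n in range(len(ATOMS) + 1)
               for s in combinations(ATOMS, n)]

def models_of(premisses):
    """Interpretations that satisfy every formula in the set."""
    ms = set(ALL_INTERPS)
    for f in premisses:
        ms &= f
    return ms

def prove(premisses, conjecture):
    """Stand-in prover: B proves C iff every model of B satisfies C."""
    return models_of(premisses) <= conjecture

def find_model(premisses, conjecture):
    """Stand-in model finder: some M |= B u {not C}, or None."""
    return next(iter(models_of(premisses) - conjecture), None)

def naive_choose(A, models, failed):
    """Step 2a by brute force: a smallest B not in `failed`
    that avoids every known model."""
    for k in range(len(A) + 1):
        for B in combinations(A, k):
            Bset = frozenset(B)
            if Bset in failed:
                continue
            if all(any(M not in f for f in B) for M in models):
                return Bset
    return None

def semantic_selection(A, C):
    failed, models = set(), []          # F and M from the schema
    while True:
        B = naive_choose(A, models, failed)
        if B is None:                   # step 2a failed
            return ("fail", None)
        if prove(B, C):                 # step 2b
            return ("proved", B)
        M = find_model(B, C)            # step 2c
        if M is None:                   # step 2e: a failed attempt
            failed.add(B)
        elif all(M in f for f in A):    # step 2d-i: M |= A, so A does not entail C
            return ("countermodel", M)
        else:
            models.append(M)            # step 2d-ii: remember the model
```

Running this sketch with the three premisses p, q, r (each read extensionally) and the conjecture p ∧ q terminates with the proved set {p, q}; the superfluous premiss r is never selected, because every constructed counter-model is already avoided by p or q.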
3 Specification of the algorithm

3.1 Converting the task to the weighted set cover problem

In the previous section we outlined a schema of an algorithm for selecting premisses for proving a conjecture. In this section we construct a complete, generalized algorithm that tries to find a minimal set of premisses with respect to a given criterion. The criterion is represented by a weight function on premisses, and our goal is to find a set of premisses for which the sum of the weights is minimized.

Definition 1 (Weight function). Let A be a set of premisses. A weight function β is a function that maps formulae from A to positive real numbers:

β : A → R+

We shall now focus on the construction of the optimal set of premisses in step 2a of the procedure given in section 2.3. Our aim is to find a subset of premisses B ⊆ A that avoids all known interpretations of ¬C and for which the sum Σ_{B∈B} β(B) is minimized.

For each premiss B ∈ A let CB be the set of interpretations that B avoids:

CB = {M ∈ M | M ⊭ B}    (5)

Thus, B avoids all the interpretations in M if and only if ∪_{B∈B} CB = M.

By this assignment we have converted our problem to the well-known weighted set cover problem:

Definition 2 (The weighted set cover problem). For the input

• ground elements M = {M1, ..., Mn},
• subsets CB ⊆ M, where B ∈ A,
• weights β(B) defined for B ∈ A,

find a set B ⊆ A that minimizes Σ_{B∈B} β(B) such that ∪_{B∈B} CB = M.

For the unweighted set cover problem we take β(B) = 1 for all B. In our case, the elements being covered are the interpretations we have constructed so far, and the covering sets are defined by (5).

There are many theoretical results regarding this problem in the literature. For us the most important results concern the greedy algorithm, which approximates the problem:

Definition 3 (Greedy algorithm for the weighted set cover problem).
The greedy algorithm for set cover selects sets according to the rule: at each stage, select the set that minimizes the cost (with respect to the weight function β) per additional element covered. More formally, if N is the set of uncovered ground elements, select the B for which β(B)/|CB ∩ N| is minimal.

For the unweighted set cover problem this means that we select the set for which |CB ∩ N| is maximal, that is, the set which contains the largest number of uncovered elements.

By the following results, the greedy algorithm gives the best possible polynomial approximation of the weighted set cover problem:

Remark 1 (Properties of the greedy algorithm). The greedy algorithm for the weighted set cover problem has the following properties:

• The weighted set cover problem is NP-complete (see [Kar72]).
• The approximation ratio² of the greedy algorithm is H(max_{B∈B} |CB|), where H(n) = Σ_{k=1}^{n} 1/k ≤ ln(n) + 1 (see [Chv79]).
• This is essentially the best possible approximation (see [Fei98]).

3.2 Searching for sets of premisses using the greedy set cover algorithm

The greedy algorithm described in the previous section finds an approximation of the optimal subset of premisses. If we are able to prove C from B, we are finished; and if we are able to find a model of B ∪ {¬C}, we use the greedy algorithm to find a new optimal subset. A problem arises if we are neither able to prove C nor to find a new interpretation.

Therefore, we slightly modify the greedy algorithm so that we are able to backtrack and find another, next-to-optimal set of premisses that covers all the interpretations. Hereby we are able to construct a sequence of sets of premisses until we prove C or find a new interpretation.

The work of the modified algorithm can be viewed as traversing a search tree, where each node of the tree corresponds to a subset of premisses selected from A. The original

² An algorithm Γ is an α-approximation algorithm for a minimization problem Π if
– Γ runs in polynomial time.
– Γ always produces a solution whose value is within a factor of α of the value of an optimal solution of Π.
The least α that satisfies these conditions is called the approximation ratio of Γ.

greedy algorithm traverses only a single branch of the search tree until it finds a covering set. The modified algorithm remembers its position in the search tree, and if it is neither able to prove the conjecture from the premisses nor to find a new interpretation, it continues traversing the tree according to the greedy rule. If it exhausts the whole tree, it exits with a failure.

Remark 2. The theoretical results described in the previous section cover only the original greedy algorithm. Whether the subsequent covering sets returned by this extended algorithm also approximate the optimum well remains to be researched. However, we believe this to be a reasonable assumption.

The algorithm in pseudo-code. Let us now look at the schema of the algorithm. The main function searchForProofOrModel recursively searches the sub-tree that corresponds to a given set of premisses B. It sorts the remaining admissible premisses B1, ..., Bk by their cost per additional covered element (with respect to the weight function β) and recursively processes the sets B ∪ {Bi} until either a model is found or the whole sub-tree is exhausted (the original greedy algorithm always picks B ∪ {B1}). The function returns either a new model, or a NIL value if it exhausts all nodes of the sub-tree.

The traversed nodes are remembered using a global set variable F. It is maintained so that

{Y ⊆ A | there is X ∈ F such that X ⊆ Y}

is the set of all visited nodes of the tree. Every time we visit a whole sub-tree, we replace all the visited sets of premisses occurring in the sub-tree by the set of premisses corresponding to the root of the sub-tree (line 40).
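The greedy selection rule itself can be sketched as follows; cover[B] plays the role of C_B and beta[B] of β(B). The function name and data layout are our own illustration, not the paper's Java implementation.

```python
def greedy_set_cover(elements, cover, beta):
    """Greedy weighted set cover: repeatedly pick the set B that
    minimizes beta[B] / |cover[B] & uncovered| (Definition 3)."""
    uncovered = set(elements)
    chosen = []
    while uncovered:
        candidates = [B for B in cover if cover[B] & uncovered]
        if not candidates:        # remaining elements cannot be covered
            return None
        best = min(candidates,
                   key=lambda B: beta[B] / len(cover[B] & uncovered))
        chosen.append(best)
        uncovered -= cover[best]
    return chosen
```

For example, with elements {1, 2, 3, 4}, covering sets a → {1, 2, 3}, b → {3, 4}, c → {4} and unit weights, the rule first picks a (cost 1/3 per element) and then b, covering everything with two sets; raising the weight of a to 10 makes the rule pick b first instead.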
This allows us to keep F small and thus to efficiently recognize the sets of premisses we have already inspected.

1  /* first, try to find any model of ¬C */
2  try to prove C from ∅; if successful, print |= C and exit;
3  try to find a model M |= ¬C;
4  if successful, set M := {M};
5  otherwise exit "Failed to prove or disprove C from A";
6  /* now we have at least one element to cover */
7  label ADD_MODEL:
8  for B ∈ A do
9    CB := {M ∈ M : M ⊭ B};  /* initialize the covering sets CB */
10 F := ∅;  /* initialize the set of failed attempts */
11 M := searchForProofOrModel(∅);  /* search for a new model */
12 if M ≠ NIL then  /* if a new model is found, search again */
13   if M |= A then
14     exit "C not provable from A"
15   /* invariant: M ∉ M, see Corollary 1 that follows */
16   M := M ∪ {M};
17   goto ADD_MODEL;
18 /* otherwise exit with failure */
19 exit "Failed to prove or disprove C from A";

20 function searchForProofOrModel(set of premisses B)
21   returns new model or NIL
22 begin
23   N := M \ ∪_{B∈B} CB;  /* the set of uncovered models */
24   let B1, ..., Bk be the premisses from A for which (CBi ∩ N) ≠ ∅,
25     sorted by β(Bi)/|CBi ∩ N| in ascending order;
26   for i := 1 to k do
27   begin
28     B′ := B ∪ {Bi};
29     if ¬((∃ X ∈ F) : X ⊆ B′) then
30       N′ := N \ CBi;
31       if N′ = ∅ then
32         try to prove C from B′; if successful, print B′ |= C and exit;
33         try to find a model M |= B′ ∪ {¬C}; if successful, return M;
34       else  /* N′ ≠ ∅, i.e. there are still uncovered models */
35         M := searchForProofOrModel(B′);  /* recursively try to find a model */
36         if M ≠ NIL then
37           return M;
38       /* whether N′ = ∅ or not, remember the visited sub-tree: */
39       /* replace all descendants of B′ by B′ */
40       F := {B′} ∪ (F \ {X ∈ F | B′ ⊆ X});
41   end;
42   /* exhausted all possibilities, neither a proof nor a new model was found */
43   return NIL;
44 end;

3.2.1 Constructed interpretations and termination of the algorithm

Lemma 1.
Let M1 and M2 be two interpretations constructed by the algorithm, M2 after M1. Let B1 and B2 be the corresponding sets of premisses B′ from which the interpretations were constructed at line 33 (Mi |= Bi ∪ {¬C}). Then B1 ≠ B2 and there is a formula F such that M2 |= F but M1 ⊭ F.

Proof. B2 is selected so that for every M ∈ M at that point, M ⊭ B2. In particular M1 ⊭ B2, hence there is F ∈ B2 such that M1 ⊭ F. But M2 |= F (because M2 |= B2), hence F has the desired properties. And because M1 |= B1, we have F ∉ B1, thus B1 ≠ B2.

Corollary 1. No two constructed interpretations are isomorphic.

Theorem 1. The algorithm terminates.

Proof. By Lemma 1, every time the program constructs an interpretation it uses a different subset of premisses. There are only 2^|A| possible subsets of A, hence the algorithm can construct at most 2^|A| interpretations and must eventually terminate.

4 Implementation

We have implemented the algorithm with the aim of proving or disproving several conjectures at once. The program is written in the Java programming language and is connected to E prover ver. 0.99 [Sch02, Sch07] and two model finders: Darwin ver. 1.3FM [FBT07, BFdNT06] and Paradox ver. 2.0 [CS07, CS03]. When constructing models, the program runs both model finders simultaneously and takes the result of the first one that completes. As the model finders take quite different approaches, this arrangement leads to a better success rate, and on multi-processor machines also to shorter times.

4.1 Input of the program

The input of the program is a list of TPTP³ files. The formulae in each file can be divided into these categories:

Conjecture (denoted by C). The formula which we want to prove or disprove from the premisses given in the same file. (TPTP name conjecture.)

Premisses (denoted by AC). The set of formulae from which the algorithm should select the optimal set of premisses for (dis)proving the conjecture C. (TPTP name axiom.)
Definitions (denoted by DC). The formulae which are always included as premisses for proving C. In some cases the user may surmise that some formulae are indispensable for the proof, for example the definitions of essential functional or predicate symbols. (TPTP name definition.)

³ Thousands of Problems for Theorem Provers, see [SS98].

4.2 Execution of the program

Recall that we denote by C the set of all conjectures the program is asked to prove, by AC the input set of premisses for proving each conjecture C ∈ C, by DC the definitions given for proving C, and finally by M the set of constructed interpretations. At the beginning M = ∅.

The program runs in a loop until either all conjectures are (dis)proved or it runs out of time. During each pass of the loop the program performs the following steps:

1. Decide which conjecture C to try to (dis)prove at this step (we shall describe this in more detail later).
2. Use the greedy algorithm to select the optimal set of premisses B ⊆ AC that covers all interpretations from {M ∈ M : M |= ¬C}.
3. Run the model finder and the prover on B ∪ {¬C}.
• If a model M is found, set M := M ∪ {M};
• or, if B |= C, report the achievement and remove C from further processing (C := C \ {C});
• or, if neither a model is found nor C proved (we shall call this outcome a failed attempt), save the state of the greedy algorithm so that it can be restored the next time C is selected and another possible covering set constructed.
4. In all cases, loop again to 1.

Selecting the conjecture. When designing how to select the next conjecture to be processed, we had two primary requirements:

1. The program should not spend too much time trying to prove a single, possibly too hard conjecture. Instead, it should alternate between the given conjectures.
2. The program should favour the easier conjectures, because it is likely that it will be able to prove them earlier.
Here too we can take advantage of the information stored in the constructed interpretations. Both requirements above are satisfied by assigning the following weight wC to each conjecture and selecting the conjecture with the highest weight:

wC = |{M ∈ M | M |= C}| − |C| · uC    (6)
uC = the number of previous failed attempts when C was selected.    (7)

The more interpretations C is valid in, the more general C is likely to be, and thus supposedly the easier to prove. And each time C is selected we construct an interpretation M such that M |= ¬C, hence wC does not increase. On the other hand, if another conjecture D is true in M, wD increases. This happens, for example, if D ∈ AC and it is selected as a premiss for proving C.

Subtracting the number of failed attempts (multiplied by the number of conjectures for better efficiency) prevents the program from sticking to a single conjecture in the case when it is unable to construct a new interpretation for a long time and the weights therefore do not change.

4.3 Optimizations

4.3.1 Large number of constructed interpretations

One of the main drawbacks of the algorithm is that when it constructs a new interpretation M |= B ∪ {¬C}, only a small subset B of all possible premisses AC is true in M. Most of the other premisses from AC \ B are usually false in M and therefore they cover M. The high number of possible coverings then leads to the construction of a large number of interpretations.

One possible solution is to add some of the remaining premisses and/or conjectures (AC \ B) ∪ (C \ {C}) to B in such a way that they affect neither the possibility nor the difficulty of constructing M. In particular, if A ∈ (AC \ B) ∪ (C \ {C}) has no predicate symbol in common with any of the formulae in B ∪ {¬C}, we can safely add A to B. Depending on the structure of the problem, we can enlarge B by several formulae and thus reduce the number of possible coverings of M, especially at the beginning of the process, when the sets of selected premisses are small.
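The safety test of this enlargement step can be sketched as follows; for the sketch we reduce each formula to the set of predicate symbols it contains, which is our own illustrative simplification rather than the paper's implementation.

```python
def safe_additions(B, neg_C, others):
    """Formulas from `others` that share no predicate symbol with
    B u {neg_C} and can therefore be added to B without affecting
    the search for a model (section 4.3.1)."""
    used = set(neg_C)          # predicate symbols of the negated conjecture
    for f in B:                # plus those of the selected premisses
        used |= f
    return [f for f in others if not (f & used)]
```

For instance, with B containing a formula over predicate P and the negated conjecture over Q, a formula over R or over S and T is safe to add, while one mentioning P is not.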
On the other hand, in some cases this optimization can considerably slow down the model finder.

4.3.2 Formulae problematic for the model finder

Another question that highly affects the efficiency of the algorithm is the ability of the model finder to construct an interpretation. It is not surprising that some formulae can make the search for models much more difficult. By assigning higher weights β to such problematic formulae we can discourage the algorithm from selecting them.

For example, the model finder Darwin is not optimized for equality reasoning and therefore we primarily wanted to avoid formulae with equality. Secondarily, we wanted to assign higher weights to formulae with many variables, because such formulae also make the process of model construction more difficult (this was suggested by Koen Claessen, the author of Paradox). Therefore, we assigned the weight β of a formula A as

β(A) = θ(A)            if A does not contain '='
β(A) = θ(A) + 10000    if A contains '='    (8)

where θ(A) is defined recursively as

θ(P) = 0    if P is an atom
θ((∀x)F) = θ(F) + 1
θ((∃x)F) = θ(F) + 1
θ(¬F) = θ(F)
θ(F Ω G) = max(θ(F), θ(G))    where Ω is any binary connective    (9)

θ(A) is simply the maximum number of quantifiers we can encounter when traversing from an atom of A to A's topmost connective.

A proper assignment of weights can indeed reduce the number of unwanted outcomes in which the model finder is not able to find an interpretation. We also tried to set the weight to β(A) = 2^θ(A) (resp. β(A) = 2^θ(A) + 1000), as this number more closely corresponds to the number of possible variable assignments that must be examined when deciding the validity of A. This weight also produced very good results. There are of course many other ways to assign the weights, depending on the nature of the particular problem.
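The weights (8)–(9) can be computed by a direct recursion on the formula tree. The tuple encoding below – ("atom", s), ("forall", f), ("exists", f), ("not", f) and ("bin", f, g) for any binary connective – is our own illustrative representation, not the paper's data structure; the atom string is scanned for '=' to apply the equality penalty.

```python
EQ_PENALTY = 10000  # penalty of (8) for formulae containing equality

def theta(f):
    """theta of (9): the maximum number of quantifiers on a path
    from an atom to the topmost connective."""
    tag = f[0]
    if tag == "atom":
        return 0
    if tag in ("forall", "exists"):
        return theta(f[1]) + 1
    if tag == "not":
        return theta(f[1])
    if tag == "bin":
        return max(theta(f[1]), theta(f[2]))
    raise ValueError("unknown node: %r" % (tag,))

def contains_eq(f):
    """True iff the formula contains the equality symbol."""
    if f[0] == "atom":
        return "=" in f[1]
    return any(contains_eq(g) for g in f[1:])

def beta(f):
    """The weight of (8)."""
    return theta(f) + (EQ_PENALTY if contains_eq(f) else 0)
```

For instance, for (∀x)((∃y)P(x,y) ∨ x=y), encoded with "bin" for the disjunction, the sketch gives θ = 2 and β = 10002.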
For example, when deciding relationships between modal logic systems in [Pud06b] we based the weights on the number of modal symbols occurring in the formula.

5 Empirical evaluation

We first used the algorithm when deciding relationships between modal systems in [Pud06b]. The first implementation of the algorithm allowed us to decide several cases that we were not able to solve with an ordinary theorem prover.

The recent, improved version of the program has not yet been evaluated on large sets of problems. However, the present experiments already show that the method is quite promising.

5.1 Tests on the bushy division of The MPTP Challenge

We have tested the program on the bushy division of The MPTP Challenge [US06]. The challenge is focused on automated theorem proving in environments with many axioms, predicates and functors. For the test we selected 121 problems on which E prover spent more than 10 s (or failed altogether) in the MPTP referential tests.

All experiments were run on a Linux machine with four dual-core AMD Opteron 1.8 GHz processors and 5 GB of memory. The program was able to take advantage of the multiple processors by running the prover and the model finders simultaneously.

article name | thm. | E CPU time [s] | proved after [s] | proof time [s] | # of interpretations | # of initial prems. | # of selected prems. | # of failed proof
attempts

finset_1 t17: 87 0.20 4860 87 32 392
funct_1 t21: 485 0.93 412 47 21 322
funct_1 t57: 90 0.03 135 46 20 40
funct_1 t62: 40 0.04 118 46 16 13
funct_1 t145: 391 0.15 171 40 15 253
lattice3 t1: 14 143
lattice3 t3: 9 152
orders_2 t25: 147 84
ordinal1 t23: 8 0.19 56 44 12 3
pre_topc t48: 1114 320 0.24 2074 71 17 2
relat_1 t12: 76 0.02 160 36 14 93
relat_1 t25: 24 210 0.37 266 34 13 149
relat_1 t44: 276 0.04 249 34 15 351
relat_1 t45: 651 0.04 284 34 15 1247
relat_1 t74: 1728 0.26 466 38 17 2140
relat_1 t86: 134 0.60 186 38 18 55
relat_1 t88: 14 47 0.12 153 35 16 19
relat_1 t115: 66 0.62 228 38 18 22
relat_1 t117: 15 39 0.10 145 35 15 15
relat_1 t143: 58 0.06 253 38 18 24
relat_1 t166: 69 0.08 207 38 16 33
relat_1 t167: 759 0.04 508 39 18 890
relat_1 t174: 30 101 0.66 170 39 16 36
relat_1 t178: 41 0.11 199 35 12 11
relset_1 t12: 729 36
subset_1 t43: 89 3 0.18 29 26 14 0
waybell_7 t8: 2 195
waybell_9 t29: 31 249
wellord1 l29: 12 0.03 93 36 15 3
wellord1 t19: 86 0.58 161 54 25 62
wellord1 t21: 9 0.48 89 41 12 0
wellord1 t24: 31 35
wellord2 t3: 15 0.10 131 55 19 0
wellord2 t5: 45 0.06 153 55 21 0
yellow_0 t42: 472 0.07 3237 89 27 0
yellow_0 t60: 790 0.05 4119 91 27 0
yellow_0 t61: 771 0.09 2385 96 34 180
yellow_1 t2: 16 183
zfmisc_1 t99: 3 0.60 11 13 7 0
zfmisc_1 t136: 96 8

Table 1: Results for 121 "difficult" problems from the bushy division of MPTP.

The results of the test are summarized in Table 1. The table shows only the problems solved by E or by the program (or by both). The third column shows the CPU time of standalone E. The fourth column shows the wall-clock time⁴ of the program. The fifth column shows the CPU time of E on the premisses selected by the program. The sixth column shows the number of interpretations constructed by the program during the search. The last column shows the number of failed attempts, that is, the number of cases in which the program was able neither to construct a new model nor to prove the conjecture.
If the program (or E) was not able to solve a problem, the corresponding columns are left blank.

The presented experiments were conducted with the CPU limit set to 1.2 s for both the prover and the model finders. We have noticed that the program performs better when the time limits for the prover and the model finder are very low.⁵ This is explained by the observation that in the vast majority of cases, if a model cannot be found within a few seconds, it is not found at all. And if an optimal (or near-optimal) set of premisses is eventually found, the prover spends very little time on proving the conjecture. Thus, if the time limits are higher, the program wastes much of its time on unsuccessful attempts and is not able to construct as many interpretations.

The program was able to prove 31 of the 121 problems, compared to 15 proved by standalone E. 6 problems were solved by both E and the program. The total time is mostly affected by the number of failed attempts and by the total number of constructed interpretations. As expected, the proofs from the selected subsets of premisses (fifth column) indeed took very little time, just fractions of a second.

E exhibits the usual behavior of proving tools: with increasing time it becomes overwhelmed by the number of derived clauses. In our case, E successfully proved only two conjectures after more than 150 s. On the other hand, when the program was given enough time, it was able to prove many more difficult conjectures.

5.2 SRASS

The idea was recently implemented by Geoff Sutcliffe and Yury Puzis in the SRASS system [SP07]. The system uses more advanced techniques to further improve performance, including a syntactic relevance measure as an ordering heuristic to guide the selection process. SRASS is able to solve many problems that cannot be solved by the underlying theorem prover alone and is among the most successful systems entering The MPTP Challenge [US06].
6 Conclusion and further work

The main difference from commonly used proving techniques is that the algorithm is capable of selecting only those premisses that are necessary for the proof, and that its decisions are made on a semantic rather than a syntactic basis.

4 The program runs in a platform-independent language and moreover spawns multiple child processes (the prover and the model finders); therefore we were not able to gather the total CPU time of the program.
5 Of course, this observation applies only to this particular set of problems.

The program was eventually able to find a proof for some conjectures that were too difficult for the classical black-box prover. When an optimal set of premisses was found, the proof constructed from this set usually took just a fraction of a second. Thus, even if it takes the program a long time to find a proof, the optimal set of premisses provides the information needed to prove the conjecture efficiently.

6.1 Advantages

6.1.1 Reusing interpretations

The interpretations constructed during the process of proving one conjecture can be reused by the algorithm for proving other conjectures, even from different theories. It is only required that the theories share the same language.

By accumulating such interpretations, the program gains knowledge about relationships between theorems and premisses within the theory. As the process is fully automated, the algorithm could be used to gather a large number of such interpretations without any user intervention, and these interpretations could then be used to answer queries in the future.6 Therefore, we surmise that the algorithm could prove its worth for large databases of mathematical knowledge, both for automated and interactive proving of theorems.

6.1.2 Saving the state of the algorithm

The state of the algorithm can be easily saved and restored. The state is fully described by the set of unsuccessful attempts F and by the set of constructed interpretations M.
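For illustration, a checkpoint can be as simple as serializing these two sets. The sketch below is ours, not taken from the implementation: the function names and JSON layout are hypothetical, and each interpretation is assumed to be already reduced to a map from premiss names to truth values.

```python
import json

def save_state(path, failed, models):
    """Checkpoint the search: F as a list of premiss-name lists, M as a
    list of premiss -> truth-value maps (one per interpretation)."""
    with open(path, "w") as f:
        json.dump({"failed": [sorted(s) for s in failed],
                   "models": models}, f)

def restore_state(path):
    """Inverse of save_state: rebuild F as a set of frozensets and
    return M unchanged."""
    with open(path) as f:
        state = json.load(f)
    return {frozenset(s) for s in state["failed"]}, state["models"]
```

Since interpretations only ever accumulate, such checkpoints can also be written incrementally during a long run.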
Hence, the algorithm can be easily suspended, if there is a more important task to be solved, and resumed later. Or, as in our implementation, the program can alternate the conjecture it tries to prove according to some criterion.

6.1.3 More efficient on long runs than conventional provers

As we have observed in our experiments, classical black-box provers usually become ineffective (and tend to consume a lot of memory) after a few minutes, when the number of inferred clauses becomes too large. The program does not suffer from this issue, as it only needs to remember the constructed interpretations7 and the set of failed attempts.

6.2 Disadvantages

6.2.1 A high number of constructed interpretations

A major disadvantage of the algorithm is that the number of constructed interpretations can become quite high, especially if there are many similar axioms and lemmas in the set of premisses. Although the program was able to handle even several tens of thousands of interpretations, we believe that reducing their number would make the method much more efficient. We suggest the following improvements:

1. Focus on constructing interpretations in which a high number of the premisses is true. The higher that number, the fewer remaining premisses can be used to avoid the interpretation. Thus, the algorithm would be led more directly towards finding the optimal set of premisses.

2. Try to construct interpretations in which several of the conjectures are false. A single interpretation can then be used to guide the search for all conjectures which are false in it.

6 This idea was first proposed by Jiří Vyskočil in a personal discussion.
7 In fact, it is not even necessary to remember the interpretations. It is sufficient to store the information of which premisses from AC are true in each interpretation, that is, just |M| · |AC| Boolean values.
We have already partially implemented these first two suggestions, as described in section 4.3.1.

3. If possible, analyze which interpretations have become unnecessary and remove them from the pool. As we have seen, the algorithm never constructs two isomorphic interpretations (Corollary 1), so every constructed interpretation carries a unique piece of knowledge. It will therefore be necessary to devise a more sophisticated technique for detecting which interpretations contribute only negligibly to the selection of premisses and thus can be discarded.

6.2.2 Limitation to finite interpretations

In general, the problem of constructing a model of a set of formulae is algorithmically undecidable. Any algorithm we construct will only be able to build some specific class of interpretations; in most cases, it will be the class of finite interpretations. And as many interesting theories (like arithmetic or set theory) have only infinite models, this may impose a strict restriction on the method.

However, the situation need not be as hopeless as it seems. Recall that we only construct interpretations of subsets of the given axioms. If we force the algorithm to avoid those axioms that force the interpretation to be infinite, we may still be able to achieve good results. This can easily be arranged by assigning high weights to the problematic axioms; however, we do not know how to identify such axioms automatically.

Another solution would be to construct and represent some class of infinite interpretations. For example, one could construct Herbrand interpretations or saturations, or more generally represent the interpretations as complex functions or programs that compute the validity of formulae.

6.3 Comparison with syntactic techniques

As it turns out, the algorithm falls short of the original expectation that it could be very effective when the number of premisses is very large (hundreds or more).
The reason is the high number of interpretations that need to be constructed to determine the optimal set of premisses in such cases. Syntactic approaches such as [Urb06a, Urb06b, RS98, MP06] seem to be much more efficient. However, the value of the algorithm emerges in cases where, even though the number of premisses is moderate, the prover is overwhelmed by unnecessary premisses. It seems quite possible that a combination of the semantic approach with syntactic techniques might be very efficient. This assumption is supported by the results of the SRASS system [SP07] in the MPTP Challenge [US06] (see also Section 5.2).

6.4 Further work

We have already suggested many possible areas of improvement in the previous section. We believe that considerable improvements could be attained by close cooperation with the designers of model finders and theorem provers to implement some of the following proposals:

1. Guide the model finder when constructing interpretations. We suggest that the model finder be given two sets of formulae, A and L. It would be required to find a model M |= A that is also a model of as many formulae from L as possible (for example, within some given time limit, domain size, etc.).

2. Reveal which premisses are well suited for a particular theorem prover or model finder. This information could be used, first, to minimize the number of cases where neither a proof nor a new interpretation is found, and second, to develop a better-founded, automated process of assigning weights to premisses.

The program would also benefit from a graphical user interface that would facilitate visual and interactive analysis of the process. Although the method is not yet mature and many aspects still need to be researched, we believe that it can bring significant benefit to the task of automated proving of theorems.
Not only can it allow proving conjectures that are hard for a conventional prover, but it also opens the possibility of further analyzing the relationships between the conjectures, the premisses and the interpretations that it constructs.

References

[BFdNT06] Peter Baumgartner, Alexander Fuchs, Hans de Nivelle, and Cesare Tinelli. Computing finite models by reduction to function-free clause logic. To appear, preliminary version, June 2006.

[Chv79] Vašek Chvátal. A greedy heuristic for the set-covering problem. Math. Oper. Res., 4:233–235, 1979.

[CS03] Koen Claessen and Niklas Sörensson. New techniques that improve MACE-style model finding. In Proc. of the Workshop on Model Computation (MODEL), 2003.

[CS07] Koen Claessen and Niklas Sörensson. Paradox – a first-order logic model finder. WWW pages, 2007. http://www.cs.chalmers.se/~koen/paradox/.

[FBT07] Alexander Fuchs, Peter Baumgartner, and Cesare Tinelli. Darwin – a theorem prover for the model evolution calculus. WWW pages, 2007. http://combination.cs.uiowa.edu/Darwin/.

[Fei98] Uriel Feige. A threshold of ln(n) for approximating set cover. J. ACM, 45(4):634–652, 1998.

[Kar72] Richard Manning Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.

[MP06] Jia Meng and Lawrence Paulson. Lightweight relevance filtering for machine-generated resolution. Volume 192 of CEUR Workshop Proceedings, 2006.

[Pud06a] Petr Pudlák. Search for faster and shorter proofs using machine generated lemmas. In G. Sutcliffe, R. Schmidt, and S. Schulz, editors, Proceedings of the FLoC'06 Workshop on Empirically Successful Computerized Reasoning, 3rd International Joint Conference on Automated Reasoning, volume 192 of CEUR Workshop Proceedings, pages 34–52, 2006.

[Pud06b] Petr Pudlák. Verification of Mathematical Proofs. PhD thesis, Charles University in Prague, Faculty of Mathematics and Physics, 2006.
Available online at http://lipa.ms.mff.cuni.cz/~pudlak/pp-thesis.pdf.

[RS98] W. Reif and G. Schellhorn. Theorem proving in large theories. In W. Bibel and P. Schmitt, editors, Automated Deduction—A Basis for Applications, volume III, 2. Kluwer Academic Publishers, Dordrecht, 1998.

[Sch02] S. Schulz. E – A brainiac theorem prover. Journal of AI Communications, 15(2-3):111–126, 2002.

[Sch07] Stephan Schulz. The E equational theorem prover. WWW pages, 2007. http://www.eprover.org/.

[SP07] Geoff Sutcliffe and Yury Puzis. SRASS – a semantic relevance axiom selection system, 2007. http://www.cs.miami.edu/~tptp/ATPSystems/SRASS/.

[SS98] G. Sutcliffe and C. B. Suttner. The TPTP Problem Library: CNF Release v1.2.1. Journal of Automated Reasoning, 21(2):177–203, 1998.

[Urb06a] Josef Urban. MizarMode – an integrated proof assistance tool for the Mizar way of formalizing mathematics. Journal of Applied Logic, 4(1):414–427, 2006.

[Urb06b] Josef Urban. MoMM – fast interreduction and retrieval in large libraries of formalized mathematics. International Journal on Artificial Intelligence Tools, 15(1):109–130, 2006.

[US06] Josef Urban and Geoff Sutcliffe. The MPTP $100 Challenges, 2006. http://www.cs.miami.edu/~tptp/MPTPChallenge/.