<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Eficient Implementation of Large-Scale Watchlists</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Constantin Ruhdorfer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephan Schulz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duale Hochschule Baden Württemberg Stuttgart</string-name>
        </contrib>
      </contrib-group>
      <fpage>120</fpage>
      <lpage>133</lpage>
      <abstract>
        <p>In this work, we explore techniques for improving the performance of the automated theorem proving system E when dealing with large watchlists. A watchlist can focus the proof search towards so-called hints, likely useful intermediate results provided externally. Recently, hints have been automatically extracted from previous proofs, creating massive watchlists and thus making evaluation of new clauses against the wachtlist a performance bottleneck. We introduce a new index for the frequent special case of unit clause hints, taking advantage of the fact that subsumption can be implemented much more eficiently for unit clauses than for the general case. We implement several strategies for exploiting the structure and properties of equational unit clauses. Additionally, we have added a new soft subsumption mechanism to E that can abstract away diferences of constant or Skolem symbols, efectively allowing a less precise match when evaluating a given clause against the watchlist. We have tested the new mechanisms on a large set of problems taken from the Mizar 40 project, using a large watchlist containing over 300 000 clauses. We show that the usage of the unit clause index significantly increases performance with this given watchlist. The use of soft subsumption shows more mixed results. We believe that most watchlists can take advantage of these techniques and have made them available to the user via E's command line interface.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Automated theorem proving</kwd>
        <kwd>Hints</kwd>
        <kwd>First order logic</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automated theorem provers (ATPs or ATP systems) are programs that accept a set of axioms and
a conjecture in a suitable logic, and then try to automatically derive a proof of the conjecture.
Many of the most successful theorem provers are based on first-order logic (with equality), an
expressive logic with unambiguous semantics for which relatively mature calculi exist. First
order logic is semi-decidable. In theory, proofs for valid conjectures can always be found, but
proof search for an invalid conjecture may not terminate. This means that an ATP has to search
for proofs in an infinite and highly branching search space. Thus, guiding this search is of
critical importance for the success of the system. For systems based on forward deduction,
the critical choice is which of the many possible intermediate steps should be taken next, i.e.
which new formulas should be deduced. This is usually based on simple syntactic criteria (as
e.g. described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). However, these heuristics are often insuficient to find complex proofs.
      </p>
      <p>
        One way to improve the proof search is via hints. Originally [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], hints are possible intermediate
lemmas provided by the user of the prover. If the prover finds such a lemma (or a more general
one), it can focus its search on this lemma. User hints can come from the user’s domain expertise
and intuition, or possibly from simplified settings. In recent years, we have utilized the same
mechanism with a very diferent source of hints, namely intermediate results contributing to
proofs of other theorems in the domain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The system can iteratively build a database of
results that are often useful in the domain. In contrast to manually provided hints, the number
of hints mined from existing proofs can often be extremely large, and hence evaluating new
formulas against the hint set can become quite expensive.
      </p>
      <p>
        In this work, we are interested in two aspects of the field at hand: Firstly, we want to improve
the eficiency of using large sets of hints in guiding a theorem prover (more concretely, the
equational theorem prover E [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]). Secondly, we explore a variety of notions about what
it means for a hint to match a new formula (or more specifically, clause), with the aim of
broadening the potential influence of a hint to also influence the selection of clauses that are
similar to the hint, not only those that are strictly more general.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminiaries</title>
      <p>First-order logic with equality We will assume two disjoint sets  and  where  is the
set of function symbols and  the set of variable symbols. Function symbols have an associated
arity which we will denote with  / for symbol  and arity  ∈ N. Constants are function
symbols with  = 0.</p>
      <p>We will typically use , ,  to denote constants, , , ℎ to denote function symbols and either
, ,  or X1, . . . , X to denote variables.</p>
      <p>The set of syntactically correct terms is denoted by Term(F , V ), where Term(F , V ) is the
smallest set that satisfies the following conditions:
1.  ∈ Term(F , V ) for all  ∈ 
2.  / ∈  , 1, . . . ,  ∈ Term(F , V ) implies  (1, . . . , ) ∈ Term(F , V )
An (equational) atom is an unordered pair of terms, written as  ≃ . Observe that we handle
the non-equational case as a special case where we encode non-equational atoms as equalities
with the reserved constant $true, e.g. () ≃ $true. We will typically write non-equational
literals in the conventional manner for convenience (e.g. ()). A literal is either an atom, or a
negated atom. We write a negative literal as  ̸≃  and define a negation operator on literals
as  ≃  =  ̸≃  and  ̸≃  =  ≃ . We use  ≃˙  if we do not want to specify the polarity of a
literal, or, in a less precise way, let , 1, 2, . . . stand for arbitrary literals. In this notation ≃ is
commutative.</p>
      <p>A clause is a multiset of literals {1, 2, ..., }, usually written and always interpreted as a
disjunction 1 ∨ 2 ∨ ... ∨ . A unit clause is a clause containing only one literal. We denote the
set of all clauses as Clauses(F , V ) and the empty clause as □ .</p>
      <p>A substitution is a mapping  :  → Term(F , V ) with the property that Dom( ) = { ∈
 |  () ̸= } is finite. This mapping can be extended to terms, atoms, literals and clauses in
the obvious way. If  is a substitution, we call  (),  (),  () instances of , , or .</p>
      <p>Similarly, a match from a term (atom, literal, clause)  to another  is a substitution  such
that  () ≡ , where ≡ is the syntactic identity.</p>
      <p>In most theorem provers for classical first-order logic, proofs are found via contradiction. In
other words proof search tries to establish if a given set of clauses is unsatisfiable . For generating
calculi, new clauses are deduced via a set of inference rules that take one or more (most often
two) clauses as premises, and generate a new clause entailed by these premises. If this process
eventually derives the empty clause, unsatisfiability has been established (the empty clause is
inherently unsatisfiable, and so is any set of clauses that entails it).</p>
      <p>Subsumption is a syntactic relation between two clauses. A clause  subsumes another clause
, if one of its instances is a multi-subset of the the other, i.e. if  () ⊆ . A subsuming
clause is more general than the subsumed clause, i.e. the subsumed clause is entailed by the
subsuming clause (but not, in general, the other way round). Subsumption plays a double role in
this work. On the one hand, in most calculi we can ignore subsumed clauses, and subsumption
(the removal of subsumed clauses from the proof search) is a major and important optimisation
technique. On the other hand, if a clause subsumed another, it is considered “at least as good”
as the first one. The original notion of a clause matching a hint is based on subsumption. We do
not require a clause to be identical to a hint to prefer it, but we also prefer clauses that subsume
a hint (but note that we further generalise this relation later).</p>
      <p>Positions in a term A potential position  ∈ N* in a term is defined as a sequence over
natural numbers. The empty position is denoted with the special symbol  .</p>
      <p>The set of positions in a term  is denoted with pos() and defined recursively by case
distinction: If  only consists of a variable symbol  ∈  then pos() =  . If on the other hand
 ≡  (1, ..., ) then pos() = { } ∪ {. | 1 ≤  ≤ ,  ∈ pos()} with  ∈ N* . A position
 ∈ pos() of a term  can be used to refer to the subterm of  at . To be more exact: if  =  ,
then | = . Otherwise,  ≡ .′ and  ≡  (1, . . . , ). In that case,  = |′ . The top symbol
of  ∈  is top() =  and the top symbol of  (1, . . . , ) is top( (1, . . . , )) =  .</p>
      <sec id="sec-2-1">
        <title>2.1. Proof search</title>
        <p>
          E is a saturating theorem prover based on the superposition calculus [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. To prove a conjecture,
axioms and the negated conjecture are converted to clause normal form, resulting in a set of
clauses that is unsatisfiable if and only if the conjecture holds. The proof state thus is a set
of clauses, and the proof search is realised by saturating this set of clauses by adding logical
consequences that can be deduced from existing clauses by application of a number of inference
rules. If this process generates the empty clause as an explicit witness of unsatisfiability, the
proof has been concluded.
        </p>
        <p>In practice, this proof search is realised via the given-clause algorithm. The proof state is
represented by two disjoint sets of clauses, the set  of unprocessed clauses, and the set  of
processed clauses. The algorithm repeatedly picks a clause  from  , computes all possible
consequences between this given clause and all clauses in  , and adds them to  . It then adds
 (the given clause) to  . This maintains the invariant that all direct consequences of clauses
in  have been computed. In addition to these generating inferences, the algorithm can use
simplifying rules to replace clauses by simpler clauses, or to delete redundant clauses.</p>
        <p>
          The most critical choice point for the given-clause algorithm is the selection of the given
clause for each iteration of the main loop. This is traditionally controlled by heuristic evaluations,
based on symbol counting (smaller clauses are preferred), clause age (older clauses are preferred),
and various combinations and refinements of these measures (compare [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Watchlist</title>
        <p>
          Large parts of this work focus around the watchlist technique which was originally developed
by Robert Verof who named it the hint strategy [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The strategy was developed for guiding
ATP programs in their proof search by comparing newly generated clauses against a list of hints.
Such a list of hints is user-provided and usually contains lemmas, facts or otherwise clauses the
user suspects might be relevant to the given problem. This technique was first implemented
into Otter [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          In the E ATP system the watchlist mechanism is implemented two-fold as a dynamic and
a static variant [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Regardless of the variant used the list is loaded on start-up and stored
as a Clause-Set where it is simplified like processed clauses. A Clause-Set is a internal data
structure in E that stores clauses using a doubly linked list and provides access to its members
via various indices. Every newly generated or processed clause is compared against the watchlist
by checking whether or not the new clause matches one or more clauses in the watchlist. If it
does it is prioritized for processing.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Indexing techniques</title>
        <p>
          One of the most important factors when it comes to the performance of ATP systems is eficient
indexing. Indexing helps to avoid, or at least reduce, time spent on sequential search within
large sets of clauses or terms. E has included several diferent indices for a while, including
(perfect) discrimination tree indexing [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], feature vector indexing [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and fingerprint indexing
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          E has been using feature vector indexing for non-unit subsumption [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], and indexes only
the processed set of clauses  . Feature vector indexing is particularly suitable for indexing
relatively large multi-literal clauses, since it handles the complexity of equation- and literal
permutation by using features that are invariant under these permutations. This is a major
advantage compared to other approaches of handling subsumption via indices. It does not,
however, come into play for relatively small unit clauses, for which better indices exist. One
of which is fingerprint indexing which is a technique that samples positions in terms for its
indexing representation. We can adjust these sampled positions in a way that takes advantage
of the fact that every unit clause exactly consist out two terms (more on that later). First, we
will introduce fingerprint indexing for this purpose.
        </p>
        <p>
          Fingerprint Indexing A fingerprint index [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is as trie over fingerprints fp of terms. The
general fingerprint feature function gfpf : Term(F , V ) × N* →  ′ where  ′ =  ⊎ {A, B, N}
is defined by case distinction:
gfpf(, ) =
⎧
⎪A
⎪
⎪
⎪⎨top(|)
⎪B
⎪
⎪
⎪⎩N
if  ∈ pos(), | ∈ 
if  ∈ pos(), | ̸∈ 
if  = .,  ∈ pos() and | ∈  for some 
otherwise
Here top() is  if  ∈  and  if  ≡  (1, ..., ). Given that the fingerprint feature function
is a function fpf :  (,  ) →  ′ and is defined by fpf() = gfpf(, ) for a fixed  ∈ N* .
Lastly the fingerprint function is defined by fp : Term(F , V ) → ( ′) for a fixed  ∈ N. A
ifngerprint is a vector of elements of  ′ and is calculated by fp() for a given term .
        </p>
        <p>For an arbitrary fpf and two terms  and  assume two values  = fpf() and  = fpf(). An
overview for the compatibility of unification and matching from  onto , given  and , is
presented in Figure 1.</p>
        <sec id="sec-2-3-1">
          <title>Unification</title>
          <p>1 2 A
1
2
A
B
N</p>
          <p>Y
N
Y
Y
N</p>
          <p>N
Y
Y
Y
N</p>
          <p>Y
Y
Y
Y
N</p>
          <p>B
Y
Y
Y
Y
Y</p>
          <p>N
N
N
N
Y
Y</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Matching</title>
          <p>1 2 A
1
2
A
B
N</p>
          <p>Y
N
Y
Y
N</p>
          <p>N
Y
Y
Y
N</p>
          <p>N
N
Y
Y
N</p>
          <p>B
N
N
N
Y
N</p>
          <p>N
N
N
N
Y
Y</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation</title>
      <p>
        When we originally introduced the watchlist feature, we expected to work with fairly small
watchlists, and decided to use feature vector indexing for all hint matching. However, watchlists
now contain several hundred thousand clauses, and evaluating new clauses against the watchlist
has become a major bottleneck. To reduce this bottleneck, we have split the watchlist index into
a pair of unit and non-unit clause indices, to decrease access times by storing fewer clauses in
either by using the most appropriate indexing technique for either set. For a similar idea see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Unit clause index We have implemented a new unit clause index in E based on fingerprint
indexing. Since this is a technique for term and not clause indexing we exploit the structure of
unit clauses to generate an indexing representation for the given unit clause. We use the fact
that all unit clauses are of the form {lterm ≃˙ rterm} to construct a new term, represented by
a $ cell of the form ≃˙ (lterm, rterm), over which we calculate the fingerprint. We do that
by alternating between lterm and rterm for sampling positions. Since all indexed terms start
with one kind of equality symbol (e.g. ≃ or ̸≃) we can skip it when constructing the fingerprint
for a term. The  position is therefore never sampled.
      </p>
      <p>Clauses that are not orientable are inserted twice into the index since changing the orientation
of a clause also changes its fingerprint. In the worst-case this would lead to the size of index
doubling. We therefore checked old runs of E and found that around 15% of clauses were
not orientable and would therefore be inserted twice. Although this is a considerate increase
we guess that this would have a negligible impact on performance while designing this data
structure. Inserted clauses are simply stored as a pointer in the leaf of the fingerprint trie using
a splay tree.</p>
      <p>We present an example index in Figure 2 which assumes an example fingerprint function
FPW4 that samples at (1, 2, 1.1, 2.1) and  = { /2, /1, /0, /0, /0}.</p>
      <p>g
f
g
f
f
A
c
g,f
g,A
f,c</p>
      <p>A
A
a
g,f,A
g,A,A
f,c,a
b
A
N
N
g,f,A,b</p>
      <p>(X1) ≃  (, )
g,f,A,A</p>
      <p>(X1) ≃  (X2, )
g,A,A,N
f,c,a,N
(X1) ≃ X2
(X1) ≃ X1
 () ≃</p>
      <p>We have implemented several fingerprint functions to cover a wide variety of needs. We
started with functions that assume full equality in the terms they sample which means that
they sample both sides equally: NoIdx (no unit clause index), FPW2, FPW4, FPW6, FPW8 and
FPW10 (see Table 1 for details). Although E is an equality based ATP system not all problems are
purely equational or equational at all. The same also applies to watchlists. E already categorizes
problems based on whether they are non-equational  , somewhat equational  and purely
equational  . We use the same mechanism to classify the degree of equality in the given
watchlists.</p>
      <p>Based on that we alter the strategy used to sample the positions. This is since with increasingly
less equational watchlist the right side of a term is more likely to simply be $ and there is no
useful information to be sampled. To address this we also introduced a left only (marked with
"L", e.g. FPW2L) and a left leaning (marked with "LL", e.g. FPW2LL) version of each fingerprint
function. The left only version will skip position  and continue to only sample positions on
the left side, e.g. FPW2L samples at 1, 1.1 and FPW6L at 1, 1.1, 1.1, 1.1.1, 1.2.1, 1.1.2. The left
leaning version will sample roughly between 2/3 and 3/4 of the positions from the left side,
depending on the size of the fingerprint function. Since FPW2 samples so few positions FPW2LL
is the same function as FPW2L. For an overview over all strategies please consult Table 1.</p>
      <p>Strategy name</p>
      <sec id="sec-3-1">
        <title>Positions sampled NoIdx FPW2 FPW2L</title>
        <p>FPW2Flex
FPW4
FPW4L
FPW4LL
FPW6
FPW6L
FPW6LL
FPW6Flex
FPW8
FPW8L
FPW8LL
FPW10
FPW10L
FPW10LL</p>
        <p>While the left leaning version surely is more useful for somewhat equational clause sets,
using the strategy will still result in many sampled positions that are non-existent and therefore
not useful when it comes to matching. We therefore propose yet another strategy to be used
for partly equational clause sets which we will denote by "*Flex" (e.g. FPW2Flex). A flex type
strategy is one where we first classify the input based on whether or not the right side of the
term is $. We then use an L type sampling method on the term if it is or a balanced one
if it is not. On the one hand this allows us to better exploit the structure of the given term
while on the other hand this will result in the index returning some terms that are not actually
a match if they had been sampled with the same fingerprint function. This is since now their
ifngerprints might be sampled from diferent positions. While that seems troublesome at first
this is not really an issue for two reasons: (i) It is unlikely that this will afect many terms
since one unmatchable symbol in the fingerprint will already reject the match and (ii) since
ifngerprint indexing is an non-perfect indexing method to begin with we need to check whether
the given clause subsumes the every returned clause anyway. In the worst case we will need to
check slightly more results for subsumption.</p>
        <p>We implemented all these options into E and made them available through Es domain-specific
language (DSL). On top of that we also implemented an automatic mode (available as "auto")
that maps a  watchlist to an "LL" type function, an  watchlist to an "L" type function and
a  watchlist to a normal strategy. As a basis for that we used "FPW6" since we expect it to
perform well across many diferent watchlists.</p>
        <p>Clause abstraction We have also implemented a clause abstraction mechanism in E for
the watchlist feature. The mechanism supports two modes of operation: One abstracting
constants and one abstracting skolem symbols. If turned on our implementation will rewrite all
clauses that are inserted into or checked against the watchlist to adhere to the abstraction. This
efectively allows for less precise matches against the watchlist.</p>
        <p>If constants are to be abstracted all constants are rewritten to the first constant met
during the proof for an untyped problem and for a typed one to the first constant met with
the appropriate sort. That is given a clause 1 = { (, ) ≃˙ ( 1), ( 1) ≃˙ ()},  =
{ /2, /1, /0, /0, /0} and the first met constant  we will rewrite the clause 1 to ′1 =
{ (, ) ≃˙ ( 1), ( 1) ≃˙ ()} assuming , ,  share the same sort.</p>
        <p>The mechanism works similarly for abstracting skolem symbols where we rewrite them to
the first met skolem symbol with the same type. We have also made this an available option to
turn on through Es DSL.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>We tested on Intel Xeon E5-2698 v3 CPUs at 2.30 GHz using the Linux 3.19.0-25-generic kernel
in 64-bit mode. All tests were run with a time limit of 720 seconds and a limit of 10.000 generated
clauses. For orchestrating the experiments we used the ATPy library1.</p>
      <sec id="sec-4-1">
        <title>4.1. Unit clause index</title>
        <p>
          For testing we used a strategy2 provided by the automated reasoning group at Czech Technical
University in Prague who also provided a watchlist based on previous runs of the system. The
watchlist contained 367.408 clauses of which 153.997 are unit. The strategy was run against a
tenth of the Mizar 40 project [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] amounting to 5787 problems. We have made all of this data
and the E version used available at http://eprover.eu/E-eu/SoftWatch.html.
        </p>
        <p>Table 2 shows diferent versions of the index and their performance. Observe that we compare
our implementation against a version of E that only uses feature vector indexing as an indexing
technique for the watchlist. This version is refereed to as “Conventional” or “NoIdx” (no index)
since it is missing the unit clause index. To measure performance we observe the runtime E
given the set of problems. We chose to compare runtimes since the proof search for any given
problem nearly always stays the same between diferent indexing strategies. To verify that
1Written by Jan Jakubův; Online accessible at https://github.com/ai4reason/pyprove
2The exact options given were: − definitional − cnf=24 − split− aggressive − simul− paramod −
forward− context− sr − destructive− er− aggressive − destructive− er − prefer− initial− clauses − tKBO6 −
winvfreqrank − c1 − Ginvfreq − F1 − delete− bad− limit=150000000 − WSelectMaxLComplexAvoidPosPred − H
’(1∗ConjectureTermPrefixWeight(PreferProcessed,1,3,0.1,5,0,0.1,1,4),1 ∗ConjectureTermPrefixWeight(
PreferWatchlist,1,3,0.5,100,0,0.2,0.2,4),1∗Refinedweight(PreferWatchlist ,4,300,4,4,0.7) ,1 ∗ RelevanceLevelWeight2(
PreferWatchlist ,0,1,2,1,1,1,200,200,2.5,9999.9,9999.9) ,1∗ StaggeredWeight( PreferWatchlist ,1) ,1∗
SymbolTypeweight(PreferWatchlist ,18,7,− 2,5,9999.9,2,1.5) ,2∗ Clauseweight( PreferWatchlist ,20,9999,4) ,2∗
ConjectureSymbolWeight(PreferWatchlist ,9999,20,50,− 1,50,3,3,0.5) ,2∗ StaggeredWeight( PreferWatchlist ,2) ) ’
− free− numbers</p>
        <sec id="sec-4-1-1">
          <title>All runs</title>
          <p>the runs indeed stay the same we compared a random subsample of proof searches. In our
comparisons we will diferentiate between the runtime for all problems and only those that
were deemed successful 3.</p>
          <p>
            We expected to find similar results compared with the original fingerprint paper [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. Meaning
that we expected to find that a fingerprint size of 6 is a good balance of trie depth and clause
distribution. Although comparing FPW6 with the no indexed version yields an improvement of
16.35% for all runs and 19.65% for all runs that were successful we find that other strategies were
even more successful. Surprisingly FPW2L yielded the best performance performing 28.86% for
all runs and 44.87% for all successful runs.
          </p>
          <p>While this index increases performance on average Figure 3 shows that the actual performance
is dependent on the problem itself. Please note the figures’ logarithmic scale. Notice that most
strategies perform very similar. This very likely is an efect of testing on the same watchlist
where all strategies perform well, but one can exploit some inherent structure of the watchlist
better. The “auto” strategy is not listed since for the watchlist tested it would evaluate to the
performance of FPW6LL.</p>
          <p>Lastly, we were interested in examining wether the diferent strategies solve the same
problems or instead if they are proving diferent problems. We found that they overwhelmingly
do. That is to say the intersection of problems solved by all strategies, which is a very limiting
factor, is 1668 problems big (includes the baseline). Given that most strategies solve around
1680 problems this means that the overlap of solved problems is 99%. We compiled the runtimes
3Runs with exit status "Theorem" or "CounterSatisfiable"
(a) FPW2L runtime comparison (in seconds).
(b) FPW6 runtime comparison (in seconds).
on these 1668 problems in Table 3.</p>
          <p>The problems solved by all strategies are very likely the easier problems in the whole set. We
therefore might expect the performance timings to be better than the the runtimes observed in
Table 2 but this is not the case.
Index
FVI (baseline)
FPW2
FPW2L
FPW4
FPW4L
FPW4LL
FPW6
FPW6L
FPW6LL
FPW8
FPW8L
FPW8LL
FPW10
FPW10L
FPW10LL</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Validating the use of fingerprint indexing</title>
        <p>One central claim of this paper is that exploiting the structure of unit clauses leads to better
performance compared to a standard feature vector index. This claim can easily be verified
by comparing the performance of E when either indexing technique is only filled with unit
clauses. This can be achieved by using a watchlist that only contains unit clauses. We do that
by removing all non-unit clauses from the watchlist described above. We did not alter the set of
problems.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Clause Abstraction</title>
        <p>We used a similar test setup for determining performance of the clause abstraction feature.
We have used the previous options for the prover (as stated in footnote 2) together with
either adding the flag − watchlist− clause− abstraction=constant or − watchlist− clause−
abstraction=skolem respectively. The results are shown in Table 5.</p>
        <p>We find that the performance varies from problem to problem. This is especially true when
(a) All runs (in seconds).</p>
        <p>(b) Successful runs (in seconds).
abstracting constant symbols where some problems started to run out of time. Compare this to
just one problem for all other strategies tested (see Figure 3). This efect is clearly visible in the
scatter plot Figure 5a.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work</title>
      <p>We have identified at least two more interesting areas of study. Firstly instead of rewriting a
skolem symbol to one of the same arity, we would also be interested in rewriting complete
skolem terms to a constant. Secondly, we are also interested in generalizing the idea of splitting
the watchlist indices into even more smaller ones to increase performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we have presented a special unit clause index for the watchlist feature based on
ifngerprint indexing. We explored the performance for several strategies with that index given
a large watchlist of 300 000 clauses and showed that the index largely increases performance
(a) Constant abstraction (in seconds).</p>
      <p>(b) Skolem symbol abstraction (in seconds).
compared to a version without the index. We conclude that the performance of the index is
dependent on the watchlist, its structure and the strategy used. We believe that most watchlists
can benefit from this index.</p>
      <p>We have also introduced some mechanism to E that allow for less precise matches on the
watchlist. While that showed more mixed results in terms of performance it is an interesting
topic that would benefit from additional exploration.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Special thanks to the Automated Reasoning Group at Czech Technical University in Prague for
providing the watchlist, the problem files and the experimental environment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Möhrmann</surname>
          </string-name>
          ,
          <article-title>Performance of clause selection heuristics for saturation-based theorem proving</article-title>
          , in: N.
          <string-name>
            <surname>Olivetti</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Tiwari (Eds.),
          <source>Proc. of the 8th IJCAR, Coimbra</source>
          , volume
          <volume>9706</volume>
          <source>of LNAI</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>330</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Verof</surname>
          </string-name>
          ,
          <article-title>Using hints to increase the efectiveness of an automated reasoning program: Case studies</article-title>
          ,
          <source>Journal of Automated Reasoning</source>
          <volume>16</volume>
          (
          <year>1996</year>
          )
          <fpage>223</fpage>
          -
          <lpage>239</lpage>
          . URL: https://doi.org/10. 1007/BF00252178. doi:
          <volume>10</volume>
          .1007/BF00252178.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Goertzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jakubuv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urban</surname>
          </string-name>
          , Enigmawatch: Proofwatch meets
          <string-name>
            <surname>ENIGMA</surname>
          </string-name>
          , CoRR abs/
          <year>1905</year>
          .09565 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/
          <year>1905</year>
          .09565.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E - A Brainiac</given-names>
            <surname>Theorem</surname>
          </string-name>
          <string-name>
            <surname>Prover</surname>
          </string-name>
          ,
          <source>Journal of AI Communications</source>
          <volume>15</volume>
          (
          <year>2002</year>
          )
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cruanes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vukmirović</surname>
          </string-name>
          , Faster, higher,
          <source>stronger: E 2</source>
          .3, in: P. Fontaine (Ed.),
          <source>Proc. of the 27th CADE</source>
          , Natal, Brasil, number 11716 in LNAI, Springer,
          <year>2019</year>
          , pp.
          <fpage>495</fpage>
          -
          <lpage>507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bachmair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ganzinger</surname>
          </string-name>
          ,
          <article-title>Rewrite-Based Equational Theorem Proving with Selection and Simplification</article-title>
          ,
          <source>Journal of Logic and Computation</source>
          <volume>3</volume>
          (
          <year>1994</year>
          )
          <fpage>217</fpage>
          -
          <lpage>247</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Möhrmann</surname>
          </string-name>
          ,
          <article-title>Performance of clause selection heuristics for saturation-based theorem proving</article-title>
          ,
          <source>in: Proceedings of the 8th International Joint Conference on Automated Reasoning - Volume 9706</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2016</year>
          , p.
          <fpage>330</fpage>
          -
          <lpage>345</lpage>
          . doi:
          <volume>10</volume>
          . 1007/978-3-
          <fpage>319</fpage>
          -40229-1_
          <fpage>23</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>McCune</surname>
          </string-name>
          ,
          <source>Otter</source>
          <volume>2</volume>
          .0, in: M. E. Stickel (Ed.),
          <source>10th International Conference on Automated Deduction</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>1990</year>
          , pp.
          <fpage>663</fpage>
          -
          <lpage>664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Goertzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jakubův</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Urban,</surname>
          </string-name>
          <article-title>ProofWatch: Watchlist guidance for large theories in E</article-title>
          , in: J.
          <string-name>
            <surname>Avigad</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Mahboubi (Eds.),
          <source>Interactive Theorem Proving: 9th International Conference</source>
          , Oxford, UK, Springer,
          <year>2018</year>
          , pp.
          <fpage>270</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>McCune</surname>
          </string-name>
          ,
          <article-title>Experiments with discrimination-tree indexing and path indexing for term retrieval</article-title>
          ,
          <source>J. Autom. Reason</source>
          .
          <volume>9</volume>
          (
          <year>1992</year>
          )
          <fpage>147</fpage>
          -
          <lpage>167</lpage>
          . URL: https://doi.org/10.1007/BF00245458. doi:
          <volume>10</volume>
          .1007/BF00245458.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <article-title>Simple and Eficient Clause Subsumption with Feature Vector Indexing</article-title>
          , in: M. P. Bonacina, M. E. Stickel (Eds.),
          <source>Automated Reasoning and Mathematics: Essays in Memory of William W. McCune</source>
          , volume
          <volume>7788</volume>
          <source>of LNAI</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <article-title>Fingerprint Indexing for Paramodulation and Rewriting</article-title>
          , in: B.
          <string-name>
            <surname>Gramlich</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Miller</surname>
          </string-name>
          (Eds.),
          <source>Proc. of the 6th IJCAR, Manchester</source>
          , volume
          <volume>7364</volume>
          <source>of LNAI</source>
          , Springer,
          <year>2012</year>
          , pp.
          <fpage>477</fpage>
          -
          <lpage>483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kaliszyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urban</surname>
          </string-name>
          , Mizar 40 for mizar 40, CoRR abs/1310.2805 (
          <year>2013</year>
          ). URL: http: //arxiv.org/abs/1310.2805.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>