Learning Precedences from Simple Symbol Features

Filip Bártek^{a,b}, Martin Suda^{a}

^a Czech Technical University in Prague – Czech Institute of Informatics, Robotics and Cybernetics. Jugoslávských partyzánů 1580/3, 160 00 Praha 6 – Dejvice, Czech Republic
^b Czech Technical University in Prague – Faculty of Electrical Engineering. Technická 2, 166 27 Praha 6 – Dejvice, Czech Republic

Abstract
A simplification ordering, typically specified by a symbol precedence, is one of the key parameters of the superposition calculus, contributing to shaping the search space navigated by a saturation-based automated theorem prover. Thus the choice of a precedence can have a great impact on the prover's performance. In this work, we design a system for proposing symbol precedences that should lead to solving a problem quickly. The system relies on machine learning to extract this information from past successful and unsuccessful runs of a theorem prover over a set of problems and randomly sampled precedences. It uses a small set of simple human-engineered symbol features as the sole basis for discriminating the symbols. This allows for a direct comparison with precedence generation schemes designed by prover developers.

Keywords
saturation-based theorem proving, simplification orderings, symbol precedences, machine learning

1. Introduction

Modern saturation-based automated theorem provers (ATPs) such as E [1], SPASS [2] or Vampire [3] use the superposition calculus [4] as their underlying inference system. Superposition is built around the paramodulation inference [5], crucially constrained by a simplification ordering on terms and literals, which is supplied as a parameter of the calculus.
Both of the two main classes of simplification orderings used in practice, i.e., the Knuth-Bendix Ordering [6] and the Lexicographic Path Ordering [7], are mainly determined by a symbol precedence, a (partial) ordering on the signature symbols.1 While the superposition calculus is known [9] to be refutationally complete for any simplification ordering, the choice of the precedence may have a significant impact on how long it takes to solve a given problem. In a well-known example, prioritizing in the precedence the predicates introduced during the Tseitin transformation of an input formula [10] exposes the corresponding literals to resolution inferences during early stages of the proof search, with the effect of essentially undoing the transformation and thus threatening the exponential blow-up that the transformation is designed to prevent [11]. ATPs typically offer a few heuristic schemes for generating the symbol precedences. For example, the successful invfreq scheme in E [12] orders the symbols by the number of occurrences in the input problem, prioritizing symbols that occur the least often for early inferences.

1 KBO is further parameterized by symbol weights, but our reference implementation in Vampire [3] uses for efficiency reasons only weights equal to one [8], and so we do not consider this parameter here.

PAAR 2020: Seventh Workshop on Practical Aspects of Automated Reasoning, June 29–30, 2020, Paris, France (virtual)
filip.bartek@cvut.cz (F. Bártek); martin.suda@cvut.cz (M. Suda)
0000-0002-1822-2651 (F. Bártek); 0000-0003-0989-5800 (M. Suda)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Filip Bártek et al. CEUR Workshop Proceedings 21–33
Experiments with random precedences have shown that the existing schemes often fail to come close to the optimum precedence [13], revealing a large potential for further improvements.

In this work, we design a system that, when presented with a First-Order Logic (FOL) problem, proposes a symbol precedence that will likely lead to solving the problem quickly. The system relies on the techniques of supervised machine learning and extracts such theorem-proving knowledge from successful (and unsuccessful) runs of the Vampire theorem prover [3] when run over a variety of FOL problems equipped with randomly sampled symbol precedences. We assume that by learning to solve already solvable problems quickly, the acquired knowledge will generalize and help solve problems previously out of reach. As a first step in a more ambitious project, we focus here on representing the symbols in a problem by a fixed set of simple human-engineered features (such as the number of occurrences used by the invfreq scheme mentioned above)2 and, to simplify the experimental setup, we restrict our attention to learning precedences for predicate symbols only.3

Learning to predict good precedences poses several interesting challenges that we address in this work. First, it is not immediately clear how to characterize a precedence, a permutation of a finite set of symbols, by a real-valued feature vector to serve as an input for a learning algorithm. Additionally, to be able to generalize across problems, we need to do so in a way that does not presuppose a fixed signature size. A further complication when sampling different problems is that some problems may be easy to solve under almost every precedence while others are hard; in theorem proving, running times typically vary considerably.
Finally, even with a regression model ready to predict the prover's performance under a particular precedence 𝜋, we still need to solve the task of finding an optimum precedence 𝜋* according to this model, which cannot simply be solved by enumerating all the permutations and running the prediction for each, due to their huge number.

Our way of addressing the challenges sketched above lies in using pairwise symbol preferences to characterize a precedence, in normalizing the target prover run times on a per-problem basis, and in the use of "second-order" learning of the preferences for symbols abstracted by their features. These concepts are introduced in Section 3 and later formalized in Section 4. Section 5 presents the results of our experimental evaluation of the proposed technique over the TPTP [14] benchmark. We start our exposition by fixing the notation and basic concepts in Section 2.

2. Preliminaries

We assume that the reader is familiar with basic concepts used in first-order logic (FOL) theorem proving. We use this section to recall and formalize the key notions relevant for our work.

Problem A (first-order) problem is a pair 𝑃 = (Σ, Cl), where Σ = (𝑠1, 𝑠2, . . . , 𝑠𝑛) is a list of (predicate and function) symbols called the signature, and Cl is a set of first-order clauses built over the symbols of Σ. The problem is either given directly by the user or could be the result of clausifying a general FOL formula 𝜙, in which case we know which of the symbols were introduced during the clausification (namely during Tseitin transformation and skolemization; see e.g. Nonnengart and Weidenbach [15]) and which occurred in the conjecture (if it was present).

2 Automatic feature extraction using neural networks is planned for future work.
3 Our theoretical considerations, however, apply equally to learning function symbol precedences.

Precedence Given a problem 𝑃 = (Σ, Cl) with Σ = (𝑠1 , 𝑠2 , . . .
, 𝑠𝑛 ), a precedence 𝜋𝑃 is a permutation, i.e. a bijective mapping, of the set of indices {1, . . . , 𝑛}. A precedence 𝜋𝑃 determines a (total) ordering on Σ as follows: 𝑠𝜋𝑃(1) < 𝑠𝜋𝑃(2) < . . . < 𝑠𝜋𝑃(𝑛).

Simplification Orderings are orderings on terms used to parameterize the superposition calculus [4] employed by modern saturation-based theorem provers. The two classes of simplification orderings most commonly used in practice, the Knuth-Bendix Orderings [6] and the Lexicographic Path Orderings [7], are both defined in terms of a user-supplied (possibly partial) ordering < on the given problem's signature Σ. In this work, we assume that the theorem prover uses a simplification ordering from one of these two classes, relying on the ordering on Σ determined by a precedence 𝜋𝑃 to construct such a simplification ordering.

Performance measure A saturation-based ATP solves a problem 𝑃 = (Σ, Cl) (under a particular fixed strategy and a determined symbol precedence 𝜋𝑃) by either

• deriving from Cl a contradiction in the form of the empty clause, in which case 𝑃 is shown unsatisfiable, or
• finitely saturating the set of clauses Cl without deriving the contradiction, in which case 𝑃 is shown satisfiable.4

In both cases, we take the number of iterations of the employed saturation algorithm (see, e.g., Riazanov and Voronkov [16] for an overview) as a measure of the effort that the ATP took to solve the problem. We refer to this measure as the abstract solving time and denote it ast(𝑃, 𝜋𝑃).5 In practice, an ATP can also run out of resources, typically out of the allocated time. In that case, the abstract solving time is undefined: ast(𝑃, 𝜋𝑃) = ⊥. While it may happen that running an ATP with the same problem and symbol precedence two times yields a different result each time (namely succeeding one time and failing another time), such cases are rare and we ensure they do not interfere with the learning process by caching the results.
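As a minimal illustration of how a precedence induces the ordering on Σ (a toy sketch with made-up symbol names, using 0-based indices in place of the paper's 1-based ones):

```python
# A precedence pi over indices {0, ..., n-1} induces a total order on the
# signature: sigma[pi[0]] < sigma[pi[1]] < ... (smallest symbol first).

sigma = ["f", "g", "h", "p"]      # toy signature
pi = [2, 0, 3, 1]                 # a permutation of the indices

ordered = [sigma[i] for i in pi]  # symbols listed from smallest to largest
print(ordered)                    # ['h', 'f', 'p', 'g']
```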
Order matrix Given a permutation 𝜋 of the set of indices {1, . . . , 𝑛}, the order matrix O(𝜋) is a binary matrix of size 𝑛 × 𝑛 defined in the following manner:

O(𝜋)𝑖,𝑗 = [[𝜋⁻¹(𝑖) < 𝜋⁻¹(𝑗)]],

where we use [[𝑃]] to denote the Iverson bracket [17] applied to a proposition 𝑃, evaluating to 1 if 𝑃 is true, and 0 otherwise. In other words, for a symbol precedence 𝜋𝑃, O(𝜋𝑃)𝑖,𝑗 = 1 if the precedence 𝜋𝑃 orders the symbol 𝑠𝑖 before the symbol 𝑠𝑗, and O(𝜋𝑃)𝑖,𝑗 = 0 otherwise.

Flattened matrix Given a matrix 𝑀 of size 𝑛 × 𝑛, 𝑀⃗ is the vector of length 𝑛² obtained by flattening 𝑀:

𝑀⃗(𝑖−1)𝑛+𝑗 = 𝑀𝑖,𝑗 for every 𝑖, 𝑗 ∈ {1, . . . , 𝑛}.

For our use, the exact way of mapping the matrix elements to the vector indices is not important. We mostly just need a vector representation of the data contained in the given matrix to have access to the dot product operation.

Linear regression is an approach to modeling the relationship between scalar target values 𝑦𝑖 ∈ R and one or more input variables x𝑖 = (𝑥𝑖,1, . . . , 𝑥𝑖,𝑘), 𝑖 = 1, . . . , 𝑛. The relationship is modeled using a linear predictor function ŷ𝑖 = x𝑖 · w + 𝑏, whose unknown model parameters w ∈ R^𝑘 and 𝑏 ∈ R are estimated from the data. We call the vector w the coefficients of the model and 𝑏 the intercept. Most commonly, the parameters are picked to minimize the so-called mean squared error

MSE = (1/𝑛) Σ_{𝑖=1}^{𝑛} (ŷ𝑖 − 𝑦𝑖)²,

but other norms are also possible [18].

4 We assume a refutationally complete calculus and saturation strategy.
5 The advantage of using abstract solving time is that it does not depend on the hardware used for the computation.

Basic assumptions For the discussion that follows, we assume a fixed ATP that uses the superposition calculus with a simplification ordering parameterized by a symbol precedence.
While the practical experiments described in Section 5 use the ATP Vampire [3], the model architecture does not assume a particular ATP and is compatible with any superposition-based ATP such as E [1], SPASS [2] or Vampire. Within the prover, a particular saturation strategy is fixed, including a time limit.

3. General considerations and overview

The aim of this work is to design a system that learns to suggest good symbol precedences to an ATP from observations of the ATP's performance on a class 𝒫train of problems with randomly sampled precedences. Given a problem 𝑃 = (Σ, Cl) with |Σ| = 𝑛, we consider a precedence 𝜋𝑃 good if it leads to a low ast(𝑃, 𝜋𝑃) among the 𝑛! possible precedences for 𝑃. Note that for problems with a signature with more than a few symbols, repeatedly running the prover with random precedences represents an effectively infinite source of training data.

Ideally, we would like to learn general theorem-proving knowledge, not too dependent on 𝒫train, which could later be explained and compared to precedence generation schemes manually designed by the prover developers. Let us quickly recall one such scheme, already mentioned in the introduction, called invfreq in E [1]. The prover's manual [12] explains: "Sort symbols by frequency (frequently occurring symbols are smaller)." What is common to essentially all manually designed schemes is that they pick a certain scalar property of symbols (here it is the symbol frequency, i.e. the number of occurrences of the symbol in the given problem) and obtain a precedence by sorting the symbols using that property.

Decomposing We might want our system to also learn a certain property of symbols and use sorting to generate and suggest a precedence. However, it is not clear how to "extract" such a property from the observed data, since we only have access to the target values for full precedences.
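For concreteness, a frequency-based scheme in the spirit of invfreq can be sketched as follows (a hypothetical re-implementation for illustration, not E's actual code; the symbol names are made up):

```python
from collections import Counter

def invfreq_precedence(symbol_occurrences):
    """Order symbols by frequency, most frequent first (i.e. smallest in
    the induced ordering), breaking ties by name for determinism."""
    counts = Counter(symbol_occurrences)
    return sorted(counts, key=lambda s: (-counts[s], s))

# Toy problem: 'mul' occurs most often, so it comes first (is smallest).
occurrences = ["mul", "mul", "mul", "add", "add", "inv"]
print(invfreq_precedence(occurrences))  # ['mul', 'add', 'inv']
```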
Our idea for "decomposing" these values into pieces that somehow relate to individual symbols (and can thus be "transferred" across problems) is to take a detour using symbol pairs: we assume that the performance of the ATP on 𝑃 given 𝜋𝑃, i.e. our measure ast(𝑃, 𝜋𝑃), can be predicted from a sum of individual contributions corresponding to facts of the form "𝜋𝑃 orders the symbol 𝑠𝑖 before the symbol 𝑠𝑗". This is in line with how a prover developer could reason about a precedence generating scheme: even when it is not clear how good or bad a symbol is in absolute terms, one might have an intuition that a symbol from a certain class should preferably come before a symbol from another class in the precedence (e.g., symbols introduced during clausification should typically be smaller than others) and assign some weight to this piece of intuition. In Section 4.2 we formalize this idea using the notion of a preference matrix and show how, for each problem in isolation, such a preference matrix can be obtained in the form of coefficients learned by linear regression.

Learning across problems Symbol preferences learned on a particular problem are inherently tied to that problem and do not immediately carry over to other problems. The main reason for this is that symbols themselves only appear in the context of a particular problem.6 That is why we resort to representing symbols by their features (cf. Section 4.3.2) when aggregating the learned preferences across different problems. This is explained in more detail below and, formally, in Section 4.3.

We also strive to ensure that the preference values across problems have comparable magnitudes. Note that ast(𝑃, 𝜋𝑃) may vary a lot for a fixed problem 𝑃, but all the more so across problems. To obtain commensurable values, we normalize (see Section 4.1) the prover performance data on a per-problem basis before learning the preferences.
Normalization also deals with supplying a concrete value to those runs which did not finish, i.e. have ast(𝑃, 𝜋𝑃) = ⊥.

6 On certain benchmarks, such as those coming from translations of mathematical libraries [19], symbols maintain identity and meaning across individual problems. However, since our goal in this work is to learn general theorem-proving knowledge, we do not use the assumption of aligned signatures.

"Second-order" regression Once the symbols are abstracted by their feature vectors, we can collect symbol preferences from all the tested problems and turn this collection into another regression task. Note that at this moment the preferences, which were obtained as the coefficients learned by linear regression, themselves become the regression target. Thus, in a certain sense, we now do second-order learning. It should be stressed, though, that while the learning of the preferences requires a linear regression model by design, this second-order regression does not need to be linear, and more sophisticated models can be experimented with. The details of this step are given in Section 4.3.

Preference prediction and optimization Once the second-order model has been learned, we can predict preferences for any pair of symbols based on their feature vectors and thus also predict, given a problem 𝑃, how many steps a prover will require to solve it using a particular precedence 𝜋𝑃. (For this second step, we reverse the idea of decomposition: we sum up those predicted preferences that correspond to pairs of symbols 𝑠𝑖, 𝑠𝑗 such that 𝜋𝑃 orders the symbol 𝑠𝑖 before the symbol 𝑠𝑗 – see Section 4.4 for details.) Having access to an estimate of performance for each precedence 𝜋𝑃, the final step is to look for a precedence 𝜋𝑃* that ideally minimizes the predicted performance measure over all the 𝑛! possible precedences on 𝑃's signature.
Since finding the true optimum could be computationally hard, we resort to using an approximation algorithm by Cohen et al. [20]. The algorithm is recalled in Section 4.4.2.

4. Architecture

4.1. Values of precedences

We define the base cost value cost_base(𝜋𝑃) of precedence 𝜋𝑃 on problem 𝑃 according to the outcome of the proof search configured to use this precedence:

• If the proof search terminates successfully, cost_base(𝜋𝑃) is the number of iterations of the saturation loop started during the proof search: cost_base(𝜋𝑃) = ast(𝑃, 𝜋𝑃).
• If the proof search fails (meaning that ast(𝑃, 𝜋𝑃) = ⊥), then cost_base(𝜋𝑃) is the maximum number of saturation loop iterations encountered in successful training proof searches on this problem: cost_base(𝜋𝑃) = max_{𝜋𝑃′ ∈ Π𝑃⁺} ast(𝑃, 𝜋𝑃′), where Π𝑃⁺ is the set of all training precedences on problem 𝑃 that yield a successful proof search.

We further normalize the cost values by the following operations:

1. Logarithmic scaling: For each solvable problem, running proof search with uniformly random predicate precedences reveals a distribution of abstract solving times on successful executions. Examining these distributions for various problems suggests that they are usually approximately log-normal. To make further scaling by standardization reasonable, we first transform the base costs by taking their logarithm.
2. Standardization: Independently for each problem, we apply an affine transformation so that the resulting cost values have mean 0 and standard deviation 1. This ensures that the values are comparable across problems.

Let cost_std(𝜋𝑃) denote the resulting cost value of precedence 𝜋𝑃 after the scaling and standardization.

4.2. Problem preference matrix learning

Given a problem 𝑃 with 𝑛 symbols, a preference matrix 𝑊𝑃 is any matrix over R of size 𝑛 × 𝑛.
We define the proxy cost of precedence 𝜋𝑃 under preference matrix 𝑊𝑃 to be the sum of the preference values (𝑊𝑃)𝑖,𝑗 of all symbol pairs 𝑠𝑖, 𝑠𝑗 such that 𝜋𝑃 orders 𝑠𝑖 before 𝑠𝑗:

cost_proxy(𝜋𝑃, 𝑊𝑃) = Σ𝑖,𝑗 [[𝜋𝑃⁻¹(𝑖) < 𝜋𝑃⁻¹(𝑗)]] (𝑊𝑃)𝑖,𝑗 = O(𝜋𝑃)⃗ · 𝑊𝑃⃗,

where O(𝜋𝑃)⃗ · 𝑊𝑃⃗ is the dot product of the flattened matrices O(𝜋𝑃) and 𝑊𝑃.

For any given problem, we can uniformly sample precedences 𝜋𝑃 to form the training set 𝑇 = {(𝜋𝑃1, cost_std(𝜋𝑃1)), (𝜋𝑃2, cost_std(𝜋𝑃2)), . . . , (𝜋𝑃𝑚, cost_std(𝜋𝑃𝑚))}. Having such a training set allows us to find a vector 𝑊𝑃⃗ that minimizes the mean squared error

(1/𝑚) Σ_{(𝜋𝑃, cost_std(𝜋𝑃)) ∈ 𝑇} (cost_proxy(𝜋𝑃, 𝑊𝑃) − cost_std(𝜋𝑃))²

by linear regression. Minimizing the mean squared error directly may lead to overfitting to the training set, especially for problems whose signature is relatively large in comparison to the size of the training set. To improve generalization, we use the Lasso regression algorithm [21] instead of standard linear regression. We use cross-validation to set the value of the regularization hyperparameter.7 Another reason to use the Lasso algorithm is that it performs regularization by imposing a penalty on coefficients with large absolute value, effectively shrinking the coefficients that correspond to symbol pairs whose mutual order does not affect cost_std(𝜋𝑃). We can use this property to interpret the absolute value of a preference value as a measure of the importance of a given symbol pair. In the following sections we assume that the preference matrix 𝑊𝑃 we find by Lasso regression yields a cost_proxy that approximates cost_std well.

4.3. General preference matrix learning

We proceed to cast the task of finding a good preference matrix 𝑊𝑃 for an arbitrary problem as a regression on feature vector representations of symbol pairs.
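The per-problem stage of Sections 4.1–4.2 (normalization of costs, then Lasso fitting over flattened order matrices) can be sketched as follows. This is a simplified sketch: synthetic data stand in for actual prover runs, and scikit-learn's LassoCV supplies the cross-validated regularization mentioned above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, m = 5, 200                    # signature size, number of sampled precedences

def order_matrix(pi):
    """O(pi)[i, j] = 1 iff pi orders symbol i before symbol j (0-based)."""
    pos = np.argsort(pi)         # pos[i] = pi^{-1}(i), the position of symbol i
    return (pos[:, None] < pos[None, :]).astype(float)

# Sample random precedences; synthesize costs from a hidden preference
# matrix so that the regression target is learnable in this toy example.
W_true = rng.normal(size=(n, n))
pis = [rng.permutation(n) for _ in range(m)]
X = np.stack([order_matrix(pi).ravel() for pi in pis])   # flattened O(pi)
cost = X @ W_true.ravel() + 0.01 * rng.normal(size=m)

# Normalization in the spirit of Section 4.1: log-scale, then standardize
# (shifted to be positive before the logarithm, an assumption of this toy).
cost = np.log(cost - cost.min() + 1.0)
cost = (cost - cost.mean()) / cost.std()

# Lasso regression recovers the (flattened) preference matrix W_P.
model = LassoCV(cv=5).fit(X, cost)
W_P = model.coef_.reshape(n, n)
print(W_P.shape)                 # (5, 5)
```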
To accomplish this, we need to be able to represent each pair of symbols by a feature vector and to know target preference values for pairs of symbols in a training problem set.

4.3.1. Target preference values

For each problem 𝑃 in the training problem set, we find a problem preference matrix by the method outlined in Section 4.2. The target value of an arbitrary pair of symbols 𝑠𝑖, 𝑠𝑗 in 𝑃 is (𝑊𝑃)𝑖,𝑗.

4.3.2. Symbol pair embedding

We represent each symbol by a numeric feature vector that consists of the following components: symbol arity, the number of symbol occurrences in the problem, the number of clauses in the problem that contain at least one occurrence of the symbol, an indicator of occurrence in a conjecture clause, an indicator of occurrence in a unit clause, and an indicator of being introduced during clausification. This choice of symbol features is motivated by the fact that they are readily available in Vampire and that they suffice as a basis for common precedence generation schemes, such as the invfreq scheme. We denote the feature vector corresponding to symbol 𝑠 as fv(𝑠). We represent a pair of symbols 𝑠, 𝑡 by the concatenation of their feature vectors [fv(𝑠), fv(𝑡)].

4.3.3. Training data

The general preference regressor is trained on samples of the following structure:

• the input: [fv(𝑠𝑖), fv(𝑠𝑗)] – the embedding of a symbol pair 𝑠𝑖, 𝑠𝑗 in problem 𝑃,
• the target: (𝑊𝑃)𝑖,𝑗 – the element of the preference matrix learned for problem 𝑃 that corresponds to the symbol pair (𝑠𝑖, 𝑠𝑗).

We sample problem 𝑃 from the training problem set with uniform probability. Thanks to how 𝑊𝑃 is constructed (see Section 4.2), preference values close to 0 are associated with symbol pairs whose mutual order has little effect on the outcome of the proof search.

7 See the model LassoCV in the machine learning library scikit-learn [22].
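The assembly of these training samples can be sketched as follows (the symbol feature vectors and the learned preference matrix are made-up stand-ins for data extracted from an actual problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-problem data: one feature vector per symbol (6 features,
# as in Section 4.3.2) and a previously learned preference matrix W_P.
n = 4
fv = rng.normal(size=(n, 6))     # fv[i] = feature vector of symbol s_i
W_P = rng.normal(size=(n, n))

# One candidate sample per ordered symbol pair (i, j), i != j:
# input = [fv(s_i), fv(s_j)], target = W_P[i, j].
inputs, targets = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            inputs.append(np.concatenate([fv[i], fv[j]]))
            targets.append(W_P[i, j])

X = np.stack(inputs)             # shape (n*(n-1), 12)
y = np.array(targets)

# Sampling weights proportional to |target| focus the regressor on
# pairs whose mutual order matters, as described in the text.
weights = np.abs(y)
print(X.shape)                   # (12, 12)
```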
To focus the training on the symbol pairs whose order does matter, we weight the samples by the absolute value of the target. More precisely, given a problem 𝑃, the probability of sampling the symbol pair 𝑖, 𝑗 is proportional to the absolute target value |(𝑊𝑃)𝑖,𝑗|. Experiments have shown that using sample weighting improves the performance of the resulting model (see Section 5.2). We denote the trained model as 𝑀 and its prediction of the preference value of the symbol pair 𝑠𝑖, 𝑠𝑗 as 𝑀([fv(𝑠𝑖), fv(𝑠𝑗)]).

4.4. Precedence construction

When presented with a new problem 𝑃 = (Σ, Cl), we propose a symbol precedence by taking the following steps:

1. Estimate a preference matrix Ŵ𝑃.
2. Construct a precedence 𝜋̂𝑃 that approximately minimizes cost_proxy(𝜋̂𝑃, Ŵ𝑃).

4.4.1. Preference matrix construction

To construct a preference matrix Ŵ𝑃 for a new problem 𝑃, we evaluate the general preference regressor on the feature vectors of all symbol pairs in 𝑃. More specifically, (Ŵ𝑃)𝑖,𝑗 = 𝑀([fv(𝑠𝑖), fv(𝑠𝑗)]) for all 𝑠𝑖, 𝑠𝑗 ∈ Σ. At this moment, one can use Ŵ𝑃 to estimate the cost of an arbitrary symbol precedence.

4.4.2. Precedence construction from preference matrix

The remaining task is, given a preference matrix Ŵ𝑃, to find a precedence 𝜋̂𝑃 that minimizes cost_proxy(𝜋̂𝑃, Ŵ𝑃). Since this task is NP-hard in general [20], we rely in this work on a greedy 2-approximation algorithm proposed by Cohen et al. [20]. The rest of this section provides a brief description of the algorithm.

The algorithm maintains a partially constructed symbol precedence p ∈ N* (a finite sequence over N; initially empty), a set of available symbols Σavail ⊆ Σ (initially the whole Σ), and a potential value 𝑐 : Σavail → R for each of the available symbols.
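A sketch of this greedy construction (the potential of a symbol, formalized next in the text, is taken to be its summed outgoing minus incoming preferences over the remaining symbols; the preference matrix here is made up):

```python
import numpy as np

def greedy_precedence(W):
    """Greedily order n symbols to (approximately) minimize the proxy cost
    under preference matrix W: repeatedly pick the available symbol with
    the smallest potential, where potential(i) is the sum of W[i, j] minus
    the sum of W[j, i] over the remaining symbols j."""
    n = W.shape[0]
    avail = list(range(n))
    p = []
    while avail:
        # naive O(n^2) recomputation of potentials each round, for clarity
        best = min(avail,
                   key=lambda i: sum(W[i, j] - W[j, i] for j in avail))
        avail.remove(best)
        p.append(best)
    return p

# Toy preference matrix: putting symbol 0 before symbol 1 is penalized
# (W[0, 1] = 2), putting 1 before 2 is mildly penalized, and so on.
W = np.array([[ 0.0,  2.0, 0.0],
              [-2.0,  0.0, 1.0],
              [ 0.0, -1.0, 0.0]])
print(greedy_precedence(W))      # [1, 0, 2]
```

Note that the result is only an approximation of the optimum; ties between equal potentials are broken arbitrarily here.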
The potential value of a symbol corresponds to the relative increase in proxy cost associated with selecting the symbol as the next to append to the partial precedence:

𝑐(𝑠𝑖) = Σ_{𝑠𝑗 ∈ Σavail} (Ŵ𝑃)𝑖,𝑗 − Σ_{𝑠𝑗 ∈ Σavail} (Ŵ𝑃)𝑗,𝑖

In each iteration, a symbol 𝑠𝑖 with the smallest potential is selected from Σavail. This symbol is removed from Σavail and its index 𝑖 is appended to the partial precedence p. The potentials of the remaining symbols in Σavail are updated. This process is repeated until all symbols have been selected, yielding the final p as 𝜋̂𝑃.

5. Evaluation

5.1. Setup

Since the simplification orderings under consideration (LPO and KBO) never use the symbol precedence to compare a predicate symbol with a function symbol, we can break down the symbol precedence into a predicate precedence and a function precedence. In this paper, we restrict our attention to predicate precedences, leaving function symbols to be ordered by the invfreq scheme. A more thorough evaluation of both predicate and function precedences and their interaction is left for future work.

We use problems from the TPTP library v7.2.0 [14] for the evaluation. Let 𝒫train be the set of all FOL and CNF problems in TPTP with at most 200 predicate symbols such that at least 1 out of 24 random predicate precedences leads to a successful proof search (|𝒫train| = 8217). Let 𝒫test be the set of all FOL and CNF problems in TPTP with at most 1024 predicate symbols (|𝒫test| = 15751). In each of 5 evaluation iterations (splits), we sample 1000 training problems from 𝒫train and 1000 test problems from 𝒫test uniformly, in a way that ensures that the two sets do not overlap. We repeat the evaluation 5 times to evaluate the stability of the training.
On each training problem we run Vampire with 100 uniformly random predicate precedences and a strategy fixed up to the predicate precedence.8 We limit the time to 10 seconds per execution, which is in our experience with Vampire sufficient to exhibit interesting behavior. Note that we use a customized version of Vampire to extract a symbol table from each of the problems.9

8 Time limit: 10 seconds, memory limit: 8192 MB, literal comparison mode: predicate, function symbol precedence: invfreq, saturation algorithm: discount, age-weight ratio: 1:10, AVATAR: disabled.
9 https://github.com/filipbartek/vampire/tree/926154f2

Table 1
Experiment results

Case                                    Successes out of 1000 per split
                                          0    1    2    3    4   Mean    Std
Best of 10 random                       514  499  509  511  476  501.8  13.85
invfreq                                 494  472  480  481  452  475.8  13.83
Elastic-Net                             484  471  479  470  454  471.6  10.21
Gradient Boosting                       473  462  475  475  439  464.8  13.78
Elastic-Net without sample weighting    475  454  465  459  453  461.2   8.11
Random                                  454  455  457  456  430  450.4  10.25

After we fit a preference matrix on each of the training problems (see Section 4.2), we create a batch of 10^6 symbol pair feature vectors with target values to train the general preference regressor (see Section 4.3). We evaluate the trained model by running Vampire on the test problem set with predicate precedences proposed by the trained model, counting the number of successfully solved problems.

A collection of scripts created for the experimental evaluation can be found in the Git repository at https://github.com/filipbartek/vampire-ml/tree/75c693f3. The measurements presented below can be performed by running the script map-reduce/paar2020/run.sh.

5.2. Experimental results

We trained two types of general preference regressors (see Section 4.3):

• Elastic-Net – a linear regression model with L1 and L2 norm regularization; see ElasticNetCV in Pedregosa et al.
[22]
• Gradient Boosting regressor – see GradientBoostingRegressor in Pedregosa et al. [22]

We compared the performance of the regressors with three baseline precedence generation schemes: a random precedence, the best of 10 random precedences, and the invfreq scheme. Table 1 shows the results of the evaluation on 1000 problems for 5 random choices of training and test problem sets (splits). The case "Elastic-Net without sample weighting" shows the effect of sampling the symbol pairs uniformly. Inspection of the trained feature coefficients reveals that the fitting ends up with an all-zero feature weight vector on splits 1 and 2, signifying a complete failure to learn on these training sets.

Using Elastic-Net for general preference prediction on average nearly matches the performance of Vampire with the invfreq precedence scheme. While Elastic-Net performs significantly better than a random precedence generator, it still performs significantly worse than a generator that, given a problem, tries 10 random precedences and chooses the best of these. This suggests that there is space for improvement, possibly with a more sophisticated, non-linear model. Plugging in a Gradient Boosting regressor does not show immediate improvement, so more elaborate feature extraction may be necessary.

Table 2
Elastic-Net feature coefficients after fitting on each of the 5 training sets of 1000 problems and on the whole 𝒫train. "Frequency" is the number of occurrences of the symbol in the problem. "Unit frequency" is the number of clauses in the problem that contain at least one occurrence of the symbol. Each row lists the non-zero coefficients in column reading order (left symbol: arity, frequency, unit frequency; then right symbol: arity, frequency, unit frequency); zero entries are omitted.

Training set    Non-zero coefficients
0               −0.01  −0.98  0.01
1               −0.48  0.08  0.44
2               −0.64  0.36
3               0.88  −0.03  0.01  −0.03  0.05
4               −0.62  0.30  0.07
𝒫train          −0.57  0.43

5.3.
Feature coefficients

Since Elastic-Net is a linear regression model, we can easily inspect the coefficients it assigns to the input features (see Section 4.3.2). In each of the five splits, the final coefficients of the three indicator features (namely the indicators of presence in a conjecture clause, presence in a unit clause and being introduced during clausification) are 0. Table 2 shows the fitted non-zero coefficients of the remaining features. The coefficients were scaled so that their absolute values sum up to 1. Note that scaling the coefficients by a constant does not affect the precedence constructed using the greedy algorithm presented in Section 4.4.2.

It is worth pointing out that the regressor fitted on the whole 𝒫train and on the training sets 1, 2 and 4 assigns a high preference value to symbol pairs (𝑠, 𝑡) such that 𝑡 has a higher frequency and unit frequency than 𝑠. Since unit frequency is positively correlated with frequency, minimizing cost_proxy using this fitted regressor is consistent with the invfreq precedence generating scheme (ordering the symbols by frequency in descending order). Similarly, the model fitted on training sets 0 and 3 corresponds to ordering the symbols by arity in ascending order.

6. Conclusion

This paper is, to the best of our knowledge, a first attempt to use machine learning for proposing symbol precedences for an ATP. This appears to be a potentially highly rewarding task, with access to an effectively unlimited amount of training data generated on demand. Nevertheless, the journey from evaluating the prover on random precedences to proposing a good precedence when presented with a new problem is not straightforward, and several conceptual gaps need to be bridged to connect these two tasks algorithmically.
In this paper, we proposed a connection using the concept of pairwise symbol preferences that, as we have shown, can be learned as the coefficients of a linear regression model for which an order matrix provides the features of a precedence understood as a permutation. In a second stage, in which symbols are abstracted by their features, the preferences themselves become regression targets.

In our initial experiments reported in this paper, the performance of our system does not yet reach that of the human-designed heuristic invfreq. We believe, however, that further improvements are possible by using a more advanced regression model for the second stage and/or by further hyper-parameter tuning (e.g. of the Gradient Boosting model). Ultimately, we expect to gain the most by using a richer set of symbol features, ideally automatically extracted from the problems using graph neural networks [23].

Acknowledgments

Supported by the ERC Consolidator grant AI4REASON no. 649043 under the EU-H2020 programme, the Czech Science Foundation project 20-06390Y and the Grant Agency of the Czech Technical University in Prague, grant no. SGS20/215/OHK3/3T/37.

References

[1] S. Schulz, S. Cruanes, P. Vukmirović, Faster, higher, stronger: E 2.3, in: P. Fontaine (Ed.), Automated Deduction – CADE 27, number 11716 in Lecture Notes in Computer Science, Springer, 2019, pp. 495–507. doi:10.1007/978-3-030-29436-6_29.
[2] C. Weidenbach, D. Dimova, A. Fietzke, R. Kumar, M. Suda, P. Wischnewski, SPASS version 3.5, in: R. A. Schmidt (Ed.), Automated Deduction – CADE-22, number 5663 in Lecture Notes in Computer Science, 2009, pp. 140–145. doi:10.1007/978-3-642-02959-2_10.
[3] L. Kovács, A. Voronkov, First-order theorem proving and Vampire, in: N. Sharygina, H. Veith (Eds.), Computer Aided Verification, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 1–35. doi:10.1007/978-3-642-39799-8_1.
[4] R. Nieuwenhuis, A.
Rubio, Paramodulation-based theorem proving, in: J. A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning (in 2 volumes), Elsevier and MIT Press, 2001, pp. 371–443. doi:10.1016/b978-044450813-3/50009-6. [5] G. Robinson, L. Wos, Paramodulation and theorem-proving in first-order theories with equality, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 298–313. doi:10.1007/978-3-642-81955-1_19. [6] D. E. Knuth, P. B. Bendix, Simple word problems in universal algebras, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 342–376. doi:10. 1007/978-3-642-81955-1_23. [7] S. N. Kamin, J. Lévy, Two generalizations of the recursive path ordering, 1980. URL: http: //www.cs.tau.ac.il/~nachumd/term/kamin-levy80spo.pdf, unpublished letter to Nachum Dershowitz. [8] L. Kovács, G. Moser, A. Voronkov, On transfinite Knuth-Bendix orders, in: N. Bjørner, V. Sofronie-Stokkermans (Eds.), Automated Deduction – CADE-23, number 6803 in Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 384– 399. doi:10.1007/978-3-642-22438-6_29. [9] L. Bachmair, H. Ganzinger, Rewrite-based equational theorem proving with selection 32 Filip Bártek et al. CEUR Workshop Proceedings 21–33 and simplification, Journal of Logic and Computation 4 (1994) 217–247. doi:10.1093/ logcom/4.3.217. [10] G. S. Tseitin, On the complexity of derivation in propositional calculus, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 466–483. doi:10. 1007/978-3-642-81955-1_28. [11] G. Reger, M. Suda, A. Voronkov, New techniques in clausal form generation, in: C. Benzmüller, G. Sutcliffe, R. 
Rojas (Eds.), GCAI 2016. 2nd Global Conference on Ar- tificial Intelligence, volume 41 of EPiC Series in Computing, EasyChair, 2016, pp. 11–23. URL: https://easychair.org/publications/paper/XncX. doi:10.29007/dzfz. [12] S. Schulz, E 2.4 User Manual, 2019. URL: http://wwwlehre.dhbw-stuttgart.de/~sschulz/ WORK/E_DOWNLOAD/V_2.4/eprover.pdf. [13] G. Reger, M. Suda, Measuring progress to predict success: Can a good proof strat- egy be evolved?, in: AITP 2017, 2017, pp. 20–21. URL: http://aitp-conference.org/2017/ aitp17-proceedings.pdf. [14] G. Sutcliffe, The TPTP problem library and associated infrastructure. From CNF to TH0, TPTP v6.4.0, Journal of Automated Reasoning 59 (2017) 483–502. doi:10.1007/ s10817-017-9407-7. [15] A. Nonnengart, C. Weidenbach, Computing small clause normal forms, in: J. A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning (in 2 volumes), Elsevier and MIT Press, 2001, pp. 335–367. URL: https://doi.org/10.1016/b978-044450813-3/50008-4. doi:10.1016/b978-044450813-3/50008-4. [16] A. Riazanov, A. Voronkov, Limited resource strategy in resolution theorem proving, Journal of Symbolic Computation 36 (2003) 101–115. doi:10.1016/S0747-7171(03)00040-3. [17] K. E. Iverson, A Programming Language, John Wiley & Sons, Inc., New York, NY, USA, 1962. URL: https://dl.acm.org/doi/book/10.5555/1098666. [18] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer New York Inc., New York, NY, USA, 2009. doi:10.1007/b94608. [19] C. Kaliszyk, J. Urban, MizAR 40 for Mizar 40, Journal of Automated Reasoning 55 (2015) 245–256. doi:10.1007/s10817-015-9330-8. [20] W. W. Cohen, R. E. Schapire, Y. Singer, Learning to order things, Journal Of Artificial Intelligence Research 10 (1999) 243–270. doi:10.1613/jair.587. arXiv:1105.5464. [21] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58 (1996) 267–288. 
URL: http://www.jstor.org/ stable/2346178. [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: https://scikit-learn.org/. [23] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems (2020) 1–21. doi:10.1109/tnnls.2020.2978386. 33