Learning Precedences from Simple Symbol Features

Filip Bártek^{a,b}, Martin Suda^{a}

^a Czech Technical University in Prague – Czech Institute of Informatics, Robotics and Cybernetics. Jugoslávských partyzánů 1580/3, 160 00 Praha 6 – Dejvice, Czech Republic
^b Czech Technical University in Prague – Faculty of Electrical Engineering. Technická 2, 166 27 Praha 6 – Dejvice, Czech Republic

Abstract
A simplification ordering, typically specified by a symbol precedence, is one of the key parameters of the superposition calculus, contributing to shaping the search space navigated by a saturation-based automated theorem prover. Thus the choice of a precedence can have a great impact on the prover's performance. In this work, we design a system for proposing symbol precedences that should lead to solving a problem quickly. The system relies on machine learning to extract this information from past successful and unsuccessful runs of a theorem prover over a set of problems and randomly sampled precedences. It uses a small set of simple human-engineered symbol features as the sole basis for discriminating the symbols. This allows for a direct comparison with precedence generation schemes designed by prover developers.

Keywords
saturation-based theorem proving, simplification orderings, symbol precedences, machine learning

1. Introduction

Modern saturation-based automated theorem provers (ATPs) such as E [1], SPASS [2] or Vampire [3] use the superposition calculus [4] as their underlying inference system. Superposition is built around the paramodulation inference [5], crucially constrained by a simplification ordering on terms and literals, which is supplied as a parameter of the calculus.
Both of the two main classes of simplification orderings used in practice, i.e., the Knuth-Bendix Ordering [6] and the Lexicographic Path Ordering [7], are mainly determined by a symbol precedence, a (partial) ordering on the signature symbols.1 While the superposition calculus is known [9] to be refutationally complete for any simplification ordering, the choice of the precedence may have a significant impact on how long it takes to solve a given problem. In a well-known example, prioritizing in the precedence the predicates introduced during the Tseitin transformation of an input formula [10] exposes the corresponding literals to resolution inferences during early stages of the proof search, with the effect of essentially undoing the transformation and thus threatening the exponential blow-up that the transformation is designed to prevent [11]. ATPs typically offer a few heuristic schemes for generating the symbol precedences. For example, the successful invfreq scheme in E [12] orders the symbols by the number of occurrences in the input problem, prioritizing symbols that occur the least often for early inferences.

1 KBO is further parameterized by symbol weights, but our reference implementation in Vampire [3] uses for efficiency reasons only weights equal to one [8], and so we do not consider this parameter here.

PAAR 2020: Seventh Workshop on Practical Aspects of Automated Reasoning, June 29–30, 2020, Paris, France (virtual)
filip.bartek@cvut.cz (F. Bártek); martin.suda@cvut.cz (M. Suda)
0000-0002-1822-2651 (F. Bártek); 0000-0003-0989-5800 (M. Suda)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Filip Bártek et al. CEUR Workshop Proceedings 21–33
Experiments with random precedences have shown that the existing schemes often fail to come close to the optimum precedence [13], revealing a large potential for further improvements.

In this work, we design a system that, when presented with a First-Order Logic (FOL) problem, proposes a symbol precedence that will likely lead to solving the problem quickly. The system relies on the techniques of supervised machine learning and extracts such theorem-proving knowledge from successful (and unsuccessful) runs of the Vampire theorem prover [3] when run over a variety of FOL problems equipped with randomly sampled symbol precedences. We assume that by learning to solve already solvable problems quickly, the acquired knowledge will generalize and help solve problems previously out of reach. As a first step in a more ambitious project, we focus here on representing the symbols in a problem by a fixed set of simple human-engineered features (such as the number of occurrences used by the invfreq scheme mentioned above)2 and, to simplify the experimental setup, we restrict our attention to learning precedences for predicate symbols only.3

Learning to predict good precedences poses several interesting challenges that we address in this work. First, it is not immediately clear how to characterize a precedence, a permutation of a finite set of symbols, by a real-valued feature vector to serve as an input for a learning algorithm. Additionally, to be able to generalize across problems, we need to do so in a way that does not presuppose a fixed signature size. A further complication when sampling different problems is that some problems may be easy to solve under almost every precedence while others are hard; in theorem proving, running times typically vary considerably.
Finally, even with a regression model ready to predict the prover's performance under a particular precedence 𝜋, we still need to solve the task of finding an optimum precedence 𝜋* according to this model, which cannot simply be solved by enumerating all the permutations and running the prediction for each, due to their huge number.

Our way of addressing the challenges sketched above lies in using pairwise symbol preferences to characterize a precedence, in normalizing the target prover run times on a per-problem basis, and in the use of "second-order" learning of the preferences for symbols abstracted by their features. These concepts are introduced in Section 3 and later formalized in Section 4. Section 5 presents the results of our experimental evaluation of the proposed technique over the TPTP [14] benchmark. We start our exposition by fixing the notation and basic concepts in Section 2.

2. Preliminaries

We assume that the reader is familiar with basic concepts used in first-order logic (FOL) theorem proving. We use this section to recall and formalize the key notions relevant for our work.

Problem A (first-order) problem is a pair 𝑃 = (Σ, Cl), where Σ = (𝑠1, 𝑠2, . . . , 𝑠𝑛) is a list of (predicate and function) symbols called the signature, and Cl is a set of first-order clauses built over the symbols of Σ. The problem is either given directly by the user or could be the result of clausifying a general FOL formula 𝜙, in which case we know which of the symbols were introduced during the clausification (namely during Tseitin transformation and skolemization; see e.g. Nonnengart and Weidenbach [15]) and which occurred in the conjecture (if it was present).

2 Automatic feature extraction using neural networks is planned for future work.
3 Our theoretical considerations, however, apply equally to learning function symbol precedences.

Precedence Given a problem 𝑃 = (Σ, Cl) with Σ = (𝑠1 , 𝑠2 , . . .
, 𝑠𝑛 ), a precedence 𝜋𝑃 is a permutation, i.e. a bijective mapping, of the set of indices {1, . . . , 𝑛}. A precedence 𝜋𝑃 determines a (total) ordering on Σ as follows: 𝑠𝜋𝑃(1) < 𝑠𝜋𝑃(2) < . . . < 𝑠𝜋𝑃(𝑛).

Simplification Orderings are orderings on terms used to parameterize the superposition calculus [4] employed by modern saturation-based theorem provers. The two classes of simplification orderings most commonly used in practice, the Knuth-Bendix Orderings [6] and the Lexicographic Path Orderings [7], are both defined in terms of a user-supplied (possibly partial) ordering < on the given problem's signature Σ. In this work, we assume that the theorem prover uses a simplification ordering from one of these two classes, relying on the ordering on Σ determined by a precedence 𝜋𝑃 to construct such a simplification ordering.

Performance measure A saturation-based ATP solves a problem 𝑃 = (Σ, Cl) (under a particular fixed strategy and a determined symbol precedence 𝜋𝑃) by either

• deriving from Cl a contradiction in the form of the empty clause, in which case 𝑃 is shown unsatisfiable, or
• finitely saturating the set of clauses Cl without deriving the contradiction, in which case 𝑃 is shown satisfiable.4

In both cases, we take the number of iterations of the employed saturation algorithm (see, e.g., Riazanov and Voronkov [16] for an overview) as a measure of the effort that the ATP took to solve the problem. We refer to this measure as the abstract solving time and denote it ast(𝑃, 𝜋𝑃).5 In practice, an ATP can also run out of resources, typically out of the allocated time. In that case, the abstract solving time is undefined: ast(𝑃, 𝜋𝑃) = ⊥. While it may happen that running an ATP with the same problem and symbol precedence two times yields a different result each time (namely succeeding one time and failing another time), such cases are rare and we ensure they do not interfere with the learning process by caching the results.
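As a minimal illustration of how a precedence induces the ordering on Σ (a toy sketch with made-up symbol names, using 0-based indices in place of the paper's 1-based ones):

```python
# A precedence pi over indices {0, ..., n-1} induces a total order on the
# signature: sigma[pi[0]] < sigma[pi[1]] < ... (smallest symbol first).

sigma = ["f", "g", "h", "p"]      # toy signature
pi = [2, 0, 3, 1]                 # a permutation of the indices

ordered = [sigma[i] for i in pi]  # symbols listed from smallest to largest
print(ordered)                    # ['h', 'f', 'p', 'g']
```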
Order matrix Given a permutation 𝜋 of the set of indices {1, . . . , 𝑛}, the order matrix O(𝜋) is a binary matrix of size 𝑛 × 𝑛 defined in the following manner:

O(𝜋)𝑖,𝑗 = [[𝜋⁻¹(𝑖) < 𝜋⁻¹(𝑗)]],

where we use [[𝑃]] to denote the Iverson bracket [17] applied to a proposition 𝑃, evaluating to 1 if 𝑃 is true, and 0 otherwise. In other words, for a symbol precedence 𝜋𝑃, O(𝜋𝑃)𝑖,𝑗 = 1 if the precedence 𝜋𝑃 orders the symbol 𝑠𝑖 before the symbol 𝑠𝑗, and O(𝜋𝑃)𝑖,𝑗 = 0 otherwise.

Flattened matrix Given a matrix 𝑀 of size 𝑛 × 𝑛, 𝑀⃗ is the vector of length 𝑛² obtained by flattening 𝑀:

𝑀⃗(𝑖−1)𝑛+𝑗 = 𝑀𝑖,𝑗 for every 𝑖, 𝑗 ∈ {1, . . . , 𝑛}.

For our use, the exact way of mapping the matrix elements to the vector indices is not important. We mostly just need a vector representation of the data contained in the given matrix to have access to the dot product operation.

Linear regression is an approach to modeling the relationship between scalar target values 𝑦𝑖 ∈ R and one or more input variables x𝑖 = (𝑥𝑖,1, . . . , 𝑥𝑖,𝑘), 𝑖 = 1, . . . , 𝑛. The relationship is modeled using a linear predictor function ŷ𝑖 = x𝑖 · w + 𝑏, whose unknown model parameters w ∈ R^𝑘 and 𝑏 ∈ R are estimated from the data. We call the vector w the coefficients of the model and 𝑏 the intercept. Most commonly, the parameters are picked to minimize the so-called mean squared error

MSE = (1/𝑛) Σ_{𝑖=1}^{𝑛} (ŷ𝑖 − 𝑦𝑖)²,

but other norms are also possible [18].

4 We assume a refutationally complete calculus and saturation strategy.
5 The advantage of using abstract solving time is that it does not depend on the hardware used for the computation.

Basic assumptions For the discussion that follows, we assume a fixed ATP that uses the superposition calculus with a simplification ordering parameterized by a symbol precedence.
While the practical experiments described in Section 5 use the ATP Vampire [3], the model architecture does not assume a particular ATP and is compatible with any superposition-based ATP such as E [1], SPASS [2] or Vampire. Within the prover, a particular saturation strategy is fixed, including a time limit.

3. General considerations and overview

The aim of this work is to design a system that learns to suggest good symbol precedences to an ATP from observations of the ATP's performance on a class 𝒫train of problems with randomly sampled precedences. Given a problem 𝑃 = (Σ, Cl) with |Σ| = 𝑛, we consider a precedence 𝜋𝑃 good if it leads to a low ast(𝑃, 𝜋𝑃) among the 𝑛! possible precedences for 𝑃. Note that for problems with a signature with more than a few symbols, repeatedly running the prover with random precedences represents an effectively infinite source of training data.

Ideally, we would like to learn general theorem-proving knowledge, not too dependent on 𝒫train, which could later be explained and compared to precedence generation schemes manually designed by the prover developers. Let us quickly recall one such scheme, already mentioned in the introduction, called invfreq in E [1]. The prover's manual [12] explains: "Sort symbols by frequency (frequently occurring symbols are smaller)." What is common to essentially all manually designed schemes is that they pick a certain scalar property of symbols (here it is the symbol frequency, i.e. the number of occurrences of the symbol in the given problem) and obtain a precedence by sorting the symbols using that property.

Decomposing We might want our system to also learn a certain property of symbols and use sorting to generate and suggest a precedence. However, it is not clear how to "extract" such a property from the observed data, since we only have access to the target values for full precedences.
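For concreteness, a frequency-based scheme in the spirit of invfreq can be sketched as follows (a hypothetical re-implementation for illustration, not E's actual code; the symbol names are made up):

```python
from collections import Counter

def invfreq_precedence(symbol_occurrences):
    """Order symbols by frequency, most frequent first (i.e. smallest in
    the induced ordering), breaking ties by name for determinism."""
    counts = Counter(symbol_occurrences)
    return sorted(counts, key=lambda s: (-counts[s], s))

# Toy problem: 'mul' occurs most often, so it comes first (is smallest).
occurrences = ["mul", "mul", "mul", "add", "add", "inv"]
print(invfreq_precedence(occurrences))  # ['mul', 'add', 'inv']
```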
Our idea for "decomposing" these values into pieces that somehow relate to individual symbols (and can thus be "transferred" across problems) is to take a detour using symbol pairs: we assume that the performance of the ATP on 𝑃 given 𝜋𝑃, i.e. our measure ast(𝑃, 𝜋𝑃), can be predicted from a sum of individual contributions corresponding to facts of the form "𝜋𝑃 orders the symbol 𝑠𝑖 before the symbol 𝑠𝑗". This is in line with how a prover developer could reason about a precedence generating scheme: even when it is not clear how good or bad a symbol is in absolute terms, one might have an intuition that a symbol from a certain class should preferably come before a symbol from another class in the precedence (e.g., symbols introduced during clausification should typically be smaller than others) and assign some weight to this piece of intuition. In Section 4.2 we formalize this idea using the notion of a preference matrix and show how, for each problem in isolation, such a preference matrix can be obtained in the form of coefficients learned by linear regression.

Learning across problems Symbol preferences learned on a particular problem are inherently tied to that problem and do not immediately carry over to other problems. The main reason for this is that symbols themselves only appear in the context of a particular problem.6 That is why we resort to representing symbols by their features (cf. Section 4.3.2) when aggregating the learned preferences across different problems. This is explained in more detail below and, formally, in Section 4.3.

We also strive to ensure that the preference values across problems have comparable magnitudes. Note that ast(𝑃, 𝜋𝑃) may vary a lot for a fixed problem 𝑃, but all the more so across problems. To obtain commensurable values, we normalize (see Section 4.1) the prover performance data on a per-problem basis before learning the preferences.
Normalization also deals with supplying a concrete value to those runs which did not finish, i.e. have ast(𝑃, 𝜋𝑃) = ⊥.

6 On certain benchmarks, such as those coming from translations of mathematical libraries [19], symbols maintain identity and meaning across individual problems. However, since our goal in this work is to learn general theorem-proving knowledge, we do not use the assumption of aligned signatures.

"Second-order" regression Once the symbols are abstracted by their feature vectors, we can collect symbol preferences from all the tested problems and turn this collection into another regression task. Note that at this moment the preferences, which were obtained as the coefficients learned by linear regression, themselves become the regression target. Thus, in a certain sense, we now do second-order learning. It should be stressed, though, that while the learning of the preferences requires a linear regression model by design, this second-order regression does not need to be linear, and more sophisticated models can be experimented with. The details of this step are given in Section 4.3.

Preference prediction and optimization Once the second-order model has been learned, we can predict preferences for any pair of symbols based on their feature vectors and thus also predict, given a problem 𝑃, how many steps a prover will require to solve it using a particular precedence 𝜋𝑃. (For this second step, we reverse the idea of decomposition: we sum up those predicted preferences that correspond to pairs of symbols 𝑠𝑖, 𝑠𝑗 such that 𝜋𝑃 orders the symbol 𝑠𝑖 before the symbol 𝑠𝑗 – see Section 4.4 for details.) Having access to an estimate of performance for each precedence 𝜋𝑃, the final step is to look for a precedence 𝜋𝑃* that ideally minimizes the predicted performance measure over all the 𝑛! possible precedences on 𝑃's signature.
Since finding the true optimum could be computationally hard, we resort to using an approximation algorithm by Cohen et al. [20]. The algorithm is recalled in Section 4.4.2.

4. Architecture

4.1. Values of precedences

We define the base cost value cost_base(𝜋𝑃) of precedence 𝜋𝑃 on problem 𝑃 according to the outcome of the proof search configured to use this precedence:

• If the proof search terminates successfully, cost_base(𝜋𝑃) is the number of iterations of the saturation loop started during the proof search: cost_base(𝜋𝑃) = ast(𝑃, 𝜋𝑃).
• If the proof search fails (meaning that ast(𝑃, 𝜋𝑃) = ⊥), then cost_base(𝜋𝑃) is the maximum number of saturation loop iterations encountered in successful training proof searches on this problem: cost_base(𝜋𝑃) = max_{𝜋𝑃′ ∈ Π𝑃⁺} ast(𝑃, 𝜋𝑃′), where Π𝑃⁺ is the set of all training precedences on problem 𝑃 that yield a successful proof search.

We further normalize the cost values by the following operations:

1. Logarithmic scaling: For each solvable problem, running proof search with uniformly random predicate precedences reveals a distribution of abstract solving times on successful executions. Examining these distributions for various problems suggests that they are usually approximately log-normal. To make further scaling by standardization reasonable, we first transform the base costs by taking their logarithm.
2. Standardization: Independently for each problem, we apply an affine transformation so that the resulting cost values have mean 0 and standard deviation 1. This ensures that the values are comparable across problems.

Let cost_std(𝜋𝑃) denote the resulting cost value of precedence 𝜋𝑃 after the scaling and standardization.

4.2. Problem preference matrix learning

Given a problem 𝑃 with 𝑛 symbols, a preference matrix 𝑊𝑃 is any matrix over R of size 𝑛 × 𝑛.
We define the proxy cost of precedence 𝜋𝑃 under preference matrix 𝑊𝑃 to be the sum of the preference values (𝑊𝑃)𝑖,𝑗 of all symbol pairs 𝑠𝑖, 𝑠𝑗 such that 𝜋𝑃 orders 𝑠𝑖 before 𝑠𝑗:

cost_proxy(𝜋𝑃, 𝑊𝑃) = Σ𝑖,𝑗 [[𝜋𝑃⁻¹(𝑖) < 𝜋𝑃⁻¹(𝑗)]] (𝑊𝑃)𝑖,𝑗 = O(𝜋𝑃)⃗ · 𝑊𝑃⃗,

where O(𝜋𝑃)⃗ · 𝑊𝑃⃗ is the dot product of the flattened matrices O(𝜋𝑃) and 𝑊𝑃.

For any given problem, we can uniformly sample precedences 𝜋𝑃 to form the training set 𝑇 = {(𝜋𝑃1, cost_std(𝜋𝑃1)), (𝜋𝑃2, cost_std(𝜋𝑃2)), . . . , (𝜋𝑃𝑚, cost_std(𝜋𝑃𝑚))}. Having such a training set allows us to find a vector 𝑊𝑃⃗ that minimizes the mean squared error

(1/𝑚) Σ_{(𝜋𝑃, cost_std(𝜋𝑃)) ∈ 𝑇} (cost_proxy(𝜋𝑃, 𝑊𝑃) − cost_std(𝜋𝑃))²

by linear regression. Minimizing the mean squared error directly may lead to overfitting to the training set, especially for problems whose signature is relatively large in comparison to the size of the training set. To improve generalization, we use the Lasso regression algorithm [21] instead of standard linear regression. We use cross-validation to set the value of the regularization hyperparameter.7 Another reason to use the Lasso algorithm is that it performs regularization by imposing a penalty on coefficients with large absolute value, effectively shrinking the coefficients that correspond to symbol pairs whose mutual order does not affect cost_std(𝜋𝑃). We can use this property to interpret the absolute value of a preference value as a measure of the importance of a given symbol pair. In the following sections we assume that the preference matrix 𝑊𝑃 we find by Lasso regression yields a cost_proxy that approximates cost_std well.

4.3. General preference matrix learning

We proceed to cast the task of finding a good preference matrix 𝑊𝑃 for an arbitrary problem as a regression on feature vector representations of symbol pairs.
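The per-problem stage of Sections 4.1–4.2 (normalization of costs, then Lasso fitting over flattened order matrices) can be sketched as follows. This is a simplified sketch: synthetic data stand in for actual prover runs, and scikit-learn's LassoCV supplies the cross-validated regularization mentioned above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, m = 5, 200                    # signature size, number of sampled precedences

def order_matrix(pi):
    """O(pi)[i, j] = 1 iff pi orders symbol i before symbol j (0-based)."""
    pos = np.argsort(pi)         # pos[i] = pi^{-1}(i), the position of symbol i
    return (pos[:, None] < pos[None, :]).astype(float)

# Sample random precedences; synthesize costs from a hidden preference
# matrix so that the regression target is learnable in this toy example.
W_true = rng.normal(size=(n, n))
pis = [rng.permutation(n) for _ in range(m)]
X = np.stack([order_matrix(pi).ravel() for pi in pis])   # flattened O(pi)
cost = X @ W_true.ravel() + 0.01 * rng.normal(size=m)

# Normalization in the spirit of Section 4.1: log-scale, then standardize
# (shifted to be positive before the logarithm, an assumption of this toy).
cost = np.log(cost - cost.min() + 1.0)
cost = (cost - cost.mean()) / cost.std()

# Lasso regression recovers the (flattened) preference matrix W_P.
model = LassoCV(cv=5).fit(X, cost)
W_P = model.coef_.reshape(n, n)
print(W_P.shape)                 # (5, 5)
```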
To accomplish this, we need to be able to represent each pair of symbols by a feature vector and to know target preference values for pairs of symbols in a training problem set.

4.3.1. Target preference values

For each problem 𝑃 in the training problem set, we find a problem preference matrix by the method outlined in Section 4.2. The target value of an arbitrary pair of symbols 𝑠𝑖, 𝑠𝑗 in 𝑃 is (𝑊𝑃)𝑖,𝑗.

4.3.2. Symbol pair embedding

We represent each symbol by a numeric feature vector that consists of the following components: symbol arity, the number of symbol occurrences in the problem, the number of clauses in the problem that contain at least one occurrence of the symbol, an indicator of occurrence in a conjecture clause, an indicator of occurrence in a unit clause, and an indicator of being introduced during clausification. This choice of symbol features is motivated by the fact that they are readily available in Vampire and that they suffice as a basis for common precedence generation schemes, such as the invfreq scheme. We denote the feature vector corresponding to symbol 𝑠 as fv(𝑠). We represent a pair of symbols 𝑠, 𝑡 by the concatenation of their feature vectors [fv(𝑠), fv(𝑡)].

4.3.3. Training data

The general preference regressor is trained on samples of the following structure:

• the input: [fv(𝑠𝑖), fv(𝑠𝑗)] – the embedding of a symbol pair 𝑠𝑖, 𝑠𝑗 in problem 𝑃,
• the target: (𝑊𝑃)𝑖,𝑗 – the element of the preference matrix learned for problem 𝑃 that corresponds to the symbol pair (𝑠𝑖, 𝑠𝑗).

We sample problem 𝑃 from the training problem set with uniform probability. Thanks to how 𝑊𝑃 is constructed (see Section 4.2), preference values close to 0 are associated with symbol pairs whose mutual order has little effect on the outcome of the proof search.

7 See the model LassoCV in the machine learning library scikit-learn [22].
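The assembly of these training samples can be sketched as follows (the symbol feature vectors and the learned preference matrix are made-up stand-ins for data extracted from an actual problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up per-problem data: one feature vector per symbol (6 features,
# as in Section 4.3.2) and a previously learned preference matrix W_P.
n = 4
fv = rng.normal(size=(n, 6))     # fv[i] = feature vector of symbol s_i
W_P = rng.normal(size=(n, n))

# One candidate sample per ordered symbol pair (i, j), i != j:
# input = [fv(s_i), fv(s_j)], target = W_P[i, j].
inputs, targets = [], []
for i in range(n):
    for j in range(n):
        if i != j:
            inputs.append(np.concatenate([fv[i], fv[j]]))
            targets.append(W_P[i, j])

X = np.stack(inputs)             # shape (n*(n-1), 12)
y = np.array(targets)

# Sampling weights proportional to |target| focus the regressor on
# pairs whose mutual order matters, as described in the text.
weights = np.abs(y)
print(X.shape)                   # (12, 12)
```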
To focus the training on the symbol pairs whose order does matter, we weight the samples by the absolute value of the target. More precisely, given a problem 𝑃, the probability of sampling the symbol pair 𝑖, 𝑗 is proportional to the absolute target value |(𝑊𝑃)𝑖,𝑗|. Experiments have shown that using sample weighting improves the performance of the resulting model (see Section 5.2). We denote the trained model as 𝑀 and its prediction of the preference value of the symbol pair 𝑠𝑖, 𝑠𝑗 as 𝑀([fv(𝑠𝑖), fv(𝑠𝑗)]).

4.4. Precedence construction

When presented with a new problem 𝑃 = (Σ, Cl), we propose a symbol precedence by taking the following steps:

1. Estimate a preference matrix Ŵ𝑃.
2. Construct a precedence 𝜋̂𝑃 that approximately minimizes cost_proxy(𝜋̂𝑃, Ŵ𝑃).

4.4.1. Preference matrix construction

To construct a preference matrix Ŵ𝑃 for a new problem 𝑃, we evaluate the general preference regressor on the feature vectors of all symbol pairs in 𝑃. More specifically, (Ŵ𝑃)𝑖,𝑗 = 𝑀([fv(𝑠𝑖), fv(𝑠𝑗)]) for all 𝑠𝑖, 𝑠𝑗 ∈ Σ. At this moment, one can use Ŵ𝑃 to estimate the cost of an arbitrary symbol precedence.

4.4.2. Precedence construction from preference matrix

The remaining task is, given a preference matrix Ŵ𝑃, to find a precedence 𝜋̂𝑃 that minimizes cost_proxy(𝜋̂𝑃, Ŵ𝑃). Since this task is NP-hard in general [20], we rely in this work on a greedy 2-approximation algorithm proposed by Cohen et al. [20]. The rest of this section provides a brief description of the algorithm.

The algorithm maintains a partially constructed symbol precedence p ∈ N* (a finite sequence over N; initially empty), a set of available symbols Σavail ⊆ Σ (initially the whole Σ), and a potential value 𝑐 : Σavail → R for each of the available symbols.
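A sketch of this greedy construction (the potential of a symbol, formalized next in the text, is taken to be its summed outgoing minus incoming preferences over the remaining symbols; the preference matrix here is made up):

```python
import numpy as np

def greedy_precedence(W):
    """Greedily order n symbols to (approximately) minimize the proxy cost
    under preference matrix W: repeatedly pick the available symbol with
    the smallest potential, where potential(i) is the sum of W[i, j] minus
    the sum of W[j, i] over the remaining symbols j."""
    n = W.shape[0]
    avail = list(range(n))
    p = []
    while avail:
        # naive O(n^2) recomputation of potentials each round, for clarity
        best = min(avail,
                   key=lambda i: sum(W[i, j] - W[j, i] for j in avail))
        avail.remove(best)
        p.append(best)
    return p

# Toy preference matrix: putting symbol 0 before symbol 1 is penalized
# (W[0, 1] = 2), putting 1 before 2 is mildly penalized, and so on.
W = np.array([[ 0.0,  2.0, 0.0],
              [-2.0,  0.0, 1.0],
              [ 0.0, -1.0, 0.0]])
print(greedy_precedence(W))      # [1, 0, 2]
```

Note that the result is only an approximation of the optimum; ties between equal potentials are broken arbitrarily here.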
The potential value of a symbol corresponds to the relative increase in proxy cost associated with selecting the symbol as the next to append to the partial precedence:

𝑐(𝑠𝑖) = Σ_{𝑠𝑗 ∈ Σavail} (Ŵ𝑃)𝑖,𝑗 − Σ_{𝑠𝑗 ∈ Σavail} (Ŵ𝑃)𝑗,𝑖

In each iteration, a symbol 𝑠𝑖 with the smallest potential is selected from Σavail. This symbol is removed from Σavail and its index 𝑖 is appended to the partial precedence p. The potentials of the remaining symbols in Σavail are updated. This process is repeated until all symbols have been selected, yielding the final p as 𝜋̂𝑃.

5. Evaluation

5.1. Setup

Since the simplification orderings under consideration (LPO and KBO) never use the symbol precedence to compare a predicate symbol with a function symbol, we can break down the symbol precedence into a predicate precedence and a function precedence. In this paper, we restrict our attention to predicate precedences, leaving function symbols to be ordered by the invfreq scheme. A more thorough evaluation of both predicate and function precedences and their interaction is left for future work.

We use problems from the TPTP library v7.2.0 [14] for the evaluation. Let 𝒫train be the set of all FOL and CNF problems in TPTP with at most 200 predicate symbols such that at least 1 out of 24 random predicate precedences leads to a successful proof search (|𝒫train| = 8217). Let 𝒫test be the set of all FOL and CNF problems in TPTP with at most 1024 predicate symbols (|𝒫test| = 15751). In each of 5 evaluation iterations (splits), we sample 1000 training problems from 𝒫train and 1000 test problems from 𝒫test uniformly, in a way that ensures that the two sets do not overlap. We repeat the evaluation 5 times to evaluate the stability of the training.
On each training problem we run Vampire with 100 uniformly random predicate precedences and a strategy fixed up to the predicate precedence.8 We limit the time to 10 seconds per execution, which is in our experience with Vampire sufficient to exhibit interesting behavior. Note that we use a customized version of Vampire to extract a symbol table from each of the problems.9

8 Time limit: 10 seconds, memory limit: 8192 MB, literal comparison mode: predicate, function symbol precedence: invfreq, saturation algorithm: discount, age-weight ratio: 1:10, AVATAR: disabled.
9 https://github.com/filipbartek/vampire/tree/926154f2

Table 1
Experiment results

Case                                    Successes out of 1000 per split
                                          0    1    2    3    4   Mean    Std
Best of 10 random                       514  499  509  511  476  501.8  13.85
invfreq                                 494  472  480  481  452  475.8  13.83
Elastic-Net                             484  471  479  470  454  471.6  10.21
Gradient Boosting                       473  462  475  475  439  464.8  13.78
Elastic-Net without sample weighting    475  454  465  459  453  461.2   8.11
Random                                  454  455  457  456  430  450.4  10.25

After we fit a preference matrix on each of the training problems (see Section 4.2), we create a batch of 10^6 symbol pair feature vectors with target values to train the general preference regressor (see Section 4.3). We evaluate the trained model by running Vampire on the test problem set with predicate precedences proposed by the trained model, counting the number of successfully solved problems.

A collection of scripts created for the experimental evaluation can be found in the Git repository at https://github.com/filipbartek/vampire-ml/tree/75c693f3. The measurements presented below can be performed by running the script map-reduce/paar2020/run.sh.

5.2. Experimental results

We trained two types of general preference regressors (see Section 4.3):

• Elastic-Net – a linear regression model with L1 and L2 norm regularization; see ElasticNetCV in Pedregosa et al.
[22]
• Gradient Boosting regressor – see GradientBoostingRegressor in Pedregosa et al. [22]

We compared the performance of the regressors with three baseline precedence generation schemes: a random precedence, the best of 10 random precedences, and the invfreq scheme. Table 1 shows the results of the evaluation on 1000 problems for 5 random choices of training and test problem sets (splits). The case "Elastic-Net without sample weighting" shows the effect of sampling the symbol pairs uniformly. Inspection of the trained feature coefficients reveals that the fitting ends up with an all-zero feature weight vector on splits 1 and 2, signifying a complete failure to learn on these training sets.

Using Elastic-Net for general preference prediction on average nearly matches the performance of Vampire with the invfreq precedence scheme. While Elastic-Net performs significantly better than a random precedence generator, it still performs significantly worse than a generator that, given a problem, tries 10 random precedences and chooses the best of these. This suggests that there is space for improvement, possibly with a more sophisticated, non-linear model. Plugging in a Gradient Boosting regressor does not show immediate improvement, so more elaborate feature extraction may be necessary.

Table 2
Elastic-Net feature coefficients after fitting on each of the 5 training sets of 1000 problems and on the whole 𝒫train. "Frequency" is the number of occurrences of the symbol in the problem. "Unit frequency" is the number of clauses in the problem that contain at least one occurrence of the symbol. Each row lists the non-zero coefficients in column reading order (left symbol: arity, frequency, unit frequency; then right symbol: arity, frequency, unit frequency); zero entries are omitted.

Training set    Non-zero coefficients
0               −0.01  −0.98  0.01
1               −0.48  0.08  0.44
2               −0.64  0.36
3               0.88  −0.03  0.01  −0.03  0.05
4               −0.62  0.30  0.07
𝒫train          −0.57  0.43

5.3.
Feature coefficients

Since Elastic-Net is a linear regression model, we can easily inspect the coefficients it assigns to the input features (see Section 4.3.2). In each of the five splits, the final coefficients of the three indicator features (namely the indicators of presence in a conjecture clause, presence in a unit clause and being introduced during clausification) are 0. Table 2 shows the fitted non-zero coefficients of the remaining features. The coefficients were scaled so that their absolute values sum up to 1. Note that scaling the coefficients by a constant does not affect the precedence constructed using the greedy algorithm presented in Section 4.4.2.

It is worth pointing out that the regressor fitted on the whole 𝒫train and on the training sets 1, 2 and 4 assigns a high preference value to symbol pairs (𝑠, 𝑡) such that 𝑡 has a higher frequency and unit frequency than 𝑠. Since unit frequency is positively correlated with frequency, minimizing cost_proxy using this fitted regressor is consistent with the invfreq precedence generating scheme (ordering the symbols by frequency in descending order). Similarly, the model fitted on training sets 0 and 3 corresponds to ordering the symbols by arity in ascending order.

6. Conclusion

This paper is, to the best of our knowledge, a first attempt to use machine learning for proposing symbol precedences for an ATP. This appears to be a potentially highly rewarding task, with access to an effectively unlimited amount of training data generated on demand. Nevertheless, the journey from evaluating the prover on random precedences to proposing a good precedence when presented with a new problem is not straightforward, and several conceptual gaps need to be bridged to connect these two tasks algorithmically.
In this paper, we proposed a connection using the concept of pairwise symbol preferences that, as we have shown, can be learned as the coefficients of a linear regression model for which an order matrix provides the features of a precedence understood as a permutation. In a second stage, in which symbols are abstracted by their features, the preferences themselves become regression targets.

In our initial experiments reported in this paper, the performance of our system does not yet reach that of the human-designed heuristic invfreq. We believe, however, that further improvements are possible by using a more advanced regression model for the second stage and/or by further hyper-parameter tuning (e.g. of the Gradient Boosting model). Ultimately, we expect to gain the most by using a richer set of symbol features, ideally automatically extracted from the problems using graph neural networks [23].

Acknowledgments

Supported by the ERC Consolidator grant AI4REASON no. 649043 under the EU-H2020 programme, the Czech Science Foundation project 20-06390Y and the Grant Agency of the Czech Technical University in Prague, grant no. SGS20/215/OHK3/3T/37.

References

[1] S. Schulz, S. Cruanes, P. Vukmirović, Faster, higher, stronger: E 2.3, in: P. Fontaine (Ed.), Automated Deduction – CADE 27, number 11716 in Lecture Notes in Computer Science, Springer, 2019, pp. 495–507. doi:10.1007/978-3-030-29436-6_29.
[2] C. Weidenbach, D. Dimova, A. Fietzke, R. Kumar, M. Suda, P. Wischnewski, SPASS version 3.5, in: R. A. Schmidt (Ed.), Automated Deduction – CADE-22, number 5663 in Lecture Notes in Computer Science, 2009, pp. 140–145. doi:10.1007/978-3-642-02959-2_10.
[3] L. Kovács, A. Voronkov, First-order theorem proving and Vampire, in: N. Sharygina, H. Veith (Eds.), Computer Aided Verification, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 1–35. doi:10.1007/978-3-642-39799-8_1.
[4] R. Nieuwenhuis, A.
Rubio, Paramodulation-based theorem proving, in: J. A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning (in 2 volumes), Elsevier and MIT Press, 2001, pp. 371–443. doi:10.1016/b978-044450813-3/50009-6. [5] G. Robinson, L. Wos, Paramodulation and theorem-proving in first-order theories with equality, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 298–313. doi:10.1007/978-3-642-81955-1_19. [6] D. E. Knuth, P. B. Bendix, Simple word problems in universal algebras, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 342–376. doi:10. 1007/978-3-642-81955-1_23. [7] S. N. Kamin, J. Lévy, Two generalizations of the recursive path ordering, 1980. URL: http: //www.cs.tau.ac.il/~nachumd/term/kamin-levy80spo.pdf, unpublished letter to Nachum Dershowitz. [8] L. Kovács, G. Moser, A. Voronkov, On transfinite Knuth-Bendix orders, in: N. Bjørner, V. Sofronie-Stokkermans (Eds.), Automated Deduction – CADE-23, number 6803 in Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 384– 399. doi:10.1007/978-3-642-22438-6_29. [9] L. Bachmair, H. Ganzinger, Rewrite-based equational theorem proving with selection 32 Filip Bártek et al. CEUR Workshop Proceedings 21–33 and simplification, Journal of Logic and Computation 4 (1994) 217–247. doi:10.1093/ logcom/4.3.217. [10] G. S. Tseitin, On the complexity of derivation in propositional calculus, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967–1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 466–483. doi:10. 1007/978-3-642-81955-1_28. [11] G. Reger, M. Suda, A. Voronkov, New techniques in clausal form generation, in: C. Benzmüller, G. Sutcliffe, R. 
Rojas (Eds.), GCAI 2016. 2nd Global Conference on Ar- tificial Intelligence, volume 41 of EPiC Series in Computing, EasyChair, 2016, pp. 11–23. URL: https://easychair.org/publications/paper/XncX. doi:10.29007/dzfz. [12] S. Schulz, E 2.4 User Manual, 2019. URL: http://wwwlehre.dhbw-stuttgart.de/~sschulz/ WORK/E_DOWNLOAD/V_2.4/eprover.pdf. [13] G. Reger, M. Suda, Measuring progress to predict success: Can a good proof strat- egy be evolved?, in: AITP 2017, 2017, pp. 20–21. URL: http://aitp-conference.org/2017/ aitp17-proceedings.pdf. [14] G. Sutcliffe, The TPTP problem library and associated infrastructure. From CNF to TH0, TPTP v6.4.0, Journal of Automated Reasoning 59 (2017) 483–502. doi:10.1007/ s10817-017-9407-7. [15] A. Nonnengart, C. Weidenbach, Computing small clause normal forms, in: J. A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning (in 2 volumes), Elsevier and MIT Press, 2001, pp. 335–367. URL: https://doi.org/10.1016/b978-044450813-3/50008-4. doi:10.1016/b978-044450813-3/50008-4. [16] A. Riazanov, A. Voronkov, Limited resource strategy in resolution theorem proving, Journal of Symbolic Computation 36 (2003) 101–115. doi:10.1016/S0747-7171(03)00040-3. [17] K. E. Iverson, A Programming Language, John Wiley & Sons, Inc., New York, NY, USA, 1962. URL: https://dl.acm.org/doi/book/10.5555/1098666. [18] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer New York Inc., New York, NY, USA, 2009. doi:10.1007/b94608. [19] C. Kaliszyk, J. Urban, MizAR 40 for Mizar 40, Journal of Automated Reasoning 55 (2015) 245–256. doi:10.1007/s10817-015-9330-8. [20] W. W. Cohen, R. E. Schapire, Y. Singer, Learning to order things, Journal Of Artificial Intelligence Research 10 (1999) 243–270. doi:10.1613/jair.587. arXiv:1105.5464. [21] R. Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58 (1996) 267–288. 
URL: http://www.jstor.org/ stable/2346178. [22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: https://scikit-learn.org/. [23] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems (2020) 1–21. doi:10.1109/tnnls.2020.2978386. 33