
Seventh Workshop on Practical Aspects of Automated Reasoning, June

Learning Precedences from Simple Symbol Features

Filip Bártek (0, 1) and Martin Suda (0)

0 Czech Technical University in Prague, Czech Institute of Informatics, Robotics and Cybernetics, Jugoslávských partyzánů 1580/3, 160 00 Praha 6 - Dejvice, Czech Republic

1 Czech Technical University in Prague, Faculty of Electrical Engineering, Technická 2, 166 27 Praha 6 - Dejvice, Czech Republic

2020

2 9 30

A simplification ordering, typically specified by a symbol precedence, is one of the key parameters of the superposition calculus, contributing to shaping the search space navigated by a saturation-based automated theorem prover. Thus the choice of a precedence can have a great impact on the prover's performance. In this work, we design a system for proposing symbol precedences that should lead to solving a problem quickly. The system relies on machine learning to extract this information from past successful and unsuccessful runs of a theorem prover over a set of problems and randomly sampled precedences. It uses a small set of simple human-engineered symbol features as the sole basis for discriminating the symbols. This allows for a direct comparison with precedence generation schemes designed by prover developers.

Keywords: saturation-based theorem proving, simplification orderings, symbol precedences, machine learning

1. Introduction

Modern saturation-based automated theorem provers (ATPs) such as E [1], SPASS [2] or Vampire [3] use the superposition calculus [4] as their underlying inference system. Superposition is built around the paramodulation inference [5], crucially constrained by a simplification ordering on terms and literals, which is supplied as a parameter of the calculus. Both of the two main classes of simplification orderings used in practice, i.e., the Knuth-Bendix Ordering [6] and the Lexicographic Path Ordering [7], are mainly determined by a symbol precedence, a (partial) ordering on the signature symbols.1

While the superposition calculus is known [9] to be refutationally complete for any simplification ordering, the choice of the precedence may have a significant impact on how long it takes to solve a given problem. In a well-known example, prioritizing in the precedence the predicates introduced during the Tseitin transformation of an input formula [10] exposes the corresponding literals to resolution inferences during the early stages of the proof search. This essentially undoes the transformation and thus risks the exponential blow-up that the transformation is designed to prevent [11]. ATPs typically offer a few heuristic schemes for generating symbol precedences. For example, the successful invfreq scheme in E [12] orders the symbols by the number of occurrences in the input problem, prioritizing symbols that occur the least often for early inferences. Experiments with random precedences have shown that the existing schemes often fail to come close to the optimum precedence [13], revealing a large potential for further improvement.

In this work, we design a system that, when presented with a First-Order Logic (FOL) problem, proposes a symbol precedence that will likely lead to solving the problem quickly. The system relies on the techniques of supervised machine learning and extracts such theorem-proving knowledge from successful (and unsuccessful) runs of the Vampire theorem prover [3] when run over a variety of FOL problems equipped with randomly sampled symbol precedences. We assume that by learning to solve already solvable problems quickly, the acquired knowledge will generalize and help solve problems previously out of reach. As a first step in a more ambitious project, we focus here on representing the symbols in a problem by a fixed set of simple human-engineered features (such as the number of occurrences used by the invfreq scheme mentioned above)2 and, to simplify the experimental setup, we restrict our attention to learning precedences for predicate symbols only.3

Learning to predict good precedences poses several interesting challenges that we address in this work. First, it is not immediately clear how to characterize a precedence, a permutation of a finite set of symbols, by a real-valued feature vector to serve as an input for a learning algorithm. Additionally, to be able to generalize across problems we need to do it in a way which does not presuppose a fixed signature size. There is also a complication, when sampling different problems, that some problems may be easy to solve for almost every precedence and others hard. In theorem proving, running times typically vary considerably. Finally, even with a regression model ready to predict the prover's performance under a particular precedence $\pi$, we still need to solve the task of finding an optimum precedence $\pi^*$ according to this model, which cannot be solved simply by enumerating all the permutations and running the prediction for each, due to their huge number.

Our way of addressing the above sketched challenges lies in using pairwise symbol preferences to characterize a precedence, normalizing the target prover run times on a per-problem basis, and in the use of "second-order" learning of the preferences for symbols abstracted by their features. These concepts are introduced in Section 3 and later formalized in Section 4. Section 5 presents the results of our experimental evaluation of the proposed technique over the TPTP [14] benchmark. We start our exposition by fixing the notation and basic concepts in Section 2.

2. Preliminaries

We assume that the reader is familiar with basic concepts used in first-order logic (FOL) theorem proving. We use this section to recall and formalize the key notions relevant for our work.

Problem. A (first-order) problem is a pair $P = (\Sigma, \mathit{Cl})$, where $\Sigma = (s_1, s_2, \ldots, s_n)$ is a list of (predicate and function) symbols called the signature, and $\mathit{Cl}$ is a set of first-order clauses built over the symbols of $\Sigma$.

2 Automatic feature extraction using neural networks is planned for future work.

3 Our theoretical considerations, however, apply equally to learning function symbol precedences.

The problem is either given directly by the user or could be the result of clausifying a general FOL formula, in which case we know which of the symbols were introduced during the clausification (namely during Tseitin transformation and skolemization; see e.g. Nonnengart and Weidenbach [15]) and which occurred in the conjecture (if it was present).

Precedence. Given a problem $P = (\Sigma, \mathit{Cl})$ with $\Sigma = (s_1, s_2, \ldots, s_n)$, a precedence $\pi$ is a permutation, i.e. a bijective mapping, of the set of indices $\{1, \ldots, n\}$. A precedence $\pi$ determines a (total) ordering on $\Sigma$ as follows: $s_{\pi(1)} < s_{\pi(2)} < \ldots < s_{\pi(n)}$.

Simplification orderings are orderings on terms used to parameterize the superposition calculus [4] employed by modern saturation-based theorem provers. The two classes of simplification orderings most commonly used in practice, the Knuth-Bendix Orderings [6] and the Lexicographic Path Orderings [7], are both defined in terms of a user-supplied (possibly partial) ordering $<$ on the given problem's signature $\Sigma$. In this work, we assume that the theorem prover uses a simplification ordering from one of these two classes, relying on the ordering on $\Sigma$ determined by a precedence $\pi$ to construct such a simplification ordering.

Performance measure. A saturation-based ATP solves a problem $P = (\Sigma, \mathit{Cl})$ (under a particular fixed strategy and a determined symbol precedence $\pi$) by either
• deriving from $\mathit{Cl}$ a contradiction in the form of the empty clause, in which case $P$ is shown unsatisfiable, or
• finitely saturating the set of clauses $\mathit{Cl}$ without deriving the contradiction, in which case $P$ is shown satisfiable.4

In both cases, we take the number of iterations of the employed saturation algorithm (see, e.g., Riazanov and Voronkov [16] for an overview) as a measure of the effort that the ATP took to solve the problem. We refer to this measure as the abstract solving time and denote it $\mathrm{ast}(P, \pi)$.5

In practice, an ATP can also run out of resources, typically out of the allocated time. In that case, the abstract solving time is undefined: $\mathrm{ast}(P, \pi) = \bot$. While it may happen that running an ATP with the same problem and symbol precedence twice yields a different result each time (namely succeeding one time and failing another time), such cases are rare, and we ensure they do not interfere with the learning process by caching the results.

Order matrix. Given a permutation $\pi$ of the set of indices $\{1, \ldots, n\}$, the order matrix $O(\pi)$ is a binary matrix of size $n \times n$ defined in the following manner:

$$O(\pi)_{i,j} = [\![\pi^{-1}(i) < \pi^{-1}(j)]\!],$$

where we use $[\![A]\!]$ to denote the Iverson bracket [17] applied to a proposition $A$, evaluating to 1 if $A$ is true, and 0 otherwise. In other words, for a symbol precedence $\pi$, $O(\pi)_{i,j} = 1$ if the precedence $\pi$ orders the symbol $s_i$ before the symbol $s_j$, and $O(\pi)_{i,j} = 0$ otherwise.

4 We assume a refutationally complete calculus and saturation strategy.

5 The advantage of using abstract solving time is that it does not depend on the hardware used for the computation.

Flattened matrix. Given a matrix $M$ of size $n \times n$, $\vec{M}$ is the vector of length $n^2$ obtained by flattening $M$:

$$\vec{M}_{(i-1)n+j} = M_{i,j} \quad \text{for every } i, j \in \{1, \ldots, n\}.$$

For our use the exact way of mapping the matrix elements to the vector indices is not important. We mostly just need a vector representation of the data contained in the given matrix to have access to the dot product operation.
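The order matrix and its flattening can be illustrated with a minimal Python sketch. It assumes 0-based symbol indices and represents a precedence as a list whose first element is the smallest symbol; the function names are chosen here for illustration only.

```python
from typing import List


def order_matrix(pi: List[int]) -> List[List[int]]:
    """Return O(pi): entry [i][j] is 1 iff symbol i precedes symbol j in pi."""
    n = len(pi)
    # position[i] plays the role of pi^{-1}(i): the rank of symbol i.
    position = [0] * n
    for rank, symbol in enumerate(pi):
        position[symbol] = rank
    return [
        [1 if position[i] < position[j] else 0 for j in range(n)]
        for i in range(n)
    ]


def flatten(matrix: List[List[float]]) -> List[float]:
    """Row-major flattening: element [i][j] lands at index i * n + j."""
    return [value for row in matrix for value in row]


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


# Precedence s_2 < s_0 < s_1 over three symbols.
example = order_matrix([2, 0, 1])
# example == [[0, 1, 0], [0, 0, 0], [1, 1, 0]]
```

Each unordered pair of distinct symbols contributes exactly one 1 to the matrix, so the flattened vector always sums to n(n-1)/2.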

Linear regression is an approach to modeling the relationship between scalar target values $y_i \in \mathbb{R}$ and one or more input variables $\mathbf{x}_i = (x_i^1, \ldots, x_i^k)$, $i = 1, \ldots, m$. The relationship is modeled using a linear predictor function:

$$\hat{y}_i = \mathbf{x}_i \cdot \mathbf{w} + b,$$

whose unknown model parameters $\mathbf{w} \in \mathbb{R}^k$ and $b \in \mathbb{R}$ are estimated from the data. We call the vector $\mathbf{w}$ the coefficients of the model and $b$ the intercept. Most commonly, the parameters are picked to minimize the so-called mean squared error:

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2,$$

but other norms are also possible [18].

Basic assumptions. For the discussion that follows, we assume a fixed ATP that uses the superposition calculus with a simplification ordering parameterized by a symbol precedence. While the practical experiments described in Section 5 use the ATP Vampire [3], the model architecture does not assume a particular ATP and is compatible with any superposition-based ATP such as E [1], SPASS [2] or Vampire. Within the prover, a particular saturation strategy is fixed, including a time limit.

3. General considerations and overview

The aim of this work is to design a system that learns to suggest good symbol precedences to an ATP from observations of the ATP's performance on a class $\mathcal{P}_{\mathrm{train}}$ of problems with randomly sampled precedences. Given a problem $P = (\Sigma, \mathit{Cl})$ with $|\Sigma| = n$, we consider a precedence $\pi$ good if it leads to a low $\mathrm{ast}(P, \pi)$ among the $n!$ possible precedences for $P$. Note that for problems with a signature with more than a few symbols, repeatedly running the prover with random precedences represents an effectively infinite source of training data.

Ideally, we would like to learn general theorem proving knowledge, not too dependent on $\mathcal{P}_{\mathrm{train}}$, which could be later explained and compared to precedence generation schemes manually designed by the prover developers. Let us quickly recall one such scheme, already mentioned in the introduction, called invfreq in E [1]. The prover's manual [12] explains:

Sort symbols by frequency (frequently occurring symbols are smaller).
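As a minimal sketch of the quoted scheme, the following Python function sorts symbol indices by descending frequency, so that frequently occurring symbols come first (are smaller) in the precedence. The function name, the 0-based indexing and the tie-breaking by symbol index are assumptions made for illustration and are not taken from E.

```python
from typing import List


def invfreq_precedence(frequencies: List[int]) -> List[int]:
    """Order symbol indices from smallest to largest in the precedence.

    frequencies[i] is the number of occurrences of symbol i in the problem.
    Frequent symbols are smaller, hence the descending sort. Python's sort
    is stable, so ties keep their original index order (an assumption).
    """
    return sorted(range(len(frequencies)), key=lambda i: -frequencies[i])


# Symbol 1 occurs most often, so it becomes the smallest symbol.
precedence = invfreq_precedence([3, 10, 1])
# precedence == [1, 0, 2]
```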

What is common to basically all manually designed schemes is that they pick a certain scalar property of symbols (here it is the symbol frequency, i.e. the number of occurrences of the symbol in the given problem) and obtain a precedence by sorting the symbols using that property.

Decomposing. We might want our system to also learn a certain property of symbols and use sorting to generate and suggest a precedence. However, it is not clear how to "extract" such a property from the observed data, since we only have access to the target values for full precedences. Our idea for "decomposing" these values into pieces that somehow relate to individual symbols (and can thus be "transferred" across problems) is to take a detour using symbol pairs: we assume that the performance of the ATP on a given problem $P$, i.e. our measure $\mathrm{ast}(P, \pi)$, can be predicted from a sum of individual contributions corresponding to facts of the form

$\pi$ orders the symbol $s_i$ before the symbol $s_j$.

This is in line with how a prover developer could reason about a precedence generating scheme: Even when it is not clear how good or bad a symbol is in absolute terms, one might have an intuition that a symbol from a certain class should preferably come before a symbol from another class in the precedence (e.g., symbols introduced during clausification should typically be smaller than others) and assign some weight to this piece of intuition.

In Section 4.2 we formalize this idea using the notion of a preference matrix and show how, for each problem in isolation, such a preference matrix can be obtained in the form of coefficients learned by linear regression.

Learning across problems. Symbol preferences learned on a particular problem are inherently tied to that problem and do not immediately carry over to other problems. The main reason for this is that symbols themselves only appear in the context of a particular problem.6 That is why we resort to representing symbols by their features (cf. Section 4.3.2) when aggregating the learned preferences across different problems. This is explained in more detail further below and, formally, in Section 4.3.

We also strive to ensure that the preference values have, as far as possible, the same magnitude across problems. Note that $\mathrm{ast}(P, \pi)$ may vary a lot for a fixed problem $P$, but all the more so across problems. To obtain commensurable values, we normalize (see Section 4.1) the prover performance data on a per-problem basis before learning the preferences. Normalization also deals with supplying a concrete value to those runs which did not finish, i.e. have $\mathrm{ast}(P, \pi) = \bot$.

6 On certain benchmarks, such as those coming from translations of mathematical libraries [19], symbols maintain identity and meaning across individual problems. However, since our goal in this work is to learn general theorem proving knowledge, we do not use the assumption of aligned signatures.

"Second-order" regression. Once the symbols are abstracted by their feature vectors, we can collect symbol preferences from all the tested problems and turn this collection into another regression task. Note that at this moment, the preferences, which were obtained as the coefficients learned by linear regression, themselves become the regression target. Thus, in a certain sense, we now do second-order learning. It should be stressed, though, that while the learning of the preferences requires a linear regression model by design, this second-order regression does not need to be linear and more sophisticated models can be experimented with.

The details of this step are given in Section 4.3.

Preference prediction and optimization. Once the second-order model has been learned, we can predict preferences for any pair of symbols based on their feature vectors and thus also predict, given a problem $P$, how many steps a prover will require to solve it using a particular precedence $\pi$. (For this second step, we reverse the idea of decomposition: we sum up those predicted preferences that correspond to pairs of symbols $s_i, s_j$ such that $\pi$ orders the symbol $s_i$ before the symbol $s_j$; see Section 4.4 for details.)

Having access to an estimate of performance for each precedence $\pi$, the final step is to look for a precedence $\pi^*$ that ideally minimizes the predicted performance measure over all the $n!$ possible precedences on $P$'s signature. Since finding the true optimum could be computationally hard, we resort to using an approximation algorithm by Cohen et al. [20].

The algorithm is recalled in Section 4.4.2.

4. Architecture

4.1. Values of precedences

We define the base cost value $\mathrm{cost}_{\mathrm{base}}(\pi)$ of precedence $\pi$ on problem $P$ according to the outcome of the proof search configured to use this precedence:
• If the proof search terminates successfully, $\mathrm{cost}_{\mathrm{base}}(\pi)$ is the number of iterations of the saturation loop started during the proof search: $\mathrm{cost}_{\mathrm{base}}(\pi) = \mathrm{ast}(P, \pi)$.
• If the proof search fails (meaning that $\mathrm{ast}(P, \pi) = \bot$), then $\mathrm{cost}_{\mathrm{base}}(\pi)$ is the maximum number of saturation loop iterations encountered in successful training proof searches on this problem: $\mathrm{cost}_{\mathrm{base}}(\pi) = \max_{\pi' \in \Pi^+} \mathrm{ast}(P, \pi')$, where $\Pi^+$ is the set of all training precedences on problem $P$ that yield a successful proof search.
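The base cost above, together with the logarithmic scaling and per-problem standardization that this subsection goes on to describe, can be sketched in Python as follows. The sketch makes several assumptions that the text does not fix: failed runs are encoded as `None`, every successful run takes at least one iteration (so the logarithm is defined), and the population standard deviation is used.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional


def standardized_costs(ast_values: List[Optional[int]]) -> List[float]:
    """Map abstract solving times of one problem to standardized costs.

    ast_values[k] is the abstract solving time under the k-th sampled
    precedence, or None if that proof search failed.
    """
    solved = [a for a in ast_values if a is not None]
    if not solved:
        raise ValueError("at least one successful proof search is required")
    worst = max(solved)
    # Base cost: failed runs receive the maximum over successful runs.
    base = [a if a is not None else worst for a in ast_values]
    # Logarithmic scaling (assumes every base cost is >= 1).
    logs = [math.log(b) for b in base]
    mu, sigma = mean(logs), pstdev(logs)
    if sigma == 0.0:
        # All runs cost the same, so the precedence does not matter here.
        return [0.0] * len(logs)
    # Standardization: mean 0 and standard deviation 1 within the problem.
    return [(x - mu) / sigma for x in logs]


costs = standardized_costs([10, 100, None, 1000])
```

In this example the failed third run receives the same cost as the slowest successful run, which is the behavior the second bullet describes.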

We further normalize the cost values by the following operations:
1. Logarithmic scaling: For each solvable problem, running proof search with uniformly random predicate precedences reveals a distribution of abstract solving times on successful executions. Examining these distributions for various problems suggests that they are usually approximately log-normal. To make further scaling by standardization reasonable, we first transform the base costs by taking their logarithm.
2. Standardization: Independently for each problem, we apply an affine transformation so that the resulting cost values have the mean 0 and standard deviation 1. This ensures that the values are comparable across problems.

Let $\mathrm{cost}_{\mathrm{std}}(\pi)$ denote the resulting cost value of precedence $\pi$ after the scaling and standardization.

4.2. Problem preference matrix learning

Given a problem $P$ with $n$ symbols, a preference matrix $W$ is any matrix over $\mathbb{R}$ of size $n \times n$. We define the proxy cost of precedence $\pi$ under preference $W$ to be the sum of the preference values $W_{i,j}$ of all symbol pairs $s_i, s_j$ ordered by $\pi$ such that $s_i$ comes before $s_j$:

$$\mathrm{cost}_{\mathrm{proxy}}(\pi, W) = \sum_{i,j} [\![\pi^{-1}(i) < \pi^{-1}(j)]\!] \, W_{i,j} = \overrightarrow{O(\pi)} \cdot \vec{W},$$

where $\overrightarrow{O(\pi)} \cdot \vec{W}$ is the dot product of the flattened matrices $O(\pi)$ and $W$.
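The proxy cost can be computed without materializing the order matrix. The Python sketch below assumes 0-based symbol indices and a precedence given as a list whose first element is the smallest symbol; the function name is illustrative.

```python
from typing import List


def proxy_cost(pi: List[int], W: List[List[float]]) -> float:
    """Sum W[i][j] over all pairs where symbol i precedes symbol j in pi."""
    total = 0.0
    for a in range(len(pi)):
        for b in range(a + 1, len(pi)):
            # pi[a] comes before pi[b], so this pair contributes.
            total += W[pi[a]][pi[b]]
    return total


W = [[0.0, 1.0, 2.0],
     [3.0, 0.0, 4.0],
     [5.0, 6.0, 0.0]]
# Identity precedence: W[0][1] + W[0][2] + W[1][2] = 7.0
cost_identity = proxy_cost([0, 1, 2], W)
# Precedence s_2 < s_0 < s_1: W[2][0] + W[2][1] + W[0][1] = 12.0
cost_other = proxy_cost([2, 0, 1], W)
```

The diagonal of the preference matrix never contributes, since no symbol precedes itself.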

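For any given problem $P$ we can uniformly sample precedences $\pi_1, \ldots, \pi_m$ to form the training set $T = \{(\pi_1, \mathrm{cost}_{\mathrm{std}}(\pi_1)), (\pi_2, \mathrm{cost}_{\mathrm{std}}(\pi_2)), \ldots, (\pi_m, \mathrm{cost}_{\mathrm{std}}(\pi_m))\}$. Having such a training set allows us to find a vector $\vec{W}$ that minimizes the mean squared error

$$\frac{1}{m} \sum_{(\pi, \mathrm{cost}_{\mathrm{std}}(\pi)) \in T} \left(\mathrm{cost}_{\mathrm{proxy}}(\pi, W) - \mathrm{cost}_{\mathrm{std}}(\pi)\right)^2$$

by linear regression.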

Minimizing the mean squared error directly may lead to overfitting to the training set, especially in problems whose signature is relatively large in comparison to the size of the training set. To improve generalization, we use the Lasso regression algorithm [21] instead of standard linear regression. We use cross-validation to set the value of the regularization hyperparameter.7

Another reason to use the Lasso algorithm is that it performs regularization by imposing a penalty on coefficients with large absolute value, effectively shrinking the coefficients that correspond to symbol pairs whose mutual order does not affect $\mathrm{cost}_{\mathrm{std}}(\pi)$. We can use this property to interpret the absolute value of a preference value as a measure of the importance of a given symbol pair.
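To illustrate how Lasso regression shrinks the preference values of irrelevant symbol pairs, here is a minimal pure-Python coordinate-descent sketch. It is a stand-in for, not a reproduction of, the LassoCV model: the regularization strength `alpha` is fixed by hand rather than cross-validated, the objective follows the common convention (1/(2m)) * squared error + alpha * L1 norm, and all names are illustrative.

```python
from typing import List, Tuple


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value towards zero by threshold, clamping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X: List[List[float]], y: List[float], alpha: float,
              sweeps: int = 200) -> Tuple[List[float], float]:
    """Fit coefficients w and intercept b by cyclic coordinate descent.

    Each row of X is a flattened order matrix of one sampled precedence
    and y holds the corresponding standardized costs.
    """
    m, k = len(X), len(X[0])
    w = [0.0] * k
    b = sum(y) / m
    for _ in range(sweeps):
        for j in range(k):
            z = sum(row[j] ** 2 for row in X) / m
            if z == 0.0:
                continue  # constant-zero feature (e.g. matrix diagonal)
            # Correlation of feature j with the residual that ignores w[j].
            rho = sum(
                X[i][j] * (y[i] - b - sum(X[i][t] * w[t]
                                          for t in range(k) if t != j))
                for i in range(m)
            ) / m
            w[j] = soft_threshold(rho, alpha) / z
        # Unpenalized intercept: mean of the current residuals.
        b = sum(y[i] - sum(X[i][t] * w[t] for t in range(k))
                for i in range(m)) / m
    return w, b


# Two symbols: precedence (0, 1) is cheap, precedence (1, 0) is expensive.
X = [[0, 1, 0, 0], [0, 0, 1, 0]]
y = [-1.0, 1.0]
w, b = lasso_fit(X, y, alpha=0.01)
```

In the toy example the fitted preference for "symbol 0 before symbol 1" is negative and the preference for the opposite order is positive, while the diagonal entries stay at exactly zero.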

In the following sections we assume that the preference matrix $W$ we find by Lasso regression yields $\mathrm{cost}_{\mathrm{proxy}}$ that approximates $\mathrm{cost}_{\mathrm{std}}$ well.

4.3. General preference matrix learning

We proceed to cast the task of finding a good preference matrix for an arbitrary problem as a regression on feature vector representations of symbol pairs. To accomplish this we need to be able to represent each pair of symbols by a feature vector and to know target preference values for pairs of symbols in a training problem set.

4.3.1. Target preference values

For each problem $P$ in the training problem set, we find a problem preference matrix $W$ by the method outlined in Section 4.2. The target value of an arbitrary pair of symbols $s_i, s_j$ in $P$ is $W_{i,j}$.

7See the model LassoCV in the machine learning library scikit-learn [ 22 ].

4.3.2. Symbol pair embedding

We represent each symbol by a numeric feature vector that consists of the following components: symbol arity, the number of symbol occurrences in the problem, the number of clauses in the problem that contain at least one occurrence of the symbol, an indicator of occurrence in a conjecture clause, an indicator of occurrence in a unit clause, and an indicator of being introduced during clausification. This choice of symbol features is motivated by the fact that they are readily available in Vampire and that they suffice as a basis for common precedence generation schemes, such as the invfreq scheme. We denote the feature vector corresponding to symbol $s$ as $\mathrm{fv}(s)$.

We represent a pair of symbols $s_i, s_j$ by the concatenation of their feature vectors $[\mathrm{fv}(s_i), \mathrm{fv}(s_j)]$.
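The symbol pair embedding can be sketched in Python as follows. The class and field names are hypothetical and chosen for readability; only the six features themselves and the concatenation come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SymbolFeatures:
    """The six simple symbol features (field names are illustrative)."""
    arity: int
    occurrences: int          # number of occurrences in the problem
    clauses_with_symbol: int  # clauses containing the symbol at least once
    in_conjecture: bool       # occurs in a conjecture clause
    in_unit_clause: bool      # occurs in a unit clause
    introduced: bool          # introduced during clausification


def fv(s: SymbolFeatures) -> List[float]:
    """Numeric feature vector of a single symbol."""
    return [
        float(s.arity),
        float(s.occurrences),
        float(s.clauses_with_symbol),
        float(s.in_conjecture),
        float(s.in_unit_clause),
        float(s.introduced),
    ]


def pair_embedding(s: SymbolFeatures, t: SymbolFeatures) -> List[float]:
    """Embedding of the ordered pair (s, t): concatenation [fv(s), fv(t)]."""
    return fv(s) + fv(t)


p = SymbolFeatures(2, 7, 5, True, False, False)
q = SymbolFeatures(0, 1, 1, False, True, True)
embedding = pair_embedding(p, q)  # a vector of length 12
```

The embedding is order-sensitive: `pair_embedding(p, q)` and `pair_embedding(q, p)` differ, which matches the fact that the preference matrix need not be symmetric.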

4.3.3. Training data

The general preference regressor is trained on samples of the following structure:
• the input: $[\mathrm{fv}(s_i), \mathrm{fv}(s_j)]$, the embedding of a symbol pair $s_i, s_j$ in problem $P$,
• the target: $W_{i,j}$, an element of the preference matrix $W$ we learned for problem $P$ corresponding to the symbol pair $(s_i, s_j)$.

We sample problem $P$ from the training problem set with uniform probability.

Thanks to how $W$ is constructed (see Section 4.2), preference values close to 0 are associated with symbol pairs whose mutual order has little effect on the outcome of the proof search. To focus the training on the symbol pairs whose order does matter, we weight the samples by the absolute value of the target. More precisely, given a problem $P$, the probability of sampling the symbol pair $s_i, s_j$ is proportional to the absolute target value $|W_{i,j}|$. Experiments have shown that using sample weighting improves the performance of the resulting model (see Section 5.2).
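The weighted sampling of symbol pairs within one problem can be sketched with Python's standard `random.choices`. The function name and the choice to skip diagonal pairs are assumptions made for illustration.

```python
import random
from typing import List, Tuple


def sample_pairs(W: List[List[float]], count: int,
                 rng: random.Random) -> List[Tuple[int, int]]:
    """Draw symbol pairs (i, j) with probability proportional to |W[i][j]|.

    Pairs of a symbol with itself are skipped (an assumption), and pairs
    with a zero preference value are never drawn.
    """
    n = len(W)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    weights = [abs(W[i][j]) for i, j in pairs]
    if sum(weights) == 0:
        return []  # no pair matters for this problem
    return rng.choices(pairs, weights=weights, k=count)


W = [[0.0, 0.0, 2.0],
     [0.0, 0.0, 0.0],
     [-1.0, 0.0, 0.0]]
samples = sample_pairs(W, 50, random.Random(0))
```

With this matrix only the pairs (0, 2) and (2, 0) carry nonzero weight, so every drawn sample is one of these two, and (0, 2) is drawn about twice as often in expectation.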

We denote the trained model as $M$ and its prediction of the preference value of the symbol pair $s_i, s_j$ as $M([\mathrm{fv}(s_i), \mathrm{fv}(s_j)])$.

4.4. Precedence construction

When presented with a new problem $P = (\Sigma, \mathit{Cl})$, we propose a symbol precedence by taking the following steps:
1. Estimate a preference matrix $\widehat{W}$.
2. Construct a precedence $\hat{\pi}$ that approximately minimizes $\mathrm{cost}_{\mathrm{proxy}}(\hat{\pi}, \widehat{W})$.

4.4.1. Preference matrix construction

To construct a preference matrix $\widehat{W}$ for a new problem $P$, we evaluate the general preference regressor $M$ on the feature vectors of all symbol pairs in $P$. More specifically,

$$\widehat{W}_{i,j} = M([\mathrm{fv}(s_i), \mathrm{fv}(s_j)]) \quad \text{for all } s_i, s_j \in \Sigma.$$

At this moment, one can use $\widehat{W}$ to estimate the cost of an arbitrary symbol precedence.

4.4.2. Precedence construction from preference matrix

The remaining task is, given a preference matrix $\widehat{W}$, to find a precedence $\hat{\pi}$ that minimizes $\mathrm{cost}_{\mathrm{proxy}}(\hat{\pi}, \widehat{W})$. Since this task is NP-hard in general [20], we rely in this work on a greedy 2-approximation algorithm proposed by Cohen et al. [20]. The rest of this section provides a brief description of the algorithm.

The algorithm maintains a partially constructed symbol precedence $\mathbf{p} \in \mathbb{N}^*$ (a finite sequence over $\mathbb{N}$; initially empty), a set of available symbols $\Sigma_{\mathrm{avail}} \subseteq \Sigma$ (initially the whole $\Sigma$) and a potential value for each of the available symbols, $c : \Sigma_{\mathrm{avail}} \to \mathbb{R}$. The potential value of a symbol corresponds to the relative increase in proxy cost associated with selecting the symbol as the next to append to the partial precedence:

$$c(s_i) = \sum_{s_j \in \Sigma_{\mathrm{avail}}} \widehat{W}_{i,j} - \sum_{s_j \in \Sigma_{\mathrm{avail}}} \widehat{W}_{j,i}.$$

In each iteration, a symbol with the smallest potential is selected from $\Sigma_{\mathrm{avail}}$. This symbol is removed from $\Sigma_{\mathrm{avail}}$ and its index is appended to the partial precedence $\mathbf{p}$. The potentials of the remaining symbols in $\Sigma_{\mathrm{avail}}$ are updated. This process is repeated until all symbols have been selected, yielding the final $\mathbf{p}$ as $\hat{\pi}$.
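The greedy procedure described above can be sketched in Python as follows. The sketch assumes 0-based symbol indices and breaks ties between equal potentials by the smaller index; the tie-breaking rule and the function name are assumptions, not part of the description.

```python
from typing import Dict, List


def greedy_precedence(W: List[List[float]]) -> List[int]:
    """Build a precedence that approximately minimizes the proxy cost."""
    n = len(W)
    available = set(range(n))
    # Potential of i: cost incurred by placing i before every other
    # available symbol minus the cost avoided by doing so.
    potential: Dict[int, float] = {
        i: sum(W[i][j] - W[j][i] for j in available if j != i)
        for i in available
    }
    precedence: List[int] = []
    while available:
        # Select the available symbol with the smallest potential.
        best = min(available, key=lambda i: (potential[i], i))
        available.remove(best)
        precedence.append(best)
        # Remove the selected symbol's contribution from the others.
        for i in available:
            potential[i] -= W[i][best] - W[best][i]
    return precedence


W = [[0.0, 1.0, 2.0],
     [3.0, 0.0, 4.0],
     [5.0, 6.0, 0.0]]
precedence = greedy_precedence(W)
# precedence == [0, 1, 2]
```

For this small matrix the result coincides with the true optimum (proxy cost 7), although in general the algorithm only guarantees an approximation.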

5. Evaluation

5.1. Setup

Since the simplification orderings under consideration (LPO and KBO) never use the symbol precedence to compare a predicate symbol with a function symbol, we can break down the symbol precedence into a predicate precedence and a function precedence. In this paper, we restrict our attention to predicate precedences, leaving function symbols to be ordered by the invfreq scheme. A more thorough evaluation of both predicate and function precedences and their interaction is left for future work.

We use problems from the TPTP library v7.2.0 [14] for the evaluation. Let $\mathcal{P}_{\mathrm{train}}$ be the set of all FOL and CNF problems in TPTP with at most 200 predicate symbols such that at least 1 out of 24 random predicate precedences leads to a successful proof search ($|\mathcal{P}_{\mathrm{train}}| = 8217$). Let $\mathcal{P}_{\mathrm{test}}$ be the set of all FOL and CNF problems in TPTP with at most 1024 predicate symbols ($|\mathcal{P}_{\mathrm{test}}| = 15751$). In each of 5 evaluation iterations (splits), we sample 1000 training problems from $\mathcal{P}_{\mathrm{train}}$ and 1000 test problems from $\mathcal{P}_{\mathrm{test}}$ uniformly in a way that ensures that the sets do not overlap. We repeat the evaluation 5 times to evaluate the stability of the training.

On each training problem we run Vampire with 100 uniformly random predicate precedences and a strategy fixed up to the predicate precedence.8 We limit the time to 10 seconds per execution, which is in our experience with Vampire sufficient to exhibit interesting behavior. Note that we use a customized version of Vampire to extract a symbol table from each of the problems.9

8 Time limit: 10 seconds, memory limit: 8192 MB, literal comparison mode: predicate, function symbol precedence: invfreq, saturation algorithm: discount, age-weight ratio: 1:10, AVATAR: disabled.

9 https://github.com/filipbartek/vampire/tree/926154f2

After we fit a preference matrix on each of the training problems (see Section 4.2), we create a batch of $10^6$ symbol pair feature vectors with target values to train the general preference regressor (see Section 4.3). We evaluate the trained model by running Vampire on the test problem set with predicate precedences proposed by the trained model, counting the number of successfully solved problems.

A collection of scripts created for the experimental evaluation can be found in the Git repository at https://github.com/filipbartek/vampire-ml/tree/75c693f3. The measurements presented below can be performed by running the script map-reduce/paar2020/run.sh.

5.2. Experimental results

We trained two types of general preference regressors (see Section 4.3):

• Elastic-Net: a linear regression model with L1 and L2 norm regularization; see ElasticNetCV in Pedregosa et al. [22]
• Gradient Boosting regressor: see GradientBoostingRegressor in Pedregosa et al. [22]

We compared the performance of the regressors with three baseline precedence generation schemes – random precedence, best of 10 random precedences and the invfreq scheme. Table 1 shows the results of evaluation on 1000 problems for 5 random choices of training and test problem set (splits).

The case "Elastic-Net without sample weighting" shows the effect of sampling the symbol pairs uniformly. Inspection of the trained feature coefficients reveals that the fitting ends up with an all-zero feature weight vector on splits 1 and 2, signifying a complete failure to learn on these training sets.

Using Elastic-Net for general preference prediction on average nearly matches the performance of Vampire with the invfreq precedence scheme. While Elastic-Net performs significantly better than a random precedence generator, it still performs significantly worse than a generator that, given a problem, tries 10 random precedences and chooses the best of these. This suggests that there is space for improvement, possibly with a more sophisticated, nonlinear model. Plugging in a Gradient Boosting regressor does not show immediate improvement, so more elaborate feature extraction may be necessary.

5.3. Feature coefficients

Since Elastic-Net is a linear regression model, we can easily inspect the coefficients it assigns to the input features (see Section 4.3.2). In each of the five splits, the final coefficients of the three indicator features (namely the indicators of presence in a conjecture clause, presence in a unit clause and being introduced during clausification) are 0. Table 2 shows the fitted non-zero coefficients of the remaining features. The coefficients were scaled so that their absolute values sum up to 1. Note that scaling the coefficients by a constant does not affect the precedence constructed using the greedy algorithm presented in Section 4.4.2.

It is worth pointing out that the regressor fitted on the whole train and on the training sets 1, 2 and 4 assigns a high preference value to symbol pairs (, ) such that has a higher frequency and unit frequency than . Since unit frequency is positively correlated with frequency, minimizing cost proxy using this fitted regressor is consistent with the invfreq precedence generating scheme (ordering the symbols by frequency in descending order). Similarly, the model fitted on training sets 0 and 3 corresponds to ordering the symbols by arity in ascending order.

6. Conclusion

This paper is, to the best of our knowledge, a first attempt to use machine learning for proposing symbol precedences for an ATP. This appears to be a potentially highly rewarding task with access to an effectively unlimited amount of training data generated on demand. Nevertheless, the journey from evaluating the prover on random precedences to proposing a good precedence when presented with a new problem is not straightforward and several conceptual gaps need to be bridged to connect these two tasks algorithmically.

In this paper, we proposed a connection using the concept of pairwise symbol preferences that, as we have shown, can be learned as the coefficients of a linear regression model for which an order matrix provides the features of a precedence understood as a permutation. In a second stage, in which symbols are abstracted by their features, the preferences themselves become regression targets.

In our initial experiments reported in this paper, the performance of our system does not yet reach that of the human-designed heuristic invfreq. We believe, however, that further improvements are possible by using a more advanced regression model for the second stage and/or by further hyper-parameter tuning (e.g. of the Gradient Boosting model). Ultimately, we expect to gain the most by using a richer set of symbol features, ideally automatically extracted from the problems using graph neural networks [ 23 ].

Acknowledgments

Supported by the ERC Consolidator grant AI4REASON no. 649043 under the EU-H2020 programme, the Czech Science Foundation project 20-06390Y and the Grant Agency of the Czech Technical University in Prague, grant no. SGS20/215/OHK3/3T/37.

[1] Schulz, Cruanes, Vukmirović, Faster, higher, stronger: E 2.3, in: P. Fontaine (Ed.), Automated Deduction - CADE 27, number 11716 in Lecture Notes in Computer Science, Springer, 2019, pp. 495-507. doi:10.1007/978-3-030-29436-6_29.

[2] Weidenbach, Dimova, Fietzke, Kumar, Suda, P. Wischnewski, SPASS version 3.5, in: R. A. Schmidt (Ed.), Automated Deduction - CADE-22, number 5663 in Lecture Notes in Computer Science, 2009, pp. 140-145. doi:10.1007/978-3-642-02959-2_10.

[3] Kovács, Voronkov, First-order theorem proving and Vampire, in: N. Sharygina, H. Veith (Eds.), Computer Aided Verification, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 1-35. doi:10.1007/978-3-642-39799-8_1.

[4] Nieuwenhuis, Rubio, Paramodulation-based theorem proving, in: J. A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning (in 2 volumes), Elsevier and MIT Press, 2001, pp. 371-443. doi:10.1016/b978-044450813-3/50009-6.

[5] Robinson, Wos, Paramodulation and theorem-proving in first-order theories with equality, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967-1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 298-313. doi:10.1007/978-3-642-81955-1_19.

[6] D. E. Knuth, P. B. Bendix, Simple word problems in universal algebras, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967-1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 342-376. doi:10.1007/978-3-642-81955-1_23.

[7] S. N. Kamin, Lévy, Two generalizations of the recursive path ordering, 1980. URL: http://www.cs.tau.ac.il/~nachumd/term/kamin-levy80spo.pdf, unpublished letter to Nachum Dershowitz.

[8] Kovács, Moser, Voronkov, On transfinite Knuth-Bendix orders, in: N. Bjørner, V. Sofronie-Stokkermans (Eds.), Automated Deduction - CADE-23, number 6803 in Lecture Notes in Computer Science, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 384-399. doi:10.1007/978-3-642-22438-6_29.

[9] Bachmair, Ganzinger, Rewrite-based equational theorem proving with selection and simplification, Journal of Logic and Computation 4 (1994) 217-247. doi:10.1093/logcom/4.3.217.

[10] G. S. Tseitin, On the complexity of derivation in propositional calculus, in: J. H. Siekmann, G. Wrightson (Eds.), Automation of Reasoning: 2: Classical Papers on Computational Logic 1967-1970, Springer Berlin Heidelberg, Berlin, Heidelberg, 1983, pp. 466-483. doi:10.1007/978-3-642-81955-1_28.

[11] Reger, Suda, Voronkov, New techniques in clausal form generation, in: C. Benzmüller, G. Sutcliffe, R. Rojas (Eds.), GCAI 2016. 2nd Global Conference on Artificial Intelligence, volume 41 of EPiC Series in Computing, EasyChair, 2016, pp. 11-23. URL: https://easychair.org/publications/paper/XncX. doi:10.29007/dzfz.

[12] Schulz, E 2.4 User Manual, 2019. URL: http://wwwlehre.dhbw-stuttgart.de/~sschulz/WORK/E_DOWNLOAD/V_2.4/eprover.pdf.

[13] Reger, Suda, Measuring progress to predict success: Can a good proof strategy be evolved?, in: AITP 2017, 2017, pp. 20-21. URL: http://aitp-conference.org/2017/aitp17-proceedings.pdf.

[14] Sutcliffe, The TPTP problem library and associated infrastructure. From CNF to TH0, TPTP v6.4.0, Journal of Automated Reasoning 59 (2017) 483-502. doi:10.1007/s10817-017-9407-7.

[15] Nonnengart, Weidenbach, Computing small clause normal forms, in: J. A. Robinson, A. Voronkov (Eds.), Handbook of Automated Reasoning (in 2 volumes), Elsevier and MIT Press, 2001, pp. 335-367. URL: https://doi.org/10.1016/b978-044450813-3/50008-4. doi:10.1016/b978-044450813-3/50008-4.

[16] Riazanov, Voronkov, Limited resource strategy in resolution theorem proving, Journal of Symbolic Computation 36 (2003) 101-115. doi:10.1016/S0747-7171(03)00040-3.

[17] K. E. Iverson, A Programming Language, John Wiley & Sons, Inc., New York, NY, USA, 1962. URL: https://dl.acm.org/doi/book/10.5555/1098666.

[18] Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Springer Series in Statistics, Springer New York Inc., New York, NY, USA, 2009. doi:10.1007/b94608.

[19] Kaliszyk, J. Urban, MizAR 40 for Mizar 40, Journal of Automated Reasoning 55 (2015) 245-256. doi:10.1007/s10817-015-9330-8.

[20] W. W. Cohen, R. E. Schapire, Singer, Learning to order things, Journal of Artificial Intelligence Research 10 (1999) 243-270. doi:10.1613/jair.587. arXiv:1105.5464.

[21] Tibshirani, Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological) 58 (1996) 267-288. URL: http://www.jstor.org/stable/2346178.

[22] Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830. URL: https://scikit-learn.org/.

[23] Wu, Pan, Chen, Long, Zhang, P. S. Yu, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems (2020) 1-21. doi:10.1109/tnnls.2020.2978386.