Dynamic Strategy Priority: Empower the Strong and Abandon the Weak

Michael Rawson and Giles Reger
University of Manchester, Manchester, UK

Abstract. Automated theorem provers are often used in other tools as black-boxes for discharging proof obligations. One example where this has enjoyed a lot of success is within interactive theorem proving. A key usability factor in such settings is the time taken for the provers to complete. Automated theorem provers typically run lists of proof strategies sequentially, which can cause proofs to be found more slowly than necessary if the ordering is suboptimal. We show that it is possible to predict which strategies are likely to succeed while they are running, using an artificial neural network. We also implement a run-time strategy scheduler in the Vampire prover which utilises a trained neural network to improve average proof search time, and hence increases usability as a black-box prover.

1 Introduction

Modern automated theorem provers (e.g. E [30], iProver [12], Vampire [13], CVC4 [7]) for first-order logic rely on portfolio modes [25], which utilise tens to hundreds of distinct strategies. Of these strategies, only a few might solve a hard problem, often rapidly. Typically, a portfolio of strategies has a pre-defined execution order: the prover will process strategies in this order, running each until a winning strategy is found or the prover runs out of time or strategies.

Portfolio modes are important as, in practice, there is no best strategy. Furthermore, it is uncommon that two hard problems are efficiently solved by the same strategy. However, portfolio execution is not without problems: deciding the optimal ordering and time allocation is hard in general [24], and produces overly-rigid, brittle engineering when applied to specific domains, such as those found in the TPTP problem set [32]. Moreover, for any particular problem, some lengthy strategies that are successful on other problems are doomed to failure from the outset (as illustrated in Figure 1), but are left to run unchecked by the prover, wasting time that could be spent on more productive strategies.

This work makes two contributions. We first demonstrate correlation between trends in dynamic properties of proof search and the pending success or failure of a strategy (Sections 4, 5). We then utilise this to implement strategy scheduling, prioritising those strategies most likely to succeed (Section 6). This approach differs from previous work [30,20,16], which attempts to predict successful strategies a priori from static features of the input problem; instead, we tip running strategies for success based on dynamic, run-time features and use this information to make decisions at runtime. Our experiments (Section 7) show that guiding scheduling in this way can significantly speed up portfolio-based approaches. Before introducing these contributions and results we first present necessary background in Sections 2 and 3.

Fig. 1: Several strategy schedules and their effects. (a) Failing strategies might block a successful strategy. (b) Round-robin scheduling can help mitigate this problem. (c) This approach is improved if promising strategies run for longer.

2 Strategies in Vampire

Vampire is a first-order superposition theorem prover. This section reviews its basic structure and components relevant to the rest of this paper.
We will use the word strategy to refer to a set of configuration parameter values that control proof search, and proof attempt to refer to a run of the prover using such a strategy.

2.1 Input and preprocessing

Vampire accepts problems in first-order logic with equality and a pre-defined set of first-order theories (e.g. arithmetic, arrays, datatypes). Vampire typically works with problems in the TPTP format [32] but also accepts problems in SMT-LIB [3]. Problems are parsed and transformed into a set of input clauses used by the saturation algorithm. This process involves clausification and a number of preprocessing steps. These preprocessing steps can alter certain properties of the problem, e.g. its size and distribution across the signature, sometimes significantly: for example, the E.T. system [10] implements a pre-processor for E which selects small sets of axioms from a larger set, for large-theory reasoning. This means that any prediction method that relies solely on the input problem (rather than dynamic characteristics such as post-preprocessing properties) will find it harder to predict the best parameters for proof search.

2.2 Saturation algorithms

Superposition provers such as Vampire use saturation algorithms with redundancy elimination. They work with a search space consisting of a set of clauses and use a collection of generating, simplifying and deleting inferences to explore this space. Generating inferences, such as superposition, extend this search space by adding new clauses obtained by applying inferences to existing clauses. Simplifying inferences, such as demodulation, replace a clause by a simpler one. Deleting inferences, such as subsumption, delete a clause, typically when it becomes redundant (see [2]). Simplifying and deleting inferences must satisfy this redundancy condition to preserve completeness.

The goal is to saturate the clause set with respect to the inference system. If the empty clause is derived then the input clauses are unsatisfiable. If no empty clause is derived and the search space is saturated then the input clauses are guaranteed to be satisfiable only if a complete strategy is used. A strategy is complete if it is guaranteed that all inferences between non-deleted clauses in the search space will be applied. Vampire includes many incomplete strategies as they can be very efficient at finding unsatisfiability.

All saturation algorithms implemented in Vampire belong to the family of given clause algorithms, which achieve completeness via a fair clause selection process that prevents the indefinite skipping of old clauses. These algorithms typically divide clauses into three sets, unprocessed, passive and active, and follow a simple saturation loop:

1. Add non-redundant unprocessed clauses to passive. Redundancy is checked by attempting to forward simplify the new clause using processed clauses.
2. Remove processed (passive and active) clauses made redundant by newly processed clauses, i.e. backward simplify existing clauses using these clauses.
3. Select a given clause from passive, move it to active and perform all generating inferences between the given clause and all other active clauses, adding generated clauses to unprocessed.

Later we will show how iterations of this saturation loop from different proof attempts can be interleaved; a simplified code sketch of the loop is given below.
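To make the control flow of the given-clause loop concrete, the following is a minimal Python sketch of such a saturation loop. The clause representation and the forward_simplify, backward_simplify, select_given and generate helpers are hypothetical placeholders standing in for the prover's real inference machinery; the sketch only mirrors the unprocessed/passive/active bookkeeping described above, not Vampire's actual implementation.

# Hypothetical sketch of a given-clause saturation loop (not Vampire's code).
def saturate(input_clauses, forward_simplify, backward_simplify,
             select_given, generate, is_empty_clause):
    unprocessed = list(input_clauses)
    passive, active = [], []

    while True:
        # 1. Forward-simplify new clauses against processed ones; keep the
        #    non-redundant survivors in passive.
        while unprocessed:
            clause = forward_simplify(unprocessed.pop(), passive + active)
            if clause is None:            # clause shown redundant
                continue
            if is_empty_clause(clause):   # refutation found
                return "unsatisfiable"
            # 2. Backward-simplify: delete processed clauses that the new
            #    clause makes redundant.
            passive = backward_simplify(passive, clause)
            active = backward_simplify(active, clause)
            passive.append(clause)

        if not passive:                   # saturated: nothing left to select
            return "saturated"

        # 3. Select a given clause, activate it, and perform all generating
        #    inferences between it and the other active clauses.
        given = select_given(passive)
        passive.remove(given)
        active.append(given)
        unprocessed.extend(generate(given, active))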
Vampire implements three saturation algorithms:

1. Otter uses both passive and active clauses for simplifications.
2. Limited Resource Strategy (LRS) [28] extends Otter with a heuristic that discards clauses that are unlikely to be used within the current resources, i.e. time and memory. This makes LRS incomplete, but it is generally the most effective algorithm at proving unsatisfiability.
3. Discount uses only active clauses for simplifications.

A recent development in Vampire is AVATAR [37,26], which integrates with the saturation loop to perform clause splitting. The general idea is to use a SAT solver to select a subproblem by naming each clause sub-component with a propositional variable, running a SAT solver on these abstracted clauses, and using the subsequent propositional model to select components to include in proof search. At the end of each iteration of the loop we check whether the underlying subproblem has changed. AVATAR can occasionally make loops run a little longer, but no more than other steps such as backward subsumption. Otherwise, the notion of saturation loop remains the same when using AVATAR.

There are also other proof strategies that fit into the above loop format and can be interleaved with superposition-based proof attempts. For example, instance generation [8] saturates the set of clauses with respect to the instance generation rule, and finite model finding [27] iteratively checks larger model sizes (these loops tend to be much longer than those from other algorithms).

2.3 Strategies in Vampire

Vampire includes more than 50 parameters. By only varying parameters and values used by Vampire at the last CASC competition, we obtain over 500 million strategies. These parameters control

– Preprocessing steps (24 different parameters)
– The saturation algorithm and related behaviour, e.g. clause selection
– Inferences used (16 different kinds with variations)

Even restricting these parameters to a single saturation algorithm and straightforward preprocessing steps, the number of possible strategies is vast. For this reason, Vampire implements a portfolio CASC mode [13] that categorises problems based on syntactic features and attempts a sequence of approximately 50 strategies over a five minute period (this number can vary significantly). These strategies are the result of extensive benchmarking and have been shown, experimentally, to work well on unseen problems, i.e. those not used for training.

The syntactic features used by the current portfolio mode are coarse-grained and include the rough size of the problem (e.g. bucketed into tiny, small, medium, large, huge), the presence of features such as equality or certain theories, and whether the problem is effectively propositional or Horn. Not all combinations of these are considered. Portfolio mode is created by considering a set of training data over a list of strategies and attempting to cover as much of it as possible, splitting the set of problems into smaller groups based on these features until all solutions fit into a given time limit. This process is greedy: it places first the strategy that solves the most problems during training, then the next best strategy after removing the problems solved by the first, and so on.
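As an illustration of this greedy construction, the sketch below builds a schedule from hypothetical training data mapping each strategy to the set of problems it solved. The data structures and stopping condition are simplifications; the real construction also splits problems by syntactic features and fits the chosen strategies into a time budget.

# Hypothetical sketch of greedy schedule construction (feature-based splitting
# and time budgeting omitted).
def greedy_schedule(solved_by):
    """solved_by: dict mapping strategy name -> set of problems it solves."""
    remaining = set().union(*solved_by.values())   # problems still uncovered
    schedule = []
    while remaining:
        # Pick the strategy covering the most still-unsolved problems.
        best = max(solved_by, key=lambda s: len(solved_by[s] & remaining))
        gained = solved_by[best] & remaining
        if not gained:                             # no strategy helps further
            break
        schedule.append(best)
        remaining -= gained
    return schedule

# Toy usage with made-up training results:
example = {
    "strat_a": {"p1", "p2", "p3"},
    "strat_b": {"p3", "p4"},
    "strat_c": {"p5"},
}
print(greedy_schedule(example))   # e.g. ['strat_a', 'strat_b', 'strat_c']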
3 Machine Learning and Theorem Proving

In this section we review the necessary background in machine learning in the context of theorem proving, in particular neural networks. Other technologies exist which are capable of similar performance, but neural networks produced the best combination of convenience and performance for our purposes.

Fig. 2: A neural network with 3 inputs, a single densely-connected hidden layer, and a single output neuron. This is a type of network that might be used for simple binary classification tasks. A version of this architecture with more input and hidden neurons is used for this work.

3.1 Classification with neural networks

Binary classification problems can be heuristically solved by means of supervised learning methods. A binary classification problem is, informally, the statement "given input X, does X belong to class A or class B?", where X is a member of the target domain D, and A, B are disjoint subsets of D such that D = A ∪ B. Supervised learning methods attempt to solve binary classification problems by starting with a neutral, untrained heuristic for classifying X into A or B, then changing their internal state to improve the heuristic by looking at examples of A and B ("training"). Typically, supervised learning methods do not work on X directly, but on a set of "features" derived from X. Engineering features, or choosing which of a set of supplied features to use, is a field of study in itself.

A neural network consists of a collection of artificial "neurons", wired together with connections into a neural network architecture (see Figure 2 for an example). Each neuron n computes a real-valued function f_n, which may be tuned with real-valued parameters: every neuron has a bias b_n and a weight vector w_n. f_n is defined on an input vector x as

f_n(x) = A(w_n · x + b_n)

i.e. a weighted sum of all connected input neurons' activations, plus the bias of the neuron, fed through an activation function A. The choice of activation function depends on both the position of the neuron in the architecture and on the target domain: a common choice is the sigmoid function σ(x), as shown in Figure 3.

Fig. 3: The sigmoid function σ(x) = 1/(1 + e^(−x)). This function "squeezes" the output to the range [0, 1].

Neural network parameters may be trained by a variety of different techniques (randomly, neuro-evolution methods, backpropagation methods, etc.). We used a backpropagation method which, given a training input X and correct output y, tunes the network parameters by propagating an error signal backwards through the network to obtain updated parameters for each neuron, hence reducing the error for that example [29]. Repeating this process for multiple training examples can produce very good results, but frequently only with the correct combination of learning algorithm, starting weights, network architecture, and training examples.
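As a concrete illustration of the neuron computation above, the following minimal NumPy sketch evaluates one hidden layer of sigmoid neurons followed by a single sigmoid output, matching the shape of Figure 2. The hidden-layer width and the random weights are placeholders, not trained values.

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)), squeezing outputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, hidden_w, hidden_b, out_w, out_b):
    # Each neuron computes A(w . x + b); here A is the sigmoid throughout.
    hidden = sigmoid(hidden_w @ x + hidden_b)
    return sigmoid(out_w @ hidden + out_b)

# Toy instance of the Figure 2 shape: 3 inputs, 4 hidden neurons, 1 output.
rng = np.random.default_rng(0)
x = np.array([0.1, 0.7, -0.2])
score = forward(x,
                hidden_w=rng.normal(size=(4, 3)), hidden_b=np.zeros(4),
                out_w=rng.normal(size=(1, 4)), out_b=np.zeros(1))
print(score)   # a value in (0, 1), interpretable as a class confidence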
3.2 Related work

Machine learning has potential for significant effect in this area: despite the computational difficulties experienced by theorem provers, proofs frequently have human (or otherwise regular) structure, which may be exploited by the use of learning methods. The authors are aware of work in the fields of premise selection [9,15,35,38], static strategy selection [4,18,17], and, more recently, direct proof guidance [36,11,19]. However, we are not aware of any previous work in the area of strategy selection at runtime for conventional theorem provers. The closest area of work is that of static strategy selection, where work, with the exception of Bridge et al. [4], focuses on static properties of the input problem rather than dynamic properties of the proof search space. Bridge considered various dynamic features of the search space after 100 steps of the saturation algorithm in the default mode (of the E theorem prover [30]). So far, such applications have typically used classical machine-learning techniques such as naïve Bayesian methods, SVMs [34], and decision trees [23]. However, some newer work has begun to utilise recently-developed deep-neural-network methods [9,38,19].

4 Feature collection

This section discusses which features are extracted from Vampire for prediction and how they are extracted. This information is firstly required to identify a correlation and will then be used to train a predictor.

Modifying Vampire to log execution data (such as memory usage, or more specific metrics such as the number of generated clauses) for different strategies obtained from its primary portfolio mode¹ is straightforward, but some data-collection decisions were made:

– Only numerical data immediately available in the prover was collected, but there is scope here for both non-numeric and derivative data sources, which may provide greater insight into the proof state in future work. Suppose that quantities A and B are measured directly, but are more often discussed in terms of categories C that all A, B pairs fall into. Data could then include C as well as, or instead of, A and B, embedding C into the neural network via a one-hot encoding.
– Data was collected at intervals of a fixed number of internal resolution steps (in the experiments presented in this paper, this value was 10). This may not necessarily correspond to fixed time intervals, as each step may be more or less expensive, depending on the strategy and the problem.
– All available data was collected, even if it emerged to be constant or unhelpful. This allowed an agnostic approach to learning in which the neural network training procedure selected relevant features.

In all, 376 features are recorded, including, for instance, the numbers of active clauses, passive clauses, and subsumptions. The execution traces produced are difficult to work with, however: some are short or non-existent, others are extremely lengthy. The mean execution trace length was 536 (standard deviation: 1208), with the longest trace being 9971 recorded steps long. Some feature values also have an extremely high variance. To deal with these problems, a post-processing step is applied. For each feature, the mean over the entire data set is mapped to 0 and the data are scaled to unit variance. Traces that are too short (fewer than 10 steps) are discarded (the strategy likely did not take very long in any case). The remaining data is then sliced into 10 evenly-sized "buckets", and an average of each bucket is taken to produce 10 values for every trace, and hence a fixed data size.

However, even now these data are not representative of the classification problem desired: these traces show completed runs of Vampire, whereas the classifier will be used to predict the success or otherwise of runs of Vampire that are still in progress. Hence, we take "snapshots" at various stages (in these experiments, at every quarter) of the trace, discarding the rest of the data, then post-process the remaining trace as described above. Conveniently, this also provides 4 times the original number of training examples. An example post-processed trace is shown in Figure 4, and a code sketch of the post-processing step is given below.

¹ CASC mode, a portfolio designed for the CASC competition [33].

Fig. 4: An example trace, displayed as a colourmap (feature number against bucket), after post-processing.
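The post-processing just described can be summarised by the sketch below, which assumes a trace is a NumPy array of shape (steps, features). The bucket count, minimum length and snapshot points mirror the description above, but the function names and exact normalisation details are illustrative rather than taken from our implementation.

import numpy as np

N_BUCKETS = 10
MIN_STEPS = 10

def standardise(traces):
    # Per-feature zero mean and unit variance over the whole data set.
    stacked = np.concatenate(traces, axis=0)       # (total_steps, n_features)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0)
    std[std == 0.0] = 1.0                          # constant features stay 0
    return [(t - mean) / std for t in traces], mean, std

def bucketise(trace, n_buckets=N_BUCKETS):
    # Slice a (steps, features) trace into evenly-sized buckets and average
    # each bucket, giving a fixed-size (n_buckets, features) summary.
    chunks = np.array_split(trace, n_buckets, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

def snapshots(trace, fractions=(0.25, 0.5, 0.75, 1.0)):
    # Truncate the trace at each fraction of its length, discard the rest,
    # and post-process the prefix; prefixes that are too short are skipped.
    out = []
    for f in fractions:
        prefix = trace[: max(1, int(len(trace) * f))]
        if len(prefix) >= MIN_STEPS:
            out.append(bucketise(prefix))
    return out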
5 Predicting successful strategies

Being able to predict which proof search attempts will succeed in time, and which will fail, based solely on information in the execution trace may seem unlikely. However, it is known that the "slowly-growing search space" maxim, which states that strategies which minimise the number of derived clauses over time are more likely to succeed, is an effective heuristic for finding good strategies in saturation-based theorem proving [24]. Since the data we use includes the number of derived clauses, among many other features, it appears more plausible that this approach might work at least as well as the slow-growth heuristic alone.

Engineering a prediction algorithm that attempts to partition traces into "succeeding" and "failing" classes is possible with the use of modern machine-learning techniques. Conveniently, these methods do not usually produce a binary output, but instead some f(X) ∈ [0, 1] which might be seen as the "level of confidence" in success of the trace X. This success score can be used to apply an ordering to executing strategies, allowing "smart" dynamic scheduling of strategies.

In particular, we evaluated (using the Keras [5] backend-agnostic neural network suite and the scikit-learn [22] utilities):

1. A simple neural network with one input for each datum in the trace and a single hidden layer.
2. A convolutional network [14] which performs a 1-dimensional convolutional pass along the time axis for each feature before the hidden layer.
3. A recurrent network [6], feeding the time series into a gated recurrent unit before processing.

Results for this classification task are shown in Figure 5. The consistent level of accuracy achieved is encouraging. Both of the more-advanced classifiers performed better than the simple neural network. However, the simple network was chosen for integration into Vampire for implementation simplicity and for performance reasons: it is "good enough". A sketch of this kind of simple network is given after Figure 5.

Method                  Mean Accuracy   Standard Deviation
Simple neural network   81.5%           2.0%
Convolutional network   82.4%           3.1%
Recurrent network       83.9%           1.9%

Fig. 5: k-fold cross-validation classification accuracy on a balanced dataset of around 10,000 (total) succeeding and failing execution traces. k = 5.
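For concreteness, the following is a minimal Keras sketch of the kind of simple feed-forward classifier evaluated above. The input size assumes the 10-bucket, 376-feature traces of Section 4, flattened into a single vector; the hidden-layer width, optimiser, import path and other hyperparameters are illustrative guesses rather than the exact settings used in our experiments.

from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES, N_BUCKETS = 376, 10

# Simple network: one densely-connected hidden layer, one sigmoid output,
# trained for binary classification of execution traces (succeed vs. fail).
model = keras.Sequential([
    layers.Dense(64, activation="sigmoid",
                 input_shape=(N_FEATURES * N_BUCKETS,)),   # width is a guess
    layers.Dense(1, activation="sigmoid"),    # confidence of success in [0, 1]
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Training on post-processed traces X (shape: n_examples x 3760) with binary
# labels y would then be:
#   model.fit(X, y, epochs=20, validation_split=0.2)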
6 Intelligent scheduling for Vampire

We show that this abstract predictor can be used in a concrete implementation for the Vampire prover. In the modified prover, it is used to run several strategies from Vampire's portfolio in a modified scheduler: strategies self-report their own execution data to the supervisor process and halt for re-evaluation at regular intervals. When a strategy halts, the scheduler then decides whether to re-schedule the strategy for some more time, or to swap it out for a different strategy. The algorithm used is as follows, taking as input a set of strategies to run (a sketch of the scheduling loop follows the list):

1. Initialize an empty "run pool" of processes, with a maximum size equal to the desired number of workers (e.g. CPU cores available). Also initialize an empty priority queue of paused processes.
2. Whenever the pool is under capacity, take the following steps:
   (a) If the best process in the paused queue has a priority greater than the static priority p_static, wake it and move it into the running pool. In tests, a static priority of around 0.5 appeared to work best.
   (b) Otherwise, take a new strategy from the input set and start a new process to run that strategy.
   (c) If the input set is empty, take the best process from the queue regardless of its priority.
3. When a strategy pauses:
   (a) Remove the process from the pool.
   (b) Re-evaluate the process priority using the neural network and the data it provided.
   (c) Insert the process into the queue with the computed priority.
4. When a strategy terminates, check whether it succeeded; if not, remove it from the pool.
5. If the pool, the queue, and the set of input strategies are all depleted, all strategies have failed. Exit.
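The Python sketch below mirrors this scheduling loop with a worker pool and a max-priority queue. Process management, the strategies themselves, and the start, resume and predict_priority hooks (the last standing in for the neural network) are hypothetical stubs, and slices are run synchronously; the sketch is meant only to make the control flow concrete, not to reproduce the C++ implementation inside Vampire.

import heapq

P_STATIC = 0.5   # static priority threshold; around 0.5 worked best in tests

def schedule(strategies, start, resume, predict_priority, n_workers):
    """Hypothetical driver loop. start(strategy) launches a proof attempt and
    returns a process handle; resume(proc) runs it for one slice and returns
    (status, trace) with status in {'success', 'failure', 'paused'}."""
    pending = iter(strategies)
    paused = []   # max-priority queue simulated with negated priorities
    pool = {}     # handle id -> running process

    def launch():
        # 2(a): wake the best paused process if it beats the static threshold.
        if paused and -paused[0][0] > P_STATIC:
            return heapq.heappop(paused)[2]
        # 2(b): otherwise start a fresh strategy from the input set.
        strategy = next(pending, None)
        if strategy is not None:
            return start(strategy)
        # 2(c): input set empty, take the best paused process regardless.
        if paused:
            return heapq.heappop(paused)[2]
        return None

    while True:
        # 1/2: top the pool up to capacity.
        while len(pool) < n_workers:
            proc = launch()
            if proc is None:
                break
            pool[id(proc)] = proc
        if not pool:
            return None            # 5: pool, queue and input set all depleted

        # Run each pooled process for one slice and handle the outcome.
        for key, proc in list(pool.items()):
            status, trace = resume(proc)
            if status == "success":
                return proc        # 4: a strategy succeeded
            del pool[key]
            if status == "paused":
                # 3: re-evaluate priority and queue the paused process.
                heapq.heappush(paused, (-predict_priority(trace), key, proc))
            # 'failure': 4: the process is simply dropped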
To embed the neural network into Vampire, we took the trained network weights from our Python-based experiments and generated a C source file with these weights included as a large array. The neural network's architecture was then re-implemented manually in C++, using the network weights compiled into the new program. This approach had several advantages: while perhaps not as efficient as a framework such as TensorFlow [1] (which may use graphics hardware to accelerate computations), our approach is reasonably fast, low-latency, does not incur any additional dependencies, and does not add significantly to program start-up time.

7 Experimental results

We evaluate whether our prediction results carry through to improved performance in Vampire.

Experimental setup. We take all relevant problems for Vampire from TPTP 6.4.0 (17,281 problems in total) and run three variations of Vampire 4.2.1's CASC portfolio mode:

– baseline is the standard portfolio mode.
– no-prediction uses the dynamic scheduling architecture without prediction. This effectively produces a round-robin scheduling of strategies.
– prediction uses the trained neural network to predict whether a strategy will be successful, as previously described.

The default 3GB memory limit and 300s time limit from the standard portfolio mode were kept, but in addition each variation was run in a multicore setup using 4 cores to distribute strategies over. Experiments were run on the StarExec cluster [31], where each node contains an Intel Xeon 2.4GHz processor.

Results. Figure 6 illustrates the overall results by plotting the number of problems solved against time taken. Table 1 gives some raw numbers. In this table, New refers to the number of problems solved by this variation that were not solved by baseline, Unique refers to the number of problems solved by this variation and not by the other two, and the average solution times are given for wall-clock and CPU time with standard deviation.

Table 1: Results for the three variations.
               Solved   New   Unique   Average time (wall)   Average time (cpu)
baseline       12,899     -      420       3.03s ± 9.91s       10.53s ± 33.60s
no-prediction  11,827    56       33       2.86s ± 8.33s       10.38s ± 32.24s
prediction     12,425   111       88       2.59s ± 9.55s        8.74s ± 31.03s

Table 2 gives further statistics comparing the two scheduling variations with the baseline. For the purposes of this table we only compare the results on problems that both the given variation and baseline solve and where the difference between solution times is greater than 1 second. This second part is important as it allows for small variations in solution times due to natural non-determinism².

² This allowance is often not made but it can heavily skew results. Without it, all variations look almost identical as the cases where variations behave in the same way dominate. This does not indicate that the variations are not an improvement, but there are many cases where there is no difference. The most likely explanation for this is that solutions are quick and no scheduling is required, e.g. if the problem is solved during preprocessing.

Fig. 6: Plot of problems solved against time for the three variations.

Table 2: Amount of speedup or slowdown compared to the baseline solver.
                                 no-prediction        prediction
                                 wall      cpu        wall     cpu
Average Speedup                  1.54      6.90       1.94     6.24
Number of times better           451       788        650      1,297
Number of times worse            1,895     2,518      642      940
Average Speedup (when better)    7.02      28.19      3.47     10.46
Average Slowdown (when worse)    29.02     78.82      18.83    44.52

Speedup is a multiplication factor (a speedup of 2 means a proof was found in half the time), and similarly for slowdown. Better means that the given variation was faster, whereas worse means that the variation was slower.

Discussion. An immediate observation is that the overall number of problems solved is slightly worse for the variations performing scheduling. An immediate explanation for this could be that the strategies in the portfolio mode might still be quite fragile, i.e. their performance might be degraded by the small overhead of context switching and additional memory contention. Further experiments could explore this by using more generic strategies with longer individual time limits (many strategies in CASC mode run for less than 1 second). However, the overall number of problems solved is still high enough to make the result useful.

The new problems solved by the scheduling variations could also be attributed to this non-determinism, suggesting again that more effort should be spent to provide a broader set of complementary strategies to choose from. We note that one explanation for this second effect could have been if the strategy schedule was longer than the time given and prediction moved a strategy into the allowed time. This is a behaviour we might expect for short time limits, but in this experiment all strategies were run.

The average solution times are improved with the two variations, and the variation using prediction achieves the best solution time on average. When looking at the more detailed results of Table 2, we see that in the majority of cases the prediction variant was faster than the baseline, which was not the case when no prediction was applied. Furthermore, the impact of getting it wrong was larger without prediction (i.e. the slowdown is bigger).

8 Conclusions and future work

The aim of these experiments was to improve Vampire's overall performance, not necessarily in the total number of theorems proved, but in the average time taken to prove problems. This approach has been shown to produce a significant increase in speed without an excessive penalty in the number of problems solved.

There are several routes that could be explored in order to further improve performance. As well as improving predictor performance by use of more sophisticated data curation, processing, and machine-learning techniques, it may also be possible to improve the naïve scheduling algorithm. Further research might include designing scheduling algorithms which keep predictions as up-to-date as possible, maximise processor utilisation, minimise memory usage/swapping, reduce context-switching overhead, or even minimise the number of required calls to the prediction algorithm. This form of optimisation for Vampire is relatively novel: historically the aim of the team has been to prove as many theorems as possible, rather than to improve the speed of moderately-hard problem solving.
As an immediate impact, these developments may be useful in improving Vampire's performance in the new SLH division of the CASC competition, as well as improving the overall usability of ITP via quicker "hammer" results, such as those reported via Sledgehammer [21].

References

1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
2. L. Bachmair and H. Ganzinger. Resolution theorem proving. In A. Robinson and A. Voronkov, editors, Handbook of Automated Reasoning, volume I, chapter 2, pages 19–99. Elsevier Science, 2001.
3. Clark Barrett, Cesare Tinelli, et al. The SMT-LIB standard: Version 2.0.
4. James P. Bridge, Sean B. Holden, and Lawrence C. Paulson. Machine learning for first-order theorem proving. Journal of Automated Reasoning, 53(2):141–172, 2014.
5. François Chollet et al. Keras. https://keras.io, 2015.
6. Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.
7. Morgan Deters, Andrew Reynolds, Tim King, Clark W. Barrett, and Cesare Tinelli. A tour of CVC4: how it works, and how to use it. In Formal Methods in Computer-Aided Design, FMCAD 2014, Lausanne, Switzerland, October 21-24, 2014, page 7, 2014.
8. H. Ganzinger and K. Korovin. New directions in instantiation-based theorem proving. In Proc. LICS'03, pages 55–64, 2003.
9. Geoffrey Irving, Christian Szegedy, Alexander A. Alemi, Niklas Een, Francois Chollet, and Josef Urban. DeepMath — deep sequence models for premise selection. In Advances in Neural Information Processing Systems, pages 2235–2243, 2016.
10. Cezary Kaliszyk, Stephan Schulz, Josef Urban, and Jiří Vyskočil. System description: E.T. 0.1. In International Conference on Automated Deduction, pages 389–398. Springer, 2015.
11. Cezary Kaliszyk and Josef Urban. FEMaLeCoP: Fairly efficient machine learning connection prover. In Logic for Programming, Artificial Intelligence, and Reasoning, pages 88–96. Springer, 2015.
12. K. Korovin. iProver – an instantiation-based theorem prover for first-order logic (system description). In A. Armando, P. Baumgartner, and G. Dowek, editors, Proceedings of the 4th International Joint Conference on Automated Reasoning (IJCAR 2008), volume 5195 of Lecture Notes in Computer Science, pages 292–298. Springer, 2008.
13. Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In International Conference on Computer Aided Verification, pages 1–35. Springer, 2013.
14. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
15. Daniel Kühlwein, Jasmin Christian Blanchette, Cezary Kaliszyk, and Josef Urban. MaSh: machine learning for Sledgehammer. In International Conference on Interactive Theorem Proving, pages 35–50. Springer, 2013.
16. Daniel Kühlwein, Stephan Schulz, and Josef Urban. E-MaLeS 1.1. In International Conference on Automated Deduction, pages 407–413. Springer, 2013.
17. Daniel Kühlwein, Stephan Schulz, and Josef Urban. E-MaLeS 1.1. In International Conference on Automated Deduction, pages 407–413. Springer, 2013.
18. Daniel Kühlwein and Josef Urban. MaLeS: A framework for automatic tuning of automated theorem provers. Journal of Automated Reasoning, 55(2):91–116, 2015.
19. Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. arXiv preprint arXiv:1701.06972, 2017.
20. William McCune and Larry Wos. Otter — the CADE-13 competition incarnations. Journal of Automated Reasoning, 18(2):211–220, 1997.
21. Lawrence C. Paulson and Jasmin Christian Blanchette. Three years of experience with Sledgehammer, a practical link between automatic and interactive theorem provers. IWIL-2010, 1, 2010.
22. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
23. J. Ross Quinlan. Learning efficient classification procedures and their application to chess end games. In Machine Learning, Volume I, pages 463–482. Elsevier, 1983.
24. Giles Reger and Martin Suda. Measuring progress to predict success: Can a good proof strategy be evolved? AITP 2017, 2017.
25. Giles Reger, Martin Suda, and Andrei Voronkov. The challenges of evaluating a new feature in Vampire. In Vampire Workshop, pages 70–74, 2014.
26. Giles Reger, Martin Suda, and Andrei Voronkov. Playing with AVATAR. In Amy P. Felty and Aart Middeldorp, editors, Automated Deduction - CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings, pages 399–415, Cham, 2015. Springer International Publishing.
27. Giles Reger, Martin Suda, and Andrei Voronkov. Finding finite models in multi-sorted first-order logic. In Nadia Creignou and Daniel Le Berre, editors, Theory and Applications of Satisfiability Testing – SAT 2016: 19th International Conference, Bordeaux, France, July 5-8, 2016, Proceedings, pages 323–341. Springer International Publishing, 2016.
28. A. Riazanov and A. Voronkov. Limited resource strategy in resolution theorem proving. J. Symb. Comp., 36(1-2):101–115, 2003.
29. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
30. Stephan Schulz. E — a brainiac theorem prover. AI Communications, 15(2,3):111–126, 2002.
31. Aaron Stump, Geoff Sutcliffe, and Cesare Tinelli. StarExec, a cross community logic solving service. https://www.starexec.org, 2012.
32. G. Sutcliffe. The TPTP problem library and associated infrastructure, from CNF to TH0, TPTP v6.4.0. Journal of Automated Reasoning, 59(4):483–502, 2017.
33. Geoff Sutcliffe and Christian Suttner. The state of CASC. AI Communications, 19(1):35–48, 2006.
34. Johan A. K. Suykens and Joos Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
35. Josef Urban. MaLARea: a metasystem for automated reasoning in large theories. ESARLT, 257, 2007.
36. Josef Urban, Jiří Vyskočil, and Petr Štěpánek. MaLeCoP machine learning connection prover. In International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 263–277. Springer, 2011.
37. Andrei Voronkov. AVATAR: The architecture for first-order theorem provers. In Armin Biere and Roderick Bloem, editors, Computer Aided Verification, volume 8559 of Lecture Notes in Computer Science, pages 696–710. Springer International Publishing, 2014.
38. Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding.
In Advances in Neural Information Processing Systems, pages 2783–2793, 2017.